├── README
├── _static
│   ├── image1.png
│   ├── image2.png
│   ├── image3.png
│   ├── image4.png
│   ├── image5.png
│   ├── image6.png
│   ├── image7.png
│   ├── image8.png
│   ├── image9.png
│   └── image10.png
├── _build
│   └── latex
│       └── MultivariateAnalysis.pdf
├── LittleBookofRMultivariateAnalysis
│   └── src
│       └── multivariateanalysis.rst
├── index.rst
├── make.bat
├── conf.py
└── src
    ├── installr.rst
    └── multivariateanalysis.rst
/README:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/_static/image1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image1.png
--------------------------------------------------------------------------------
/_static/image2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image2.png
--------------------------------------------------------------------------------
/_static/image3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image3.png
--------------------------------------------------------------------------------
/_static/image4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image4.png
--------------------------------------------------------------------------------
/_static/image5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image5.png
--------------------------------------------------------------------------------
/_static/image6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image6.png
--------------------------------------------------------------------------------
/_static/image7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image7.png
--------------------------------------------------------------------------------
/_static/image8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image8.png
--------------------------------------------------------------------------------
/_static/image9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image9.png
--------------------------------------------------------------------------------
/_static/image10.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image10.png
--------------------------------------------------------------------------------
/_build/latex/MultivariateAnalysis.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_build/latex/MultivariateAnalysis.pdf
--------------------------------------------------------------------------------
/LittleBookofRMultivariateAnalysis/src/multivariateanalysis.rst:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/LittleBookofRMultivariateAnalysis/src/multivariateanalysis.rst
--------------------------------------------------------------------------------
/index.rst:
--------------------------------------------------------------------------------
1 | Welcome to a Little Book of R for Multivariate Analysis!
2 | ========================================================
3 |
4 | By Avril Coghlan,
5 | Wellcome Trust Sanger Institute, Cambridge, U.K.
6 | Email: alc@sanger.ac.uk
7 |
8 | This is a simple introduction to multivariate analysis using the R statistics software.
9 |
10 | There is a pdf version of this booklet available at:
11 | `https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf
<https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf>`_.
12 |
13 | If you like this booklet, you may also like to check out my booklet on using
14 | R for biomedical statistics,
15 | `http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/
16 | <http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/>`_,
17 | and my booklet on using R for time series analysis,
18 | `http://a-little-book-of-r-for-time-series.readthedocs.org/
19 | <http://a-little-book-of-r-for-time-series.readthedocs.org/>`_.
20 |
21 | Contents:
22 |
23 | .. toctree::
24 | :maxdepth: 3
25 |
26 | src/installr.rst
27 | src/multivariateanalysis.rst
28 |
29 | Acknowledgements
30 | ----------------
31 |
32 | Thank you to Noel O'Boyle for his help in using Sphinx, `http://sphinx.pocoo.org <http://sphinx.pocoo.org>`_, to create
33 | this document, and GitHub, `https://github.com/ <https://github.com/>`_, to store different versions of the document
34 | as I was writing it, and ReadTheDocs, `http://readthedocs.org/ <http://readthedocs.org/>`_, to build and distribute
35 | this document.
36 |
37 | Contact
38 | -------
39 |
40 | I would be very grateful if you would send me (Avril Coghlan) corrections or suggestions for improvements to
41 | my email address, alc@sanger.ac.uk.
42 |
43 | License
44 | -------
45 |
46 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License
47 | <http://creativecommons.org/licenses/by/3.0/>`_.
48 |
49 |
--------------------------------------------------------------------------------
/make.bat:
--------------------------------------------------------------------------------
1 | @ECHO OFF
2 |
3 | REM Command file for Sphinx documentation
4 |
5 | set SPHINXBUILD=C:\Python26\Scripts\sphinx-build
6 | set BUILDDIR=_build
7 | set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% .
8 | if NOT "%PAPER%" == "" (
9 | set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%
10 | )
11 |
12 | if "%1" == "" goto help
13 |
14 | if "%1" == "help" (
15 | :help
16 | 	echo.Please use `make ^<target^>` where ^<target^> is one of
17 | echo. html to make standalone HTML files
18 | echo. dirhtml to make HTML files named index.html in directories
19 | echo. pickle to make pickle files
20 | echo. json to make JSON files
21 | echo. htmlhelp to make HTML files and a HTML help project
22 | echo. qthelp to make HTML files and a qthelp project
23 | echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter
24 | echo. changes to make an overview over all changed/added/deprecated items
25 | echo. linkcheck to check all external links for integrity
26 | echo. doctest to run all doctests embedded in the documentation if enabled
27 | goto end
28 | )
29 |
30 | if "%1" == "clean" (
31 | for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i
32 | del /q /s %BUILDDIR%\*
33 | goto end
34 | )
35 |
36 | if "%1" == "html" (
37 | %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html
38 | echo.
39 | echo.Build finished. The HTML pages are in %BUILDDIR%/html.
40 | goto end
41 | )
42 |
43 | if "%1" == "dirhtml" (
44 | %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml
45 | echo.
46 | echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.
47 | goto end
48 | )
49 |
50 | if "%1" == "pickle" (
51 | %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle
52 | echo.
53 | echo.Build finished; now you can process the pickle files.
54 | goto end
55 | )
56 |
57 | if "%1" == "json" (
58 | %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json
59 | echo.
60 | echo.Build finished; now you can process the JSON files.
61 | goto end
62 | )
63 |
64 | if "%1" == "htmlhelp" (
65 | %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp
66 | echo.
67 | echo.Build finished; now you can run HTML Help Workshop with the ^
68 | .hhp project file in %BUILDDIR%/htmlhelp.
69 | goto end
70 | )
71 |
72 | if "%1" == "qthelp" (
73 | %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp
74 | echo.
75 | echo.Build finished; now you can run "qcollectiongenerator" with the ^
76 | .qhcp project file in %BUILDDIR%/qthelp, like this:
77 | echo.^> qcollectiongenerator %BUILDDIR%\qthelp\sampledoc.qhcp
78 | echo.To view the help file:
79 | 	echo.^> assistant -collectionFile %BUILDDIR%\qthelp\sampledoc.qhc
80 | goto end
81 | )
82 |
83 | if "%1" == "latex" (
84 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex
85 | echo.
86 | echo.Build finished; the LaTeX files are in %BUILDDIR%/latex.
87 | goto end
88 | )
89 |
90 | if "%1" == "changes" (
91 | %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes
92 | echo.
93 | echo.The overview file is in %BUILDDIR%/changes.
94 | goto end
95 | )
96 |
97 | if "%1" == "linkcheck" (
98 | %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck
99 | echo.
100 | echo.Link check complete; look for any errors in the above output ^
101 | or in %BUILDDIR%/linkcheck/output.txt.
102 | goto end
103 | )
104 |
105 | if "%1" == "doctest" (
106 | %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest
107 | echo.
108 | echo.Testing of doctests in the sources finished, look at the ^
109 | results in %BUILDDIR%/doctest/output.txt.
110 | goto end
111 | )
112 |
113 | if "%1" == "pdf" (
114 | %SPHINXBUILD% -b pdf %ALLSPHINXOPTS% %BUILDDIR%/pdf
115 | echo.
116 | echo.Build finished. The PDF files are in _build/pdf.
117 | goto end
118 | )
119 |
120 | :end
121 |
--------------------------------------------------------------------------------
/conf.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #
3 | # sampledoc documentation build configuration file, created by
4 | # sphinx-quickstart on Sat Jan 09 18:21:28 2010.
5 | #
6 | # This file is execfile()d with the current directory set to its containing dir.
7 | #
8 | # Note that not all possible configuration values are present in this
9 | # autogenerated file.
10 | #
11 | # All configuration values have a default; values that are commented out
12 | # serve to show the default.
13 |
14 | import sys, os
15 |
16 | # If extensions (or modules to document with autodoc) are in another directory,
17 | # add these directories to sys.path here. If the directory is relative to the
18 | # documentation root, use os.path.abspath to make it absolute, like shown here.
19 | #sys.path.append(os.path.abspath('.'))
20 |
21 | # -- General configuration -----------------------------------------------------
22 |
23 | # Add any Sphinx extension module names here, as strings. They can be extensions
24 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
25 | # (Noel) adding rst2pdf
26 | extensions = ['sphinx.ext.autodoc'] # ,'rst2pdf.pdfbuilder']
27 |
28 | # Add any paths that contain templates here, relative to this directory.
29 | templates_path = ['_templates']
30 |
31 | # The suffix of source filenames.
32 | source_suffix = '.rst'
33 |
34 | # The encoding of source files.
35 | #source_encoding = 'utf-8'
36 |
37 | # The master toctree document.
38 | master_doc = 'index'
39 |
40 | # General information about the project.
41 | project = u'Multivariate Analysis'
42 | copyright = u'2010, Avril Coghlan'
43 |
44 | # The version info for the project you're documenting, acts as replacement for
45 | # |version| and |release|, also used in various other places throughout the
46 | # built documents.
47 | #
48 | # The short X.Y version.
49 | version = '0.1'
50 | # The full version, including alpha/beta/rc tags.
51 | release = '0.1'
52 |
53 | # The language for content autogenerated by Sphinx. Refer to documentation
54 | # for a list of supported languages.
55 | #language = None
56 |
57 | # There are two options for replacing |today|: either, you set today to some
58 | # non-false value, then it is used:
59 | #today = ''
60 | # Else, today_fmt is used as the format for a strftime call.
61 | #today_fmt = '%B %d, %Y'
62 |
63 | # List of documents that shouldn't be included in the build.
64 | #unused_docs = []
65 |
66 | # List of directories, relative to source directory, that shouldn't be searched
67 | # for source files.
68 | exclude_trees = ['_build']
69 |
70 | # The reST default role (used for this markup: `text`) to use for all documents.
71 | #default_role = None
72 |
73 | # If true, '()' will be appended to :func: etc. cross-reference text.
74 | #add_function_parentheses = True
75 |
76 | # If true, the current module name will be prepended to all description
77 | # unit titles (such as .. function::).
78 | #add_module_names = True
79 |
80 | # If true, sectionauthor and moduleauthor directives will be shown in the
81 | # output. They are ignored by default.
82 | #show_authors = False
83 |
84 | # The name of the Pygments (syntax highlighting) style to use.
85 | pygments_style = 'sphinx'
86 |
87 | # A list of ignored prefixes for module index sorting.
88 | #modindex_common_prefix = []
89 |
90 |
91 | # -- Options for HTML output ---------------------------------------------------
92 |
93 | # The theme to use for HTML and HTML Help pages. Major themes that come with
94 | # Sphinx are currently 'default' and 'sphinxdoc'.
95 | # html_theme = 'default'
96 | html_theme = 'sphinxdoc'
97 |
98 | # Theme options are theme-specific and customize the look and feel of a theme
99 | # further. For a list of options available for each theme, see the
100 | # documentation.
101 | #html_theme_options = {}
102 |
103 | # Add any paths that contain custom themes here, relative to this directory.
104 | #html_theme_path = []
105 |
106 | # The name for this set of Sphinx documents. If None, it defaults to
107 | # "<project> v<release> documentation".
108 | #html_title = None
109 |
110 | # A shorter title for the navigation bar. Default is the same as html_title.
111 | #html_short_title = None
112 |
113 | # The name of an image file (relative to this directory) to place at the top
114 | # of the sidebar.
115 | #html_logo = None
116 |
117 | # The name of an image file (within the static path) to use as favicon of the
118 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
119 | # pixels large.
120 | #html_favicon = None
121 |
122 | # Add any paths that contain custom static files (such as style sheets) here,
123 | # relative to this directory. They are copied after the builtin static files,
124 | # so a file named "default.css" will overwrite the builtin "default.css".
125 | html_static_path = ['_static']
126 |
127 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
128 | # using the given strftime format.
129 | #html_last_updated_fmt = '%b %d, %Y'
130 |
131 | # If true, SmartyPants will be used to convert quotes and dashes to
132 | # typographically correct entities.
133 | #html_use_smartypants = True
134 |
135 | # Custom sidebar templates, maps document names to template names.
136 | #html_sidebars = {}
137 |
138 | # Additional templates that should be rendered to pages, maps page names to
139 | # template names.
140 | #html_additional_pages = {}
141 |
142 | # If false, no module index is generated.
143 | #html_use_modindex = True
144 |
145 | # If false, no index is generated.
146 | #html_use_index = True
147 |
148 | # If true, the index is split into individual pages for each letter.
149 | #html_split_index = False
150 |
151 | # If true, links to the reST sources are added to the pages.
152 | #html_show_sourcelink = True
153 |
154 | # If true, an OpenSearch description file will be output, and all pages will
155 | # contain a <link> tag referring to it. The value of this option must be the
156 | # base URL from which the finished HTML is served.
157 | #html_use_opensearch = ''
158 |
159 | # If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml").
160 | #html_file_suffix = ''
161 |
162 | # Output file base name for HTML help builder.
163 | htmlhelp_basename = 'sampledocdoc'
164 |
165 |
166 | # -- Options for LaTeX output --------------------------------------------------
167 |
168 | # The paper size ('letter' or 'a4').
169 | latex_paper_size = 'a4'
170 |
171 | # The font size ('10pt', '11pt' or '12pt').
172 | #latex_font_size = '10pt'
173 |
174 | # Grouping the document tree into LaTeX files. List of tuples
175 | # (source start file, target name, title, author, documentclass [howto/manual]).
176 | latex_documents = [
177 | ('index', 'MultivariateAnalysis.tex', u'A Little Book of R For Multivariate Analysis',
178 | u'Avril Coghlan', 'manual'),
179 | ]
180 |
181 | # The name of an image file (relative to this directory) to place at the top of
182 | # the title page.
183 | #latex_logo = None
184 |
185 | # For "manual" documents, if this is true, then toplevel headings are parts,
186 | # not chapters.
187 | #latex_use_parts = False
188 |
189 | # Additional stuff for the LaTeX preamble.
190 | #latex_preamble = ''
191 |
192 | # Documents to append as an appendix to all manuals.
193 | #latex_appendices = []
194 |
195 | # If false, no module index is generated.
196 | #latex_use_modindex = True
197 |
198 | # (Noel) The following is all from rst2pdf
199 | # -- Options for PDF output --------------------------------------------------
200 |
201 | # Grouping the document tree into PDF files. List of tuples
202 | # (source start file, target name, title, author, options).
203 | #
204 | # If there is more than one author, separate them with \\.
205 | # For example: r'Guido van Rossum\\Fred L. Drake, Jr., editor'
206 | #
207 | # The options element is a dictionary that lets you override
208 | # this config per-document.
209 | # For example,
210 | # ('index', u'MyProject', u'My Project', u'Author Name',
211 | # dict(pdf_compressed = True))
212 | # would mean that specific document would be compressed
213 | # regardless of the global pdf_compressed setting.
214 |
215 | pdf_documents = [
216 | ('index', u'MyProject', u'My Project', u'Author Name'),
217 | ]
218 |
219 | # A comma-separated list of custom stylesheets. Example:
220 | pdf_stylesheets = ['sphinx','kerning','a4']
221 |
222 | # Create a compressed PDF
223 | # Use True/False or 1/0
224 | # Example: compressed=True
225 | #pdf_compressed = False
226 |
227 | # A colon-separated list of folders to search for fonts. Example:
228 | # pdf_font_path = ['/usr/share/fonts', '/usr/share/texmf-dist/fonts/']
229 |
230 | # Language to be used for hyphenation support
231 | #pdf_language = "en_US"
232 |
233 | # Mode for literal blocks wider than the frame. Can be
234 | # overflow, shrink or truncate
235 | #pdf_fit_mode = "shrink"
236 |
237 | # Section level that forces a break page.
238 | # For example: 1 means top-level sections start in a new page
239 | # 0 means disabled
240 | #pdf_break_level = 0
241 |
242 | # When a section starts in a new page, force it to be 'even', 'odd',
243 | # or just use 'any'
244 | #pdf_breakside = 'any'
245 |
246 | # Insert footnotes where they are defined instead of
247 | # at the end.
248 | #pdf_inline_footnotes = True
249 |
250 | # verbosity level. 0 1 or 2
251 | #pdf_verbosity = 0
252 |
253 | # If false, no index is generated.
254 | #pdf_use_index = True
255 |
256 | # If false, no modindex is generated.
257 | #pdf_use_modindex = True
258 |
259 | # If false, no coverpage is generated.
260 | #pdf_use_coverpage = True
261 |
262 | # Documents to append as an appendix to all manuals.
263 | #pdf_appendices = []
264 |
265 | # Enable experimental feature to split table cells. Use it
266 | # if you get "DelayedTable too big" errors
267 | #pdf_splittables = False
268 |
269 | # Set the default DPI for images
270 | #pdf_default_dpi = 72
271 |
272 | # Enable rst2pdf extension modules (default is empty list)
273 | # you need vectorpdf for better sphinx's graphviz support
274 | #pdf_extensions = ['vectorpdf']
275 |
276 |
277 |
--------------------------------------------------------------------------------
/src/installr.rst:
--------------------------------------------------------------------------------
1 | How to install R
2 | ================
3 |
4 | Introduction to R
5 | -----------------
6 |
7 | This little booklet has some information on how to use R for multivariate analysis.
8 |
9 | R (`www.r-project.org <http://www.r-project.org>`_) is a commonly used
10 | free statistics software package. R allows you to carry out statistical
11 | analyses in an interactive mode, as well as allowing simple programming.
12 |
13 | Installing R
14 | ------------
15 |
16 | To use R, you first need to install the R program on your computer.
17 |
18 | How to check if R is installed on a Windows PC
19 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
20 |
21 | Before you install R on your computer, the first thing to do is to check whether
22 | R is already installed on your computer (for example, by a previous user).
23 |
24 | These instructions will focus on installing R on a Windows PC. However, I will also
25 | briefly mention how to install R on a Macintosh or Linux computer (see below).
26 |
27 | If you are using a Windows PC, there are two ways you can check whether R is
28 | already installed on your computer:
29 |
30 | 1. Check if there is an "R" icon on the desktop of the computer that you are using.
31 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 2 instead.
32 | 2. Click on the "Start" menu at the bottom left of your Windows desktop, and then move your
33 | mouse over "All Programs" in the menu that pops up. See if "R" appears in the list
34 | of programs that pops up. If it does, it means that R is already installed on your
35 | computer, and you can start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R,
36 | eg. R 2.10.0) from the list.
37 |
38 | If either (1) or (2) above succeeds in starting R, it means that R is already installed
39 | on the computer that you are using. (If neither succeeds, R is not installed yet.)
40 | If there is an old version of R installed on the Windows PC that you are using,
41 | it is worth installing the latest version of R, to make sure that you have all the
42 | latest R functions available to you to use.
43 |
44 | Finding out what is the latest version of R
45 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
46 |
47 | To find out what the latest version of R is, you can look at the CRAN (Comprehensive
48 | R Archive Network) website, `http://cran.r-project.org/ <http://cran.r-project.org/>`_.
49 |
50 | Beside "The latest release" (about halfway down the page), it will say something like
51 | "R-X.X.X.tar.gz" (eg. "R-2.12.1.tar.gz"). This means that the latest release of R is X.X.X (for
52 | example, 2.12.1).
53 |
54 | New releases of R are made very regularly (approximately once a month), as R is actively being
55 | improved all the time. It is worthwhile installing new versions of R regularly, to make sure
56 | that you have a recent version of R (to ensure compatibility with all the latest versions of
57 | the R packages that you have downloaded).
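If you are not sure which version of R is currently installed on your computer, one quick way to check (a small addition to the instructions above) is to start R and type the following into the R console:

.. highlight:: r

::

    > R.version.string

This prints out the version of R that you are running, which you can then compare to the latest release listed on CRAN.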
58 |
59 | Installing R on a Windows PC
60 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
61 |
62 | To install R on your Windows computer, follow these steps:
63 |
64 | 1. Go to `http://ftp.heanet.ie/mirrors/cran.r-project.org <http://ftp.heanet.ie/mirrors/cran.r-project.org>`_.
65 | 2. Under "Download and Install R", click on the "Windows" link.
66 | 3. Under "Subdirectories", click on the "base" link.
67 | 4. On the next page, you should see a link saying something like "Download R 2.10.1 for Windows" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.11.1).
68 | Click on this link.
69 | 5. You may be asked if you want to save or run a file "R-2.10.1-win32.exe". Choose "Save" and
70 | save the file on the Desktop. Then double-click on the icon for the file to run it.
71 | 6. You will be asked what language to install it in - choose English.
72 | 7. The R Setup Wizard will appear in a window. Click "Next" at the bottom of the R Setup wizard
73 | window.
74 | 8. The next page says "Information" at the top. Click "Next" again.
75 | 9. The next page says "Information" at the top. Click "Next" again.
76 | 10. The next page says "Select Destination Location" at the top.
77 |     By default, it will suggest installing R in "C:\\Program Files" on your computer.
78 | 11. Click "Next" at the bottom of the R Setup wizard window.
79 | 12. The next page says "Select components" at the top. Click "Next" again.
80 | 13. The next page says "Startup options" at the top. Click "Next" again.
81 | 14. The next page says "Select start menu folder" at the top. Click "Next" again.
82 | 15. The next page says "Select additional tasks" at the top. Click "Next" again.
83 | 16. R should now be installed. This will take about a minute. When the installation has finished, you will
84 |     see "Completing the R for Windows Setup Wizard" appear. Click "Finish".
85 | 17. To start R, you can follow either step 18 or 19:
86 | 18. Check if there is an "R" icon on the desktop of the computer that you are using.
87 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 19 instead.
88 | 19. Click on the "Start" button at the bottom left of your computer screen, and then
89 | choose "All programs", and start R by selecting "R" (or R X.X.X, where
90 | X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs.
91 | 20. The R console (a rectangle) should pop up:
92 |
93 | |image3|
94 |
95 | How to install R on non-Windows computers (eg. Macintosh or Linux computers)
96 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
97 |
98 | The instructions above are for installing R on a Windows PC. If you want to install R
99 | on a computer that has a non-Windows operating system (for example, a Macintosh or a computer running Linux),
100 | you should download the appropriate R installer for that operating system at
101 | `http://ftp.heanet.ie/mirrors/cran.r-project.org
102 | <http://ftp.heanet.ie/mirrors/cran.r-project.org>`_ and
103 | follow the R installation instructions for the appropriate operating system at
104 | `http://ftp.heanet.ie/mirrors/cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-R-be-installed_003f
105 | <http://ftp.heanet.ie/mirrors/cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-R-be-installed_003f>`_.
106 |
107 | Installing R packages
108 | ---------------------
109 |
110 | R comes with some standard packages that are installed when you install R. However, in this
111 | booklet I will also tell you how to use some additional R packages that are useful, for example,
112 | the "rmeta" package. These additional packages do not come with the standard installation of R,
113 | so you need to install them yourself.
114 |
115 | How to install an R package
116 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
117 |
118 | Once you have installed R on a Windows computer (following the steps above), you can install
119 | an additional package by following the steps below:
120 |
121 | 1. To start R, follow either step 2 or 3:
122 | 2. Check if there is an "R" icon on the desktop of the computer that you are using.
123 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 3 instead.
124 | 3. Click on the "Start" button at the bottom left of your computer screen, and then
125 | choose "All programs", and start R by selecting "R" (or R X.X.X, where
126 | X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs.
127 | 4. The R console (a rectangle) should pop up.
128 | 5. Once you have started R, you can now install an R package (eg. the "rmeta" package) by
129 | choosing "Install package(s)" from the "Packages" menu at the top of the R console.
130 |    This will ask you what website you want to download the package from; you should choose
131 | "Ireland" (or another country, if you prefer). It will also bring up a list of available
132 | packages that you can install, and you should choose the package that you want to install
133 | from that list (eg. "rmeta").
134 | 6. This will install the "rmeta" package.
135 | 7. The "rmeta" package is now installed. Whenever you want to use the "rmeta" package after this,
136 | after starting R, you first have to load the package by typing into the R console:
137 |
138 | .. highlight:: r
139 |
140 | ::
141 |
142 | > library("rmeta")
143 |
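As an alternative to using the "Packages" menu, you can also install a package by typing a command directly into the R console (this is a standard R command, mentioned here as an extra tip rather than as part of the original steps):

.. highlight:: r

::

    > install.packages("rmeta")

R will still ask you to choose a website (a CRAN mirror) to download the package from.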
144 | Note that there are some additional R packages for bioinformatics that are part of a special
145 | set of R packages called Bioconductor (`www.bioconductor.org <http://www.bioconductor.org/>`_),
146 | such as the "yeastExpData" R package, the "Biostrings" R package, etc.
147 | These Bioconductor packages need to be installed using a different, Bioconductor-specific procedure
148 | (see `How to install a Bioconductor R package`_ below).
149 |
150 | How to install a Bioconductor R package
151 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
152 |
153 | The procedure above can be used to install the majority of R packages. However, the
154 | Bioconductor set of bioinformatics R packages need to be installed by a special procedure.
155 | Bioconductor (`www.bioconductor.org <http://www.bioconductor.org/>`_)
156 | is a group of R packages that have been developed for bioinformatics. This includes
157 | R packages such as "yeastExpData", "Biostrings", etc.
158 |
159 | To install the Bioconductor packages, follow these steps:
160 |
161 | 1. To start R, follow either step 2 or 3:
162 | 2. Check if there is an "R" icon on the desktop of the computer that you are using.
163 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 3 instead.
164 | 3. Click on the "Start" button at the bottom left of your computer screen, and then choose "All programs", and start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs.
165 | 4. The R console (a rectangle) should pop up.
166 | 5. Once you have started R, type the following into the R console:
167 |
168 | .. highlight:: r
169 |
170 | ::
171 |
172 | > source("http://bioconductor.org/biocLite.R")
173 | > biocLite()
174 |
175 | 6. This will install a core set of Bioconductor packages ("affy", "affydata", "affyPLM",
176 | "annaffy", "annotate", "Biobase", "Biostrings", "DynDoc", "gcrma", "genefilter",
177 | "geneplotter", "hgu95av2.db", "limma", "marray", "matchprobes", "multtest", "ROC",
178 | "vsn", "xtable", "affyQCReport").
179 | This takes a few minutes (eg. 10 minutes).
180 | 7. At a later date, you may wish to install some extra Bioconductor packages that do not belong
181 | to the core set of Bioconductor packages. For example, to install the Bioconductor package called
182 | "yeastExpData", start R and type in the R console:
183 |
184 | .. highlight:: r
185 |
186 | ::
187 |
188 | > source("http://bioconductor.org/biocLite.R")
189 | > biocLite("yeastExpData")
190 |
191 | 8. Whenever you want to use a package after installing it, you need to load it into R by typing:
192 |
193 | .. highlight:: r
194 |
195 | ::
196 |
197 | > library("yeastExpData")
198 |
199 | Running R
200 | -----------
201 |
202 | To use R, you first need to start the R program on your computer.
203 | You should have already installed R on your computer (see above).
204 |
205 | To start R, you can follow either step 1 or 2:
206 |
207 | 1. Check if there is an "R" icon on the desktop of the computer that you are using.
207 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 2 instead.
208 | 2. Click on the "Start" button at the bottom left of your computer screen, and then choose "All programs", and start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs.
209 |
210 | This should bring up a new window, which is the *R console*.
211 |
212 | A brief introduction to R
213 | -------------------------
214 |
215 | You will type R commands into the R console in order to carry out
216 | analyses in R. In the R console you will see:
217 |
218 | .. highlight:: r
219 |
220 | ::
221 |
222 | >
223 |
224 | This is the R prompt. We type the commands needed for a particular
225 | task after this prompt. The command is carried out after you hit
226 | the Return key.
227 |
228 | Once you have started R, you can start typing in commands, and the
229 | results will be calculated immediately, for example:
230 |
231 | .. highlight:: r
232 |
233 | ::
234 |
235 | > 2*3
236 | [1] 6
237 | > 10-3
238 | [1] 7
239 |
240 | All variables (scalars, vectors, matrices, etc.) created by R are
241 | called *objects*. In R, we assign values to variables using an
242 | arrow. For example, we can assign the value 2\*3 to the variable
243 | *x* using the command:
244 |
245 | .. highlight:: r
246 |
247 | ::
248 |
249 | > x <- 2*3
250 |
251 | To view the contents of any R object, just type its name, and the
252 | contents of that R object will be displayed:
253 |
254 | .. highlight:: r
255 |
256 | ::
257 |
258 | > x
259 | [1] 6
260 |
261 | There are several possible different types of objects in R,
262 | including scalars, vectors, matrices, arrays, data frames, tables,
263 | and lists. The scalar variable *x* above is one example of an R
264 | object. While a scalar variable such as *x* has just one element, a
265 | vector consists of several elements. The elements in a vector are
266 | all of the same type (eg. numeric or characters), while lists may
267 | include elements such as characters as well as numeric quantities.
268 |
269 | To create a vector, we can use the c() (combine) function. For
270 | example, to create a vector called *myvector* that has elements
271 | with values 8, 6, 9, 10, and 5, we type:
272 |
273 | .. highlight:: r
274 |
275 | ::
276 |
277 | > myvector <- c(8, 6, 9, 10, 5)
278 |
279 | To see the contents of the variable *myvector*, we can just type
280 | its name:
281 |
282 | .. highlight:: r
283 |
284 | ::
285 |
286 | > myvector
287 | [1] 8 6 9 10 5
288 |
289 | The [1] is the index of the first element in the vector. We can
290 | extract any element of the vector by typing the vector name with
291 | the index of that element given in square brackets. For example, to
292 | get the value of the 4th element in the vector *myvector*, we
293 | type:
294 |
295 | .. highlight:: r
296 |
297 | ::
298 |
299 | > myvector[4]
300 | [1] 10
301 |
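We can also extract more than one element at a time, by giving a range
of indices in the square brackets. For example, to get the values of
the 2nd to 4th elements of *myvector*, we type:

.. highlight:: r

::

    > myvector[2:4]
    [1] 6 9 10
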
302 | In contrast to a vector, a list can contain elements of different
303 | types, for example, both numeric and character elements. A list can
304 | also include other variables such as a vector. The list() function
305 | is used to create a list. For example, we could create a list
306 | *mylist* by typing:
307 |
308 | .. highlight:: r
309 |
310 | ::
311 |
312 | > mylist <- list(name="Fred", wife="Mary", myvector)
313 |
314 | We can then print out the contents of the list *mylist* by typing
315 | its name:
316 |
317 | .. highlight:: r
318 |
319 | ::
320 |
321 | > mylist
322 | $name
323 | [1] "Fred"
324 |
325 | $wife
326 | [1] "Mary"
327 |
328 | [[3]]
329 | [1] 8 6 9 10 5
330 |
331 | The elements in a list are numbered, and can be referred to using
332 | indices. We can extract an element of a list by typing the list
333 | name with the index of the element given in double square brackets
334 | (in contrast to a vector, where we only use single square
335 | brackets). Thus, we can extract the second and third elements from
336 | *mylist* by typing:
337 |
338 | .. highlight:: r
339 |
340 | ::
341 |
342 | > mylist[[2]]
343 | [1] "Mary"
344 | > mylist[[3]]
345 | [1] 8 6 9 10 5
346 |
347 | Elements of lists may also be named, and in this case the elements
348 | may be referred to by giving the list name, followed by "$",
349 | followed by the element name. For example, *mylist$name* is the
350 | same as *mylist[[1]]* and *mylist$wife* is the same as
351 | *mylist[[2]]*:
352 |
353 | .. highlight:: r
354 |
355 | ::
356 |
357 | > mylist$wife
358 | [1] "Mary"
359 |
360 | We can find out the names of the named elements in a list by using
361 | the attributes() function, for example:
362 |
363 | .. highlight:: r
364 |
365 | ::
366 |
367 | > attributes(mylist)
368 | $names
369 | [1] "name" "wife" ""
370 |
371 | When you use the attributes() function to find the named elements
372 | of a list variable, the named elements are always listed under a
373 | heading "$names". Therefore, we see that the named elements of the
374 | list variable *mylist* are called "name" and "wife", and we can
375 | retrieve their values by typing *mylist$name* and *mylist$wife*,
376 | respectively.
377 |
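A more direct way to get just the names of the elements of a list is to
use the names() function (unnamed elements are shown as empty strings):

.. highlight:: r

::

    > names(mylist)
    [1] "name" "wife" ""
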
378 | Another type of object that you will encounter in R is a *table*
379 | variable. For example, if we made a vector variable *mynames*
380 | containing the names of children in a class, we can use the table()
381 | function to produce a table variable that contains the number of
382 | children with each possible name:
383 |
384 | .. highlight:: r
385 |
386 | ::
387 |
388 | > mynames <- c("Mary", "John", "Ann", "Sinead", "Joe", "Mary", "Jim", "John", "Simon")
389 | > table(mynames)
390 | mynames
391 | Ann Jim Joe John Mary Simon Sinead
392 | 1 1 1 2 2 1 1
393 |
394 | We can store the table variable produced by the function table(),
395 | and call the stored table "mytable", by typing:
396 |
397 | .. highlight:: r
398 |
399 | ::
400 |
401 | > mytable <- table(mynames)
402 |
403 | To access elements in a table variable, you need to use double
404 | square brackets, just like accessing elements in a list. For
405 | example, to access the fourth element in the table *mytable* (the
406 | number of children called "John"), we type:
407 |
408 | .. highlight:: r
409 |
410 | ::
411 |
412 | > mytable[[4]]
413 | [1] 2
414 |
415 | Alternatively, you can use the name of the fourth element in
416 | the table ("John") to find the value of that table element:
417 |
418 | .. highlight:: r
419 |
420 | ::
421 |
422 | > mytable[["John"]]
423 | [1] 2
424 |
425 | Functions in R usually require *arguments*, which are input
426 | variables (ie. objects) that are passed to them, and on which they then
427 | carry out some operation. For example, the log10() function is
428 | passed a number, and it then calculates the log to the base 10 of
429 | that number:
430 |
431 | .. highlight:: r
432 |
433 | ::
434 |
435 | > log10(100)
436 | [1] 2
437 |
438 | In R, you can get help about a particular function by using the
439 | help() function. For example, if you want help about the log10()
440 | function, you can type:
441 |
442 | .. highlight:: r
443 |
444 | ::
445 |
446 | > help("log10")
447 |
448 | When you use the help() function, a box or webpage will pop up with
449 | information about the function that you asked for help with.
450 |
451 | If you are not sure of the name of a function, but think you know
452 | part of its name, you can search for the function name using the
453 | help.search() and RSiteSearch() functions. The help.search() function
454 | searches to see if you already have a function installed (from one of
455 | the R packages that you have installed) that may be related to some
456 | topic you're interested in. The RSiteSearch() function searches all
457 | R functions (including those in packages that you haven't yet installed)
458 | for functions related to the topic you are interested in.
459 |
460 | For example, if you want to know if there
461 | is a function to calculate the standard deviation of a set of
462 | numbers, you can search for the names of all installed functions containing
463 | the word "deviation" in their description by typing:
464 |
465 | .. highlight:: r
466 |
467 | ::
468 |
469 | > help.search("deviation")
470 | Help files with alias or concept or title matching
471 | 'deviation' using fuzzy matching:
472 |
473 | genefilter::rowSds
474 | Row variance and standard deviation of
475 | a numeric array
476 | nlme::pooledSD Extract Pooled Standard Deviation
477 | stats::mad Median Absolute Deviation
478 | stats::sd Standard Deviation
479 | vsn::meanSdPlot Plot row standard deviations versus row
480 |
481 | Among the functions that were found is the function sd() in the
482 | "stats" package (an R package that comes with the standard R
483 | installation), which is used for calculating the standard deviation.
484 |
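For example, having found the sd() function, we can use it to calculate
the standard deviation of the elements of the vector *myvector* that we
created earlier:

.. highlight:: r

::

    > sd(myvector)
    [1] 2.073644
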
485 | In the example above, the help.search() function found a relevant
486 | function (sd() here). However, if you did not find what you were looking
487 | for with help.search(), you could then use the RSiteSearch() function to
488 | see if a search of all functions described on the R website may find
489 | something relevant to the topic that you're interested in:
490 |
491 | .. highlight:: r
492 |
493 | ::
494 |
495 | > RSiteSearch("deviation")
496 |
497 | The results of the RSiteSearch() function will be hits to descriptions
498 | of R functions, as well as to R mailing list discussions of those
499 | functions.
500 |
501 | We can perform computations with R using objects such as scalars
502 | and vectors. For example, to calculate the average of the values in
503 | the vector *myvector* (ie. the average of 8, 6, 9, 10 and 5), we
504 | can use the mean() function:
505 |
506 | .. highlight:: r
507 |
508 | ::
509 |
510 | > mean(myvector)
511 | [1] 7.6
512 |
513 | We have been using built-in R functions such as mean(),
514 | length(), print(), plot(), etc. We can also create our own
515 | functions in R to do calculations that we want to carry out very
516 | often on different input data sets. For example, we can create a
517 | function to calculate the value of 20 plus the square of some input
518 | number:
519 |
520 | .. highlight:: r
521 |
522 | ::
523 |
524 | > myfunction <- function(x) { return(20 + (x*x)) }
525 |
526 | This function will calculate the square of a number (*x*), and then
527 | add 20 to that value. The return() statement returns the calculated
528 | value. Once you have typed in this function, the function is then
529 | available for use. For example, we can use the function for
530 | different input numbers (eg. 10, 25):
531 |
532 | .. highlight:: r
533 |
534 | ::
535 |
536 | > myfunction(10)
537 | [1] 120
538 | > myfunction(25)
539 | [1] 645
540 |
541 | To quit R, type:
542 |
543 | .. highlight:: r
544 |
545 | ::
546 |
547 | > q()
548 |
549 |
550 | Links and Further Reading
551 | -------------------------
552 |
553 | Some links are included here for further reading.
554 |
555 | For a more in-depth introduction to R, a good online tutorial is
556 | available on the "Kickstarting R" website,
557 | `cran.r-project.org/doc/contrib/Lemon-kickstart <http://cran.r-project.org/doc/contrib/Lemon-kickstart/>`_.
558 |
559 | There is another nice (slightly more in-depth) tutorial to R
560 | available on the "Introduction to R" website,
561 | `cran.r-project.org/doc/manuals/R-intro.html <http://cran.r-project.org/doc/manuals/R-intro.html>`_.
562 |
563 | Acknowledgements
564 | ----------------
565 |
566 | For very helpful comments and suggestions for improvements on the installation instructions, thank you very much to Friedrich Leisch and Phil Spector.
567 |
568 | Contact
569 | -------
570 |
571 | I would be very grateful if you could send me (`Avril Coghlan `_) corrections or suggestions for improvements to
572 | my email address alc@sanger.ac.uk.
573 |
574 | License
575 | -------
576 |
577 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License
578 | <http://creativecommons.org/licenses/by/3.0/>`_.
579 |
580 | .. |image3| image:: ../_static/image3.png
581 |
--------------------------------------------------------------------------------
/src/multivariateanalysis.rst:
--------------------------------------------------------------------------------
1 | Using R for Multivariate Analysis
2 | =================================
3 |
4 | Multivariate Analysis
5 | ---------------------
6 |
7 | This booklet tells you how to use the R statistical software to carry out some simple multivariate analyses,
8 | with a focus on principal components analysis (PCA) and linear discriminant analysis (LDA).
9 |
10 | This booklet assumes that the reader has some basic knowledge of multivariate analyses, and
11 | the principal focus of the booklet is not to explain multivariate analyses, but rather
12 | to explain how to carry out these analyses using R.
13 |
14 | If you are new to multivariate analysis, and want to learn more about any of the concepts
15 | presented here, I would highly recommend the Open University book
16 | "Multivariate Analysis" (product code M249/03), available
17 | from `the Open University Shop `_.
18 |
19 | In the examples in this booklet, I will be using data sets from the UCI Machine
20 | Learning Repository, `http://archive.ics.uci.edu/ml <http://archive.ics.uci.edu/ml>`_.
21 |
22 | There is a pdf version of this booklet available at
23 | `https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf <https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf>`_.
24 |
25 | If you like this booklet, you may also like to check out my booklet on using
26 | R for biomedical statistics,
27 | `http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/
28 | <http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/>`_,
29 | and my booklet on using R for time series analysis,
30 | `http://a-little-book-of-r-for-time-series.readthedocs.org/
31 | <http://a-little-book-of-r-for-time-series.readthedocs.org/>`_.
32 |
33 | Reading Multivariate Analysis Data into R
34 | -----------------------------------------
35 |
36 | The first thing that you will want to do to analyse your multivariate data will be to read
37 | it into R, and to plot the data. You can read data into R using the read.table() function.
38 |
39 | For example, the file `http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
40 | <http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data>`_
41 | contains data on concentrations of 13 different chemicals in wines grown in the same region in Italy that are
42 | derived from three different cultivars.
43 |
44 | The data set looks like this:
45 |
46 | .. highlight:: r
47 |
48 | ::
49 |
50 | 1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
51 | 1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
52 | 1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
53 | 1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
54 | 1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
55 | ...
56 |
57 | There is one row per wine sample.
58 | The first column contains the cultivar of a wine sample (labelled 1, 2 or 3), and the following thirteen columns
59 | contain the concentrations of the 13 different chemicals in that sample.
60 | The columns are separated by commas.
61 |
62 | When we read the file into R using the read.table() function, we need to use the "sep="
63 | argument in read.table() to tell it that the columns are separated by commas.
64 | That is, we can read in the file using the read.table() function as follows:
65 |
66 | .. highlight:: r
67 |
68 | ::
69 |
70 | > wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
71 | sep=",")
72 | > wine
73 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
74 | 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.640000 1.040 3.92 1065
75 | 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.380000 1.050 3.40 1050
76 | 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.680000 1.030 3.17 1185
77 | 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.800000 0.860 3.45 1480
78 | 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.320000 1.040 2.93 735
79 | ...
80 | 176 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.200000 0.590 1.56 835
81 | 177 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.300000 0.600 1.62 840
82 | 178 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.200000 0.610 1.60 560
83 |
84 | In this case the data on 178 samples of wine has been read into the variable 'wine'.
85 |
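You can check that the data set has the expected size by using the dim()
function, which returns the number of rows and columns of a data frame:

.. highlight:: r

::

    > dim(wine)
    [1] 178  14

That is, there are 178 rows (one per wine sample) and 14 columns (the
cultivar plus the 13 chemical concentrations).
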
86 | Plotting Multivariate Data
87 | --------------------------
88 |
89 | Once you have read a multivariate data set into R, the next step is usually to make a plot of the data.
90 |
91 | A Matrix Scatterplot
92 | ^^^^^^^^^^^^^^^^^^^^
93 |
94 | One common way of plotting multivariate data is to make a "matrix scatterplot", showing each pair of
95 | variables plotted against each other. We can use the "scatterplotMatrix()" function from the "car"
96 | R package to do this. To use this function, we first need to install the "car" R package
97 | (for instructions on how to install an R package, see `How to install an R package
98 | <./installr.html#how-to-install-an-r-package>`_).
99 |
100 | Once you have installed the "car" R package, you can load the "car" R package by typing:
101 |
102 | .. highlight:: r
103 |
104 | ::
105 |
106 | > library("car")
107 |
108 | You can then use the "scatterplotMatrix()" function to plot the multivariate data.
109 |
110 | To use the scatterplotMatrix() function, you need to give it as its input the variables
111 | that you want included in the plot. Say for example, that we just want to include the
112 | variables corresponding to the concentrations of the first five chemicals. These are stored in
113 | columns 2-6 of the variable "wine". We can extract just these columns from the variable
114 | "wine" by typing:
115 |
116 | ::
117 |
118 | > wine[2:6]
119 | V2 V3 V4 V5 V6
120 | 1 14.23 1.71 2.43 15.6 127
121 | 2 13.20 1.78 2.14 11.2 100
122 | 3 13.16 2.36 2.67 18.6 101
123 | 4 14.37 1.95 2.50 16.8 113
124 | 5 13.24 2.59 2.87 21.0 118
125 | ...
126 |
127 | To make a matrix scatterplot of just these five variables using the scatterplotMatrix() function we type:
128 |
129 | ::
130 |
131 | > scatterplotMatrix(wine[2:6])
132 |
133 |
134 | |image1|
135 |
136 |
137 | In this matrix scatterplot, the diagonal cells show histograms of each of the variables, in this
138 | case the concentrations of the first five chemicals (variables V2, V3, V4, V5, V6).
139 |
140 | Each of the off-diagonal cells is a scatterplot of two of the five chemicals, for example, the second cell in the
141 | first row is a scatterplot of V2 (y-axis) against V3 (x-axis).
142 |
143 | A Scatterplot with the Data Points Labelled by their Group
144 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
145 |
146 | If you see an interesting scatterplot for two variables in the matrix scatterplot, you may want to
147 | plot that scatterplot in more detail, with the data points labelled by their group (their cultivar in this case).
148 |
149 | For example, in the matrix scatterplot above, the cell in the third column of the fourth row down is a scatterplot
150 | of V5 (y-axis) against V4 (x-axis). If you look at this scatterplot, it appears that there may be a
151 | positive relationship between V5 and V4.
152 |
153 | We may therefore decide to examine the relationship between V5 and V4 more closely, by plotting a scatterplot
154 | of these two variables, with the data points labelled by their group (their cultivar). To plot a scatterplot
155 | of two variables, we can use the "plot" R function. The V4 and V5 variables are stored in the columns
156 | V4 and V5 of the variable "wine", so can be accessed by typing wine$V4 or wine$V5. Therefore, to plot
157 | the scatterplot, we type:
158 |
159 | ::
160 |
161 | > plot(wine$V4, wine$V5)
162 |
163 | |image2|
164 |
165 | If we want to label the data points by their group (the cultivar of wine here), we can use the "text" function
166 | in R to plot some text beside every data point. In this case, the cultivar of wine is stored in the column
167 | V1 of the variable "wine", so we type:
168 |
169 | ::
170 |
171 | > text(wine$V4, wine$V5, wine$V1, cex=0.7, pos=4, col="red")
172 |
173 | If you look at the help page for the "text" function, you will see that "pos=4" will plot the text just to the
174 | right of the symbol for a data point. The "cex=0.7" option will plot the text at 0.7 times the default size, and
175 | the "col=red" option will plot the text in red. This gives us the following plot:
176 |
177 | |image4|
178 |
179 | We can see from the scatterplot of V4 versus V5 that the wines from cultivar 2 seem to have
180 | lower values of V4 compared to the wines of cultivar 1.
181 |
182 | A Profile Plot
183 | ^^^^^^^^^^^^^^
184 |
185 | Another type of plot that is useful is a "profile plot", which shows the variation in each of the
186 | variables, by plotting the value of each of the variables for each of the samples.
187 |
188 | The function "makeProfilePlot()" below can be used to make a profile plot. This function requires
189 | the "RColorBrewer" library. To use this function, we first need to install the "RColorBrewer" R package
190 | (for instructions on how to install an R package, see `How to install an R package
191 | <./installr.html#how-to-install-an-r-package>`_).
192 |
193 | ::
194 |
195 | > makeProfilePlot <- function(mylist,names)
196 | {
197 | require(RColorBrewer)
198 | # find out how many variables we want to include
199 | numvariables <- length(mylist)
200 | # choose 'numvariables' colours from the "Set1" palette
201 | colours <- brewer.pal(numvariables,"Set1")
202 | # find out the minimum and maximum values of the variables:
203 | mymin <- 1e+20
204 | mymax <- -1e+20
205 | for (i in 1:numvariables)
206 | {
207 | vectori <- mylist[[i]]
208 | mini <- min(vectori)
209 | maxi <- max(vectori)
210 | if (mini < mymin) { mymin <- mini }
211 | if (maxi > mymax) { mymax <- maxi }
212 | }
213 | # plot the variables
214 | for (i in 1:numvariables)
215 | {
216 | vectori <- mylist[[i]]
217 | namei <- names[i]
218 | colouri <- colours[i]
219 | if (i == 1) { plot(vectori,col=colouri,type="l",ylim=c(mymin,mymax)) }
220 | else { points(vectori, col=colouri,type="l") }
221 | lastxval <- length(vectori)
222 | lastyval <- vectori[length(vectori)]
223 | text((lastxval-10),(lastyval),namei,col="black",cex=0.6)
224 | }
225 | }
226 |
227 | To use this function, you first need to copy and paste it into R. The arguments to the
228 | function are a list variable containing the variables that you want to plot, and
229 | a vector containing the names of those variables.
230 |
231 | For example, to make a profile plot of the concentrations of the first five chemicals in the wine samples
232 | (stored in columns V2, V3, V4, V5, V6 of variable "wine"), we type:
233 |
234 | ::
235 |
236 | > library(RColorBrewer)
237 | > names <- c("V2","V3","V4","V5","V6")
238 | > mylist <- list(wine$V2,wine$V3,wine$V4,wine$V5,wine$V6)
239 | > makeProfilePlot(mylist,names)
240 |
241 | |image5|
242 |
243 | It is clear from the profile plot that the mean and standard deviation for V6 is
244 | quite a lot higher than that for the other variables.
245 |
247 |
248 | Calculating Summary Statistics for Multivariate Data
249 | ----------------------------------------------------
250 |
251 | Another thing that you are likely to want to do is to calculate summary statistics such as the
252 | mean and standard deviation for each of the variables in your multivariate data set.
253 |
254 | .. sidebar:: sapply
255 |
256 | The "sapply()" function can be used to apply some other function to each column
257 | in a data frame, eg. sapply(mydataframe,sd) will calculate the standard deviation of
258 | each column in a dataframe "mydataframe".
259 |
260 | This is easy to do, using the "mean()" and "sd()" functions in R. For example, say we want
261 | to calculate the mean and standard deviations of each of the 13 chemical concentrations in the
262 | wine samples. These are stored in columns 2-14 of the variable "wine". So we type:
263 |
264 | ::
265 |
266 | > sapply(wine[2:14],mean)
267 | V2 V3 V4 V5 V6 V7
268 | 13.0006180 2.3363483 2.3665169 19.4949438 99.7415730 2.2951124
269 | V8 V9 V10 V11 V12 V13
270 | 2.0292697 0.3618539 1.5908989 5.0580899 0.9574494 2.6116854
271 | V14
272 | 746.8932584
273 |
274 | This tells us that the mean of variable V2 is 13.0006180, the mean of V3 is 2.3363483, and so on.
275 |
276 | Similarly, to get the standard deviations of the 13 chemical concentrations, we type:
277 |
278 | ::
279 |
280 | > sapply(wine[2:14],sd)
281 | V2 V3 V4 V5 V6 V7
282 | 0.8118265 1.1171461 0.2743440 3.3395638 14.2824835 0.6258510
283 | V8 V9 V10 V11 V12 V13
284 | 0.9988587 0.1244533 0.5723589 2.3182859 0.2285716 0.7099904
285 | V14
286 | 314.9074743
287 |
288 | We can see here that it would make sense to standardise the variables before comparing them, because they
289 | have very different standard deviations - the standard deviation of V14 is 314.9074743, while the standard deviation
290 | of V9 is just 0.1244533. Thus, in order to compare the variables, we need to standardise each variable so that
291 | it has a sample variance of 1 and sample mean of 0. We will explain below how to standardise the variables.
292 |
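As a quick preview of what is meant by standardising: one convenient way
to do this in R is the built-in scale() function, which centres each
column of a data frame to have mean 0 and scales it to have standard
deviation 1. For example, to standardise the 13 chemical concentrations,
you could type:

.. highlight:: r

::

    > standardisedconcentrations <- as.data.frame(scale(wine[2:14]))

If you then type sapply(standardisedconcentrations,sd), you should find
that each variable now has a standard deviation of exactly 1.
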
293 | Means and Variances Per Group
294 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
295 |
296 | It is often interesting to calculate the means and standard deviations for just the samples
297 | from a particular group, for example, for the wine samples from each cultivar. The cultivar
298 | is stored in the column "V1" of the variable "wine".
299 |
300 | To extract out the data for just cultivar 2, we can type:
301 |
302 | ::
303 |
304 | > cultivar2wine <- wine[wine$V1=="2",]
305 |
306 | We can then calculate the mean and standard deviations of the 13 chemicals' concentrations, for
307 | just the cultivar 2 samples:
308 |
309 | ::
310 |
311 | > sapply(cultivar2wine[2:14],mean)
312 | V2 V3 V4 V5 V6 V7 V8
313 | 12.278732 1.932676 2.244789 20.238028 94.549296 2.258873 2.080845
314 | V9 V10 V11 V12 V13 V14
315 | 0.363662 1.630282 3.086620 1.056282 2.785352 519.507042
316 | > sapply(cultivar2wine[2:14],sd)
317 | V2 V3 V4 V5 V6 V7 V8
318 | 0.5379642 1.0155687 0.3154673 3.3497704 16.7534975 0.5453611 0.7057008
319 | V9 V10 V11 V12 V13 V14
320 | 0.1239613 0.6020678 0.9249293 0.2029368 0.4965735 157.2112204
321 |
322 | You can calculate the mean and standard deviation of the 13 chemicals' concentrations for just cultivar 1 samples,
323 | or for just cultivar 3 samples, in a similar way.
324 |
325 | However, for convenience, you might want to use the function "printMeanAndSdByGroup()" below, which
326 | prints out the mean and standard deviation of the variables for each group in your data set:
327 |
328 | ::
329 |
330 | > printMeanAndSdByGroup <- function(variables,groupvariable)
331 | {
332 | # find the names of the variables
333 | variablenames <- c(names(groupvariable),names(as.data.frame(variables)))
334 | # within each group, find the mean of each variable
335 | groupvariable <- groupvariable[,1] # ensures groupvariable is not a list
336 | means <- aggregate(as.matrix(variables) ~ groupvariable, FUN = mean)
337 | names(means) <- variablenames
338 | print(paste("Means:"))
339 | print(means)
340 | # within each group, find the standard deviation of each variable:
341 | sds <- aggregate(as.matrix(variables) ~ groupvariable, FUN = sd)
342 | names(sds) <- variablenames
343 | print(paste("Standard deviations:"))
344 | print(sds)
345 | # within each group, find the number of samples:
346 | samplesizes <- aggregate(as.matrix(variables) ~ groupvariable, FUN = length)
347 | names(samplesizes) <- variablenames
348 | print(paste("Sample sizes:"))
349 | print(samplesizes)
350 | }
351 |
352 | To use the function "printMeanAndSdByGroup()", you first need to copy and paste it into R. The
353 | arguments of the function are the variables that you want to calculate means and standard deviations for,
354 | and the variable containing the group of each sample. For example, to calculate the mean and standard deviation
355 | for each of the 13 chemical concentrations, for each of the three different wine cultivars, we type:
356 |
357 | ::
358 |
359 | > printMeanAndSdByGroup(wine[2:14],wine[1])
360 | [1] "Means:"
361 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
362 | 1 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169 2.9823729 0.290000 1.899322 5.528305 1.0620339 3.157797 1115.7119
363 | 2 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873 2.0808451 0.363662 1.630282 3.086620 1.0562817 2.785352 519.5070
364 | 3 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750 0.7814583 0.447500 1.153542 7.396250 0.6827083 1.683542 629.8958
365 | [1] "Standard deviations:"
366 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
367 | 1 1 0.4621254 0.6885489 0.2271660 2.546322 10.49895 0.3389614 0.3974936 0.07004924 0.4121092 1.2385728 0.1164826 0.3570766 221.5208
368 | 2 2 0.5379642 1.0155687 0.3154673 3.349770 16.75350 0.5453611 0.7057008 0.12396128 0.6020678 0.9249293 0.2029368 0.4965735 157.2112
369 | 3 3 0.5302413 1.0879057 0.1846902 2.258161 10.89047 0.3569709 0.2935041 0.12413959 0.4088359 2.3109421 0.1144411 0.2721114 115.0970
370 | [1] "Sample sizes:"
371 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
372 | 1 1 59 59 59 59 59 59 59 59 59 59 59 59 59
373 | 2 2 71 71 71 71 71 71 71 71 71 71 71 71 71
374 | 3 3 48 48 48 48 48 48 48 48 48 48 48 48 48
375 |
376 | The function "printMeanAndSdByGroup()" also prints out the number of samples in each group. In this case,
377 | we see that there are 59 samples of cultivar 1, 71 of cultivar 2, and 48 of cultivar 3.
378 |
379 | Between-groups Variance and Within-groups Variance for a Variable
380 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
381 |
382 | If we want to calculate the within-groups variance for a particular variable (for example, for a particular
383 | chemical's concentration), we can use the function "calcWithinGroupsVariance()" below:
384 |
385 | ::
386 |
387 | > calcWithinGroupsVariance <- function(variable,groupvariable)
388 | {
389 | # find out how many values the group variable can take
390 | groupvariable2 <- as.factor(groupvariable[[1]])
391 | levels <- levels(groupvariable2)
392 | numlevels <- length(levels)
393 | # get the mean and standard deviation for each group:
394 | numtotal <- 0
395 | denomtotal <- 0
396 | for (i in 1:numlevels)
397 | {
398 | leveli <- levels[i]
399 | levelidata <- variable[groupvariable==leveli,]
400 | levelilength <- length(levelidata)
401 | # get the standard deviation for group i:
402 | sdi <- sd(levelidata)
403 | numi <- (levelilength - 1)*(sdi * sdi)
404 | denomi <- levelilength
405 | numtotal <- numtotal + numi
406 | denomtotal <- denomtotal + denomi
407 | }
408 | # calculate the within-groups variance
409 | Vw <- numtotal / (denomtotal - numlevels)
410 | return(Vw)
411 | }
412 |
413 | .. Checked that this formula is correct.
414 |
415 | You will need to copy and paste this function into R before you can use it.
416 | For example, to calculate the within-groups variance of the variable V2 (the concentration of the first chemical),
417 | we type:
418 |
419 | ::
420 |
421 | > calcWithinGroupsVariance(wine[2],wine[1])
422 | [1] 0.2620525
423 |
424 | Thus, the within-groups variance for V2 is 0.2620525.
425 |
426 | We can calculate the between-groups variance for a particular variable (eg. V2) using the function
427 | "calcBetweenGroupsVariance()" below:
428 |
429 | ::
430 |
431 | > calcBetweenGroupsVariance <- function(variable,groupvariable)
432 | {
433 | # find out how many values the group variable can take
434 | groupvariable2 <- as.factor(groupvariable[[1]])
435 | levels <- levels(groupvariable2)
436 | numlevels <- length(levels)
437 | # calculate the overall grand mean:
438 | grandmean <- mean(variable)
439 | # get the mean and standard deviation for each group:
440 | numtotal <- 0
441 | denomtotal <- 0
442 | for (i in 1:numlevels)
443 | {
444 | leveli <- levels[i]
445 | levelidata <- variable[groupvariable==leveli,]
446 | levelilength <- length(levelidata)
447 | # get the mean and standard deviation for group i:
448 | meani <- mean(levelidata)
449 | sdi <- sd(levelidata)
450 | numi <- levelilength * ((meani - grandmean)^2)
451 | denomi <- levelilength
452 | numtotal <- numtotal + numi
453 | denomtotal <- denomtotal + denomi
454 | }
455 | # calculate the between-groups variance
456 | Vb <- numtotal / (numlevels - 1)
457 | Vb <- Vb[[1]]
458 | return(Vb)
459 | }
460 |
478 |
479 | Once you have copied and pasted this function into R, you can use it to calculate the between-groups
480 | variance for a variable such as V2:
481 |
482 | ::
483 |
484 | > calcBetweenGroupsVariance (wine[2],wine[1])
485 | [1] 35.39742
486 |
487 | Thus, the between-groups variance of V2 is 35.39742.
488 |
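489 | As a sanity check, the between-groups sum of squares plus the within-groups sum of squares should equal
490 | the total sum of squares for a variable; that is, Vb*(G-1) + Vw*(N-G) should equal the total sum of
491 | squares, where G is the number of groups and N is the number of samples. We can verify this in R
492 | (a quick sketch, assuming the two functions above have been copied and pasted into R):
493 |
494 | ::
495 |
496 | > Vw <- calcWithinGroupsVariance(wine[2],wine[1])
497 | > Vb <- calcBetweenGroupsVariance(wine[2],wine[1])
498 | > G <- 3          # the number of groups (cultivars)
499 | > N <- nrow(wine) # the number of samples
500 | > Vb*(G-1) + Vw*(N-G)
501 | [1] 116.654
502 | > sum((wine$V2 - mean(wine$V2))^2) # the total sum of squares for V2
503 | [1] 116.654
504 |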
489 | We can calculate the "separation" achieved by a variable as its between-groups variance divided by its
490 | within-groups variance. Thus, the separation achieved by V2 is calculated as:
491 |
492 | ::
493 |
494 | > 35.39742/0.2620525
495 | [1] 135.0776
496 |
497 | Note that we can also obtain the within-groups and between-groups variances from the output of a
498 | one-way ANOVA:
499 |
500 | ::
501 |
502 | > summary(aov(wine[,2] ~ as.factor(wine[,1])))
503 |                       Df Sum Sq Mean Sq F value    Pr(>F)
504 | as.factor(wine[, 1])   2 70.795  35.397  135.08 < 2.2e-16 ***
505 | Residuals            175 45.859   0.262
506 | ---
507 | Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1
508 |
509 | Here the within-groups variance is 0.262 (the mean square of the residuals) and the between-groups
510 | variance is 35.397. Their ratio, 135.08, is the F statistic, which is the same as the separation
511 | calculated above.
509 |
510 | If you want to calculate the separations achieved by all of the variables in a multivariate data set,
511 | you can use the function "calcSeparations()" below:
512 |
513 | ::
514 |
515 | > calcSeparations <- function(variables,groupvariable)
516 | {
517 | # find out how many variables we have
518 | variables <- as.data.frame(variables)
519 | numvariables <- length(variables)
520 | # find the variable names
521 | variablenames <- colnames(variables)
522 | # calculate the separation for each variable
523 | for (i in 1:numvariables)
524 | {
525 | variablei <- variables[i]
526 | variablename <- variablenames[i]
527 | Vw <- calcWithinGroupsVariance(variablei, groupvariable)
528 | Vb <- calcBetweenGroupsVariance(variablei, groupvariable)
529 | sep <- Vb/Vw
530 | print(paste("variable",variablename,"Vw=",Vw,"Vb=",Vb,"separation=",sep))
531 | }
532 | }
533 |
536 | For example, to calculate the separations for each of the 13 chemical concentrations, we type:
537 |
538 | ::
539 |
540 | > calcSeparations(wine[2:14],wine[1])
541 | [1] "variable V2 Vw= 0.262052469153907 Vb= 35.3974249602692 separation= 135.0776242428"
542 | [1] "variable V3 Vw= 0.887546796746581 Vb= 32.7890184869213 separation= 36.9434249631837"
543 | [1] "variable V4 Vw= 0.0660721013425184 Vb= 0.879611357248741 separation= 13.312901199991"
544 | [1] "variable V5 Vw= 8.00681118121156 Vb= 286.41674636309 separation= 35.7716374073093"
545 | [1] "variable V6 Vw= 180.65777316441 Vb= 2245.50102788939 separation= 12.4295843381499"
546 | [1] "variable V7 Vw= 0.191270475224227 Vb= 17.9283572942847 separation= 93.7330096203673"
547 | [1] "variable V8 Vw= 0.274707514337437 Vb= 64.2611950235641 separation= 233.925872681549"
548 | [1] "variable V9 Vw= 0.0119117022132797 Vb= 0.328470157461624 separation= 27.5754171469659"
549 | [1] "variable V10 Vw= 0.246172943795542 Vb= 7.45199550777775 separation= 30.2713831702276"
550 | [1] "variable V11 Vw= 2.28492308133354 Vb= 275.708000822304 separation= 120.664018441003"
551 | [1] "variable V12 Vw= 0.0244876469432414 Vb= 2.48100991493829 separation= 101.3167953903"
552 | [1] "variable V13 Vw= 0.160778729560982 Vb= 30.5435083544253 separation= 189.972320578889"
553 | [1] "variable V14 Vw= 29707.6818705169 Vb= 6176832.32228483 separation= 207.920373902178"
554 |
555 | Thus, the individual variable which gives the greatest separations between the groups (the wine cultivars) is
556 | V8 (separation 233.9). As we will discuss below, the purpose of linear discriminant analysis (LDA) is to find the
557 | linear combination of the individual variables that will give the greatest separation between the groups (cultivars here).
558 | This hopefully will give a better separation than the best separation achievable by any individual variable (233.9
559 | for V8 here).
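560 |
561 | If you would like to work with the separations programmatically, rather than reading them off the
562 | printed output, you can store them in a named vector instead (a small sketch, assuming the
563 | "calcWithinGroupsVariance()" and "calcBetweenGroupsVariance()" functions above have been copied and
564 | pasted into R):
565 |
566 | ::
567 |
568 | > seps <- sapply(2:14, function(i)
569 |     calcBetweenGroupsVariance(wine[i],wine[1]) / calcWithinGroupsVariance(wine[i],wine[1]))
570 | > names(seps) <- colnames(wine)[2:14]
571 | > which.max(seps) # the variable with the greatest separation
572 | V8
573 |  7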
560 |
561 | Between-groups Covariance and Within-groups Covariance for Two Variables
562 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
563 |
564 | If you have a multivariate data set with several variables describing sampling units from different groups,
565 | such as the wine samples from different cultivars, it is often of interest to calculate the within-groups
566 | covariance and between-groups covariance for pairs of the variables.
567 |
568 | This can be done using the following functions, which you will need to copy and paste into R to use them:
569 |
570 | ::
571 |
572 | > calcWithinGroupsCovariance <- function(variable1,variable2,groupvariable)
573 | {
574 | # find out how many values the group variable can take
575 | groupvariable2 <- as.factor(groupvariable[[1]])
576 | levels <- levels(groupvariable2)
577 | numlevels <- length(levels)
578 | # get the covariance of variable 1 and variable 2 for each group:
579 | Covw <- 0
580 | for (i in 1:numlevels)
581 | {
582 | leveli <- levels[i]
583 | levelidata1 <- variable1[groupvariable==leveli,]
584 | levelidata2 <- variable2[groupvariable==leveli,]
585 | mean1 <- mean(levelidata1)
586 | mean2 <- mean(levelidata2)
587 | levelilength <- length(levelidata1)
588 | # get the covariance for this group:
589 | term1 <- 0
590 | for (j in 1:levelilength)
591 | {
592 | term1 <- term1 + ((levelidata1[j] - mean1)*(levelidata2[j] - mean2))
593 | }
594 | Cov_groupi <- term1 # covariance for this group
595 | Covw <- Covw + Cov_groupi
596 | }
597 | totallength <- nrow(variable1)
598 | Covw <- Covw / (totallength - numlevels)
599 | return(Covw)
600 | }
601 |
602 | This function implements the within-groups covariance formula given in Krzanowski's "Principles of
603 | Multivariate Analysis" (pages 294-295).
605 |
606 | For example, to calculate the within-groups covariance for variables V8 and V11, we type:
607 |
608 | ::
609 |
610 | > calcWithinGroupsCovariance(wine[8],wine[11],wine[1])
611 | [1] 0.2866783
612 |
613 | Similarly, we can calculate the between-groups covariance for a pair of variables using the function
614 | "calcBetweenGroupsCovariance()" below:
615 |
616 | ::
614 |
615 | > calcBetweenGroupsCovariance <- function(variable1,variable2,groupvariable)
616 | {
617 | # find out how many values the group variable can take
618 | groupvariable2 <- as.factor(groupvariable[[1]])
619 | levels <- levels(groupvariable2)
620 | numlevels <- length(levels)
621 | # calculate the grand means
622 | variable1mean <- mean(variable1)
623 | variable2mean <- mean(variable2)
624 | # calculate the between-groups covariance
625 | Covb <- 0
626 | for (i in 1:numlevels)
627 | {
628 | leveli <- levels[i]
629 | levelidata1 <- variable1[groupvariable==leveli,]
630 | levelidata2 <- variable2[groupvariable==leveli,]
631 | mean1 <- mean(levelidata1)
632 | mean2 <- mean(levelidata2)
633 | levelilength <- length(levelidata1)
634 | term1 <- (mean1 - variable1mean)*(mean2 - variable2mean)*(levelilength)
635 | Covb <- Covb + term1
636 | }
637 | Covb <- Covb / (numlevels - 1)
638 | Covb <- Covb[[1]]
639 | return(Covb)
640 | }
641 |
642 | This function implements the between-groups covariance formula given in Krzanowski's "Principles of
643 | Multivariate Analysis" (pages 294-295).
645 |
646 | For example, to calculate the between-groups covariance for variables V8 and V11, we type:
647 |
648 | ::
649 |
650 | > calcBetweenGroupsCovariance(wine[8],wine[11],wine[1])
651 | [1] -60.41077
652 |
653 | Thus, for V8 and V11, the between-groups covariance is -60.41 and the within-groups covariance is 0.29.
654 | Since the within-groups covariance is positive (0.29), it means V8 and V11 are positively related within groups:
655 | for individuals from the same group, individuals with a high value of V8 tend to have a high value of V11,
656 | and vice versa. Since the between-groups covariance is negative (-60.41), V8 and V11 are negatively related between groups:
657 | groups with a high mean value of V8 tend to have a low mean value of V11, and vice versa.
658 |
659 | Calculating Correlations for Multivariate Data
660 | ----------------------------------------------
661 |
662 | It is often of interest to investigate whether any of the variables in a multivariate data set are
663 | significantly correlated.
664 |
665 | To calculate the linear (Pearson) correlation coefficient for a pair of variables, you can use
666 | the "cor.test()" function in R. For example, to calculate the correlation coefficient for the first
667 | two chemicals' concentrations, V2 and V3, we type:
668 |
669 | ::
670 |
671 | > cor.test(wine$V2, wine$V3)
672 | Pearson's product-moment correlation
673 | data: wine$V2 and wine$V3
674 | t = 1.2579, df = 176, p-value = 0.2101
675 | alternative hypothesis: true correlation is not equal to 0
676 | 95 percent confidence interval:
677 | -0.05342959 0.23817474
678 | sample estimates:
679 | cor
680 | 0.09439694
681 |
682 | This tells us that the correlation coefficient is about 0.094, which is a very weak correlation.
683 | Furthermore, the P-value for the statistical test of whether the correlation coefficient is
684 | significantly different from zero is 0.21. This is much greater than 0.05 (which we can use here
685 | as a cutoff for statistical significance), so there is very weak evidence that the correlation is non-zero.
686 |
687 | If you have a lot of variables, you can use "cor.test()" to calculate the correlation coefficient
688 | for each pair of variables, but you might just be interested in finding out which pairs of variables
689 | are most highly correlated. For this you can use the function "mosthighlycorrelated()" below.
690 |
691 | The function "mosthighlycorrelated()" will print out the linear correlation coefficients for
692 | each pair of variables in your data set, in order of the correlation coefficient. This lets you see
693 | very easily which pairs of variables are most highly correlated.
694 |
695 | ::
696 |
697 | > mosthighlycorrelated <- function(mydataframe,numtoreport)
698 | {
699 | # find the correlations
700 | cormatrix <- cor(mydataframe)
701 | # set the correlations on the diagonal or lower triangle to zero,
702 | # so they will not be reported as the highest ones:
703 | diag(cormatrix) <- 0
704 | cormatrix[lower.tri(cormatrix)] <- 0
705 | # flatten the matrix into a dataframe for easy sorting
706 | fm <- as.data.frame(as.table(cormatrix))
707 | # assign human-friendly names
708 | names(fm) <- c("First.Variable", "Second.Variable","Correlation")
709 | # sort and print the top n correlations
710 | head(fm[order(abs(fm$Correlation),decreasing=T),],n=numtoreport)
711 | }
712 |
713 | To use this function, you will first have to copy and paste it into R. The arguments of the function
714 | are the variables that you want to calculate the correlations for, and the number of top correlation
715 | coefficients to print out (for example, you can tell it to print out the largest ten correlation coefficients, or
716 | the largest 20).
717 |
718 | For example, to calculate correlation coefficients between the concentrations of the 13 chemicals
719 | in the wine samples, and to print out the top 10 pairwise correlation coefficients, you can type:
720 |
721 | ::
722 |
723 | > mosthighlycorrelated(wine[2:14], 10)
724 | First.Variable Second.Variable Correlation
725 | 84 V7 V8 0.8645635
726 | 150 V8 V13 0.7871939
727 | 149 V7 V13 0.6999494
728 | 111 V8 V10 0.6526918
729 | 157 V2 V14 0.6437200
730 | 110 V7 V10 0.6124131
731 | 154 V12 V13 0.5654683
732 | 132 V3 V12 -0.5612957
733 | 118 V2 V11 0.5463642
734 | 137 V8 V12 0.5434786
735 |
736 | This tells us that the pair of variables with the highest linear correlation coefficient are
737 | V7 and V8 (correlation = 0.86 approximately).
738 |
739 | Standardising Variables
740 | -----------------------
741 |
742 | If you want to compare different variables that have different units, or that have very different variances,
743 | it is a good idea to first standardise the variables.
744 |
745 | For example, we found above that the concentrations of the 13 chemicals in the wine samples show a wide range of
746 | standard deviations, from 0.1244533 for V9 (variance 0.01548862) to 314.9074743 for V14 (variance 99166.72).
747 | This is a range of approximately 6,402,554-fold in the variances.
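748 |
749 | You can verify this yourself (a quick sketch, assuming the wine data are stored in the "wine" data frame):
750 |
751 | ::
752 |
753 | > concvars <- sapply(wine[2:14], var) # the variance of each chemical's concentration
754 | > max(concvars) / min(concvars)      # approximately 6.4 million-fold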
748 |
749 | As a result, it is not a good idea to use the unstandardised chemical concentrations as the input for a
750 | principal component analysis (PCA, see below) of the
751 | wine samples, as if you did that, the first principal component would be dominated by the variables
752 | which show the largest variances, such as V14.
753 |
754 | Thus, it would be a better idea to first standardise the variables so that they all have variance 1 and mean 0,
755 | and to then carry out the principal component analysis on the standardised data. This would allow us to
756 | find the principal components that provide the best low-dimensional representation of the variation in the
757 | original data, without being overly biased by those variables that show the most variance in the original data.
758 |
759 | You can standardise variables in R using the "scale()" function.
760 |
761 | For example, to standardise the concentrations of the 13 chemicals in the wine samples, we type:
762 |
763 | ::
764 |
765 | > standardisedconcentrations <- as.data.frame(scale(wine[2:14]))
766 |
767 | Note that we use the "as.data.frame()" function to convert the output of "scale()" into a
768 | "data frame", which is the same type of R variable as the "wine" variable.
769 |
770 | We can check that each of the standardised variables stored in "standardisedconcentrations"
771 | has a mean of 0 and a standard deviation of 1 by typing:
772 |
773 | ::
774 |
775 | > sapply(standardisedconcentrations,mean)
776 | V2 V3 V4 V5 V6 V7
777 | -8.591766e-16 -6.776446e-17 8.045176e-16 -7.720494e-17 -4.073935e-17 -1.395560e-17
778 | V8 V9 V10 V11 V12 V13
779 | 6.958263e-17 -1.042186e-16 -1.221369e-16 3.649376e-17 2.093741e-16 3.003459e-16
780 | V14
781 | -1.034429e-16
782 | > sapply(standardisedconcentrations,sd)
783 | V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
784 | 1 1 1 1 1 1 1 1 1 1 1 1 1
785 |
786 | We see that the means of the standardised variables are all very tiny numbers and so are
787 | essentially equal to 0, and the standard deviations of the standardised variables are all equal to 1.
788 |
789 | Principal Component Analysis
790 | ----------------------------
791 |
792 | The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a
793 | multivariate data set. For example, in the case of the wine data set, we have 13 chemical concentrations describing
794 | wine samples from three different cultivars. We can carry out a principal component analysis to investigate
795 | whether we can capture most of the variation between samples using a smaller number of new variables (principal
796 | components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations.
797 |
798 | To carry out a principal component analysis (PCA) on a multivariate data set, the first step is often to standardise
799 | the variables under study using the "scale()" function (see above). This is necessary if the input variables
800 | have very different variances, which is true in this case as the concentrations of the 13 chemicals have
801 | very different variances (see above).
802 |
803 | Once you have standardised your variables, you can carry out a principal component analysis using the "prcomp()"
804 | function in R.
805 |
806 | For example, to standardise the concentrations of the 13 chemicals in the wine samples, and carry out a
807 | principal components analysis on the standardised concentrations, we type:
808 |
809 | ::
810 |
811 | > standardisedconcentrations <- as.data.frame(scale(wine[2:14])) # standardise the variables
812 | > wine.pca <- prcomp(standardisedconcentrations) # do a PCA
813 |
814 | You can get a summary of the principal component analysis results using the "summary()" function on the
815 | output of "prcomp()":
816 |
817 | ::
818 |
819 | > summary(wine.pca)
820 | Importance of components:
821 | PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
822 | Standard deviation 2.169 1.580 1.203 0.9586 0.9237 0.8010 0.7423 0.5903 0.5375 0.5009
823 | Proportion of Variance 0.362 0.192 0.111 0.0707 0.0656 0.0494 0.0424 0.0268 0.0222 0.0193
824 | Cumulative Proportion 0.362 0.554 0.665 0.7360 0.8016 0.8510 0.8934 0.9202 0.9424 0.9617
825 | PC11 PC12 PC13
826 | Standard deviation 0.4752 0.4108 0.32152
827 | Proportion of Variance 0.0174 0.0130 0.00795
828 | Cumulative Proportion 0.9791 0.9920 1.00000
829 |
830 | This gives us the standard deviation of each component, and the proportion of variance explained by
831 | each component. The standard deviation of the components is stored in a named element called "sdev" of the output
832 | variable made by "prcomp":
833 |
834 | ::
835 |
836 | > wine.pca$sdev
837 | [1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128 0.5903367
838 | [9] 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244
839 |
840 | The total variance explained by the components is the sum of the variances of the components:
841 |
842 | ::
843 |
844 | > sum((wine.pca$sdev)^2)
845 | [1] 13
846 |
847 | In this case, we see that the total variance is 13, which is equal to the number of standardised variables (13 variables).
848 | This is because for standardised data, the variance of each standardised variable is 1. The total variance is equal to the sum
849 | of the variances of the individual variables, and since the variance of each standardised variable is 1, the
850 | total variance should be equal to the number of variables (13 here).
851 |
852 | Deciding How Many Principal Components to Retain
853 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
854 |
855 | In order to decide how many principal components should be retained,
856 | it is common to summarise the results of a principal components analysis by making a scree plot, which we
857 | can do in R using the "screeplot()" function:
858 |
859 | ::
860 |
861 | > screeplot(wine.pca, type="lines")
862 |
863 | |image6|
864 |
865 | The most obvious change in slope in the scree plot occurs at component 4, which is the "elbow" of the
866 | scree plot. Therefore, it could be argued on the basis of the scree plot that the first three
867 | components should be retained.
868 |
869 | Another way of deciding how many components to retain is to use Kaiser's criterion:
870 | that we should only retain principal components for which the variance is above 1 (when principal
871 | component analysis was applied to standardised data). We can check this by finding the variance of each
872 | of the principal components:
873 |
874 | ::
875 |
876 | > (wine.pca$sdev)^2
877 | [1] 4.7058503 2.4969737 1.4460720 0.9189739 0.8532282 0.6416570 0.5510283 0.3484974
878 | [9] 0.2888799 0.2509025 0.2257886 0.1687702 0.1033779
879 |
880 | We see that the variance is above 1 for principal components 1, 2, and 3 (which have variances
881 | 4.71, 2.50, and 1.45, respectively). Therefore, using Kaiser's criterion, we would retain the first
882 | three principal components.
883 |
884 | A third way to decide how many principal components to retain is to decide to keep the number of
885 | components required to explain at least some minimum amount of the total variance. For example, if
886 | it is important to explain at least 80% of the variance, we would retain the first five principal components,
887 | as we can see from the output of "summary(wine.pca)" that the first five principal components
888 | explain 80.2% of the variance (while the first four components explain just 73.6%, so are not sufficient).
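889 |
890 | You can find this number of components directly from the component variances (a quick sketch, assuming
891 | "wine.pca" was produced by "prcomp()" as above):
892 |
893 | ::
894 |
895 | > cumprop <- cumsum((wine.pca$sdev)^2) / sum((wine.pca$sdev)^2) # cumulative proportion of variance
896 | > min(which(cumprop >= 0.80)) # components needed to explain at least 80% of the variance
897 | [1] 5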
889 |
890 | Loadings for the Principal Components
891 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
892 |
893 | The loadings for the principal components are stored in a named element "rotation" of the variable
894 | returned by "prcomp()". This contains a matrix with the loadings of each principal component, where
895 | the first column in the matrix contains the loadings for the first principal component, the second
896 | column contains the loadings for the second principal component, and so on.
897 |
898 | Therefore, to obtain the loadings for the first principal component in our
899 | analysis of the 13 chemical concentrations in wine samples, we type:
900 |
901 | ::
902 |
903 | > wine.pca$rotation[,1]
904 | V2 V3 V4 V5 V6 V7
905 | -0.144329395 0.245187580 0.002051061 0.239320405 -0.141992042 -0.394660845
906 | V8 V9 V10 V11 V12 V13
907 | -0.422934297 0.298533103 -0.313429488 0.088616705 -0.296714564 -0.376167411
908 | V14
909 | -0.286752227
910 |
911 | This means that the first principal component is a linear combination of the variables:
912 | -0.144*Z2 + 0.245*Z3 + 0.002*Z4 + 0.239*Z5 - 0.142*Z6 - 0.395*Z7 - 0.423*Z8 + 0.299*Z9
913 | -0.313*Z10 + 0.089*Z11 - 0.297*Z12 - 0.376*Z13 - 0.287*Z14, where Z2, Z3, Z4...Z14 are
914 | the standardised versions of the variables V2, V3, V4...V14 (that each
915 | have mean of 0 and variance of 1).
916 |
917 | Note that the square of the loadings sum to 1, as this is a constraint used in calculating the loadings:
918 |
919 | ::
920 |
921 | > sum((wine.pca$rotation[,1])^2)
922 | [1] 1
923 |
924 | To calculate the values of the first principal component, we can define our own function to calculate
925 | a principal component given the loadings and the input variables' values:
926 |
927 | ::
928 |
929 | > calcpc <- function(variables,loadings)
930 | {
931 | # make sure the variables are in a data frame
932 | variables <- as.data.frame(variables)
933 | # find the number of samples in the data set
934 | numsamples <- nrow(variables)
934 | # make a vector to store the component
935 | pc <- numeric(numsamples)
936 | # find the number of variables
937 | numvariables <- length(variables)
938 | # calculate the value of the component for each sample
939 | for (i in 1:numsamples)
940 | {
941 | valuei <- 0
942 | for (j in 1:numvariables)
943 | {
944 | valueij <- variables[i,j]
945 | loadingj <- loadings[j]
946 | valuei <- valuei + (valueij * loadingj)
947 | }
948 | pc[i] <- valuei
949 | }
950 | return(pc)
951 | }
952 |
953 | We can then use the function to calculate the values of the first principal component for each sample in our
954 | wine data:
955 |
956 | ::
957 |
958 | > calcpc(standardisedconcentrations, wine.pca$rotation[,1])
959 | [1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373 -2.44220051
960 | [8] -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688 -2.10751729 -3.44842921
961 | [15] -4.30065228 -2.29870383 -2.16584568 -1.89362947 -3.53202167 -2.07865856 -3.11561376
962 | [22] -1.08351361 -2.52809263 -1.64036108 -1.75662066 -0.98729406 -1.77028387 -1.23194878
963 | [29] -2.18225047 -2.24976267 -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118
964 | [36] -1.89847087 -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
965 | ...
966 |
967 | In fact, the values of the first principal component are stored in the variable wine.pca$x[,1]
968 | that was returned by the "prcomp()" function, so we can compare those values to the ones that we
969 | calculated, and they should agree:
970 |
971 | ::
972 |
973 | > wine.pca$x[,1]
974 | [1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373 -2.44220051
975 | [8] -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688 -2.10751729 -3.44842921
976 | [15] -4.30065228 -2.29870383 -2.16584568 -1.89362947 -3.53202167 -2.07865856 -3.11561376
977 | [22] -1.08351361 -2.52809263 -1.64036108 -1.75662066 -0.98729406 -1.77028387 -1.23194878
978 | [29] -2.18225047 -2.24976267 -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118
979 | [36] -1.89847087 -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
980 | ...
981 |
982 | We see that they do agree.
983 |
984 | The first principal component has highest (in absolute value) loadings for V8 (-0.423), V7 (-0.395), V13 (-0.376),
985 | V10 (-0.313), V12 (-0.297), V14 (-0.287), V9 (0.299), V3 (0.245), and V5 (0.239). The loadings for V8, V7, V13,
986 | V10, V12 and V14 are negative, while those for V9, V3, and V5 are positive. Therefore, an interpretation of the
987 | first principal component is that it represents a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
988 | and the concentrations of V9, V3 and V5.
989 |
990 | Similarly, we can obtain the loadings for the second principal component by typing:
991 |
992 | ::
993 |
994 | > wine.pca$rotation[,2]
995 | V2 V3 V4 V5 V6 V7
996 | 0.483651548 0.224930935 0.316068814 -0.010590502 0.299634003 0.065039512
997 | V8 V9 V10 V11 V12 V13
998 | -0.003359812 0.028779488 0.039301722 0.529995672 -0.279235148 -0.164496193
999 | V14
1000 | 0.364902832
1001 |
1002 | This means that the second principal component is a linear combination of the variables:
1003 | 0.484*Z2 + 0.225*Z3 + 0.316*Z4 - 0.011*Z5 + 0.300*Z6 + 0.065*Z7 - 0.003*Z8 + 0.029*Z9
1004 | + 0.039*Z10 + 0.530*Z11 - 0.279*Z12 - 0.164*Z13 + 0.365*Z14, where Z2, Z3, Z4...Z14
1005 | are the standardised versions of variables V2, V3, ... V14 that each have mean 0 and variance 1.
1006 |
1007 | Note that the square of the loadings sum to 1, as above:
1008 |
1009 | ::
1010 |
1011 | > sum((wine.pca$rotation[,2])^2)
1012 | [1] 1
1013 |
1014 | The second principal component has highest loadings (in absolute value) for V11 (0.530), V2 (0.484), V14 (0.365), V4 (0.316),
1015 | V6 (0.300), V12 (-0.279), and V3 (0.225). The loadings for V11, V2, V14, V4, V6 and V3 are positive, while
1016 | the loading for V12 is negative. Therefore, an interpretation of the second principal component is that
1017 | it represents a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of
1018 | V12. Note that the loadings for V11 (0.530) and V2 (0.484) are the largest, so the contrast is mainly between
1019 | the concentrations of V11 and V2, and the concentration of V12.
1020 |
1021 | Scatterplots of the Principal Components
1022 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1023 |
1024 | The values of the principal components are stored in a named element "x" of the variable returned by
1025 | "prcomp()". This contains a matrix with the principal components, where the first column in the matrix
1026 | contains the first principal component, the second column the second component, and so on.
1027 |
1028 | Thus, in our example, "wine.pca$x[,1]" contains the first principal component, and
1029 | "wine.pca$x[,2]" contains the second principal component.
1030 |
1031 | We can make a scatterplot of the first two principal components, and label the data points with the cultivar that the wine
1032 | samples come from, by typing:
1033 |
1034 | ::
1035 |
1036 | > plot(wine.pca$x[,1],wine.pca$x[,2]) # make a scatterplot
1037 | > text(wine.pca$x[,1],wine.pca$x[,2], wine$V1, cex=0.7, pos=4, col="red") # add labels
1038 |
1039 | |image7|
1040 |
1041 | The scatterplot shows the first principal component on the x-axis, and the second principal
1042 | component on the y-axis. We can see from the scatterplot that wine samples of cultivar 1
1043 | have much lower values of the first principal component than wine samples of cultivar 3.
1044 | Therefore, the first principal component separates wine samples of cultivar 1 from those
1045 | of cultivar 3.
1046 |
1047 | We can also see that wine samples of cultivar 2 have much higher values of the second
1048 | principal component than wine samples of cultivars 1 and 3. Therefore, the second principal
1049 | component separates samples of cultivar 2 from samples of cultivars 1 and 3.
1050 |
1051 | Therefore, the first two principal components are reasonably useful for distinguishing wine
1052 | samples of the three different cultivars.
1053 |
1054 | Above, we interpreted the first principal component as a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
1055 | and the concentrations of V9, V3 and V5. We can check whether this makes sense in terms of the
1056 | concentrations of these chemicals in the different cultivars, by printing out the means of the
1057 | standardised concentration variables in each cultivar, using the "printMeanAndSdByGroup()" function (see above):
1058 |
1059 | ::
1060 |
1061 | > printMeanAndSdByGroup(standardisedconcentrations,wine[1])
1062 | [1] "Means:"
1063 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1064 | 1 1 0.9166093 -0.2915199 0.3246886 -0.7359212 0.46192317 0.87090552 0.95419225 -0.57735640 0.5388633 0.2028288 0.4575567 0.7691811 1.1711967
1065 | 2 2 -0.8892116 -0.3613424 -0.4437061 0.2225094 -0.36354162 -0.05790375 0.05163434 0.01452785 0.0688079 -0.8503999 0.4323908 0.2446043 -0.7220731
1066 | 3 3 0.1886265 0.8928122 0.2572190 0.5754413 -0.03004191 -0.98483874 -1.24923710 0.68817813 -0.7641311 1.0085728 -1.2019916 -1.3072623 -0.3715295
1067 |
1068 | Does it make sense that the first principal component can separate cultivar 1 from cultivar 3?
1069 | In cultivar 1, the mean values of V8 (0.954), V7 (0.871), V13 (0.769), V10 (0.539), V12 (0.458) and V14 (1.171)
1070 | are very high compared to the mean values of V9 (-0.577), V3 (-0.292) and V5 (-0.736).
1071 | In cultivar 3, the mean values of V8 (-1.249), V7 (-0.985), V13 (-1.307), V10 (-0.764), V12 (-1.202) and V14 (-0.372)
1072 | are very low compared to the mean values of V9 (0.688), V3 (0.893) and V5 (0.575).
1073 | Therefore, it does make sense that principal component 1 is a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
1074 | and the concentrations of V9, V3 and V5; and that principal component 1 can separate cultivar 1 from cultivar 3.
1075 |
1076 | Above, we interpreted the second principal component as a contrast between the concentrations of V11,
1077 | V2, V14, V4, V6 and V3, and the concentration of V12.
1078 | In the light of the mean values of these variables in the different cultivars, does
1079 | it make sense that the second principal component can separate cultivar 2 from cultivars 1 and 3?
1080 | In cultivar 1, the mean values of V11 (0.203), V2 (0.917), V14 (1.171), V4 (0.325), V6 (0.462) and V3 (-0.292)
1081 | are not very different from the mean value of V12 (0.458).
1082 | In cultivar 3, the mean values of V11 (1.009), V2 (0.189), V14 (-0.372), V4 (0.257), V6 (-0.030) and V3 (0.893)
1083 | are also not very different from the mean value of V12 (-1.202).
1084 | In contrast, in cultivar 2, the mean values of V11 (-0.850), V2 (-0.889), V14 (-0.722), V4 (-0.444), V6 (-0.364) and V3 (-0.361)
1085 | are much less than the mean value of V12 (0.432).
1086 | Therefore, it makes sense that principal component 2 is a contrast between the concentrations of V11,
1087 | V2, V14, V4, V6 and V3, and the concentration of V12; and that principal component 2 can separate cultivar 2 from cultivars 1 and 3.

Linear Discriminant Analysis
----------------------------

The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a
multivariate data set. For example, in the wine data set, we have 13 chemical concentrations describing wine samples from three cultivars.
By carrying out a principal component analysis, we found that most of the variation in the chemical concentrations
between the samples can be captured using the first two principal components,
where each of the principal components is a particular linear combination of the 13 chemical concentrations.

The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (the 13
chemical concentrations here) that give the best possible separation between the groups (wine cultivars here) in our
data set. Linear discriminant analysis is also known as "canonical discriminant analysis", or simply "discriminant analysis".

If we want to separate the wines by cultivar, the wines come from three different cultivars, so the number of groups (G) is 3,
and the number of variables is 13 (13 chemicals' concentrations; p = 13). The maximum number of useful discriminant
functions that can separate the wines by cultivar is the minimum of G-1 and p, and so in this case it is the minimum of 2 and 13,
which is 2. Thus, we can find at most 2 useful discriminant functions to separate the wines by cultivar, using the
13 chemical concentration variables.
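
We can check this rule with a quick calculation in R (a trivial sketch; "G" and "p" are simply the number of
groups and the number of variables here):

::

    > G <- 3      # number of groups (wine cultivars)
    > p <- 13     # number of variables (chemical concentrations)
    > min(G-1, p) # maximum number of useful discriminant functions
    [1] 2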

You can carry out a linear discriminant analysis using the "lda()" function from the R "MASS" package.
To use this function, we first need to install the "MASS" R package
(for instructions on how to install an R package, see `How to install an R package
<./installr.html#how-to-install-an-r-package>`_).

For example, to carry out a linear discriminant analysis using the 13 chemical concentrations in the wine samples, we type:

::

    > library("MASS") # load the MASS package
    > wine.lda <- lda(wine$V1 ~ wine$V2 + wine$V3 + wine$V4 + wine$V5 + wine$V6 + wine$V7 +
                      wine$V8 + wine$V9 + wine$V10 + wine$V11 + wine$V12 + wine$V13 +
                      wine$V14)

Loadings for the Discriminant Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To get the values of the loadings of the discriminant functions for the wine data, we can type:

::

    > wine.lda
    Coefficients of linear discriminants:
                      LD1           LD2
    wine$V2  -0.403399781  0.8717930699
    wine$V3   0.165254596  0.3053797325
    wine$V4  -0.369075256  2.3458497486
    wine$V5   0.154797889 -0.1463807654
    wine$V6  -0.002163496 -0.0004627565
    wine$V7   0.618052068 -0.0322128171
    wine$V8  -1.661191235 -0.4919980543
    wine$V9  -1.495818440 -1.6309537953
    wine$V10  0.134092628 -0.3070875776
    wine$V11  0.355055710  0.2532306865
    wine$V12 -0.818036073 -1.5156344987
    wine$V13 -1.157559376  0.0511839665
    wine$V14 -0.002691206  0.0028529846

This means that the first discriminant function is a linear combination of the variables:
-0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8
- 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14, where
V2, V3, ... V14 are the concentrations of the 13 chemicals found in the wine samples.
For convenience, the values of each discriminant function (eg. the first discriminant function)
are scaled so that their mean value is zero (see below).

Note that these loadings are calculated so that the within-group variance of each discriminant
function for each group (cultivar) is equal to 1, as will be demonstrated below.

These scalings are also stored in the named element "scaling" of the variable returned
by the lda() function. This element contains a matrix, in which the first column contains
the loadings for the first discriminant function, the second column contains the loadings
for the second discriminant function and so on. For example, to extract the loadings for
the first discriminant function, we can type:

::

    > wine.lda$scaling[,1]
         wine$V2      wine$V3      wine$V4      wine$V5      wine$V6      wine$V7
    -0.403399781  0.165254596 -0.369075256  0.154797889 -0.002163496  0.618052068
         wine$V8      wine$V9     wine$V10     wine$V11     wine$V12     wine$V13
    -1.661191235 -1.495818440  0.134092628  0.355055710 -0.818036073 -1.157559376
        wine$V14
    -0.002691206

To calculate the values of the first discriminant function, we can define our own function "calclda()":

::

    > calclda <- function(variables,loadings)
      {
         # make sure the variables are stored as a data frame
         variables <- as.data.frame(variables)
         # find the number of samples in the data set
         numsamples <- nrow(variables)
         # make a vector to store the discriminant function
         ld <- numeric(numsamples)
         # find the number of variables
         numvariables <- length(variables)
         # calculate the value of the discriminant function for each sample
         for (i in 1:numsamples)
         {
            valuei <- 0
            for (j in 1:numvariables)
            {
               valueij <- variables[i,j]
               loadingj <- loadings[j]
               valuei <- valuei + (valueij * loadingj)
            }
            ld[i] <- valuei
         }
         # standardise the discriminant function so that its mean value is 0:
         ld <- as.data.frame(scale(ld, center=TRUE, scale=FALSE))
         ld <- ld[[1]]
         return(ld)
      }

The function calclda() simply calculates the value of a discriminant function
for each sample in the data set. For example, for the first discriminant function, for each sample we calculate
the value using the equation -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8
- 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14. Furthermore, the "scale()"
command is used within the calclda() function in order to standardise the value of a discriminant function
(eg. the first discriminant function) so that its mean value (over all the wine samples) is 0.

We can use the function calclda() to calculate the values of the first discriminant function for each sample in our
wine data:

::

    > calclda(wine[2:14], wine.lda$scaling[,1])
     [1] -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934
     [7] -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646
    [13] -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262
    [19] -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961
    [25] -3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573
    [31] -2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996
    ...


In fact, the values of the first linear discriminant function can be calculated using the
"predict()" function in R, so we can compare those to the ones that we calculated, and they
should agree:

::

    > wine.lda.values <- predict(wine.lda, wine[2:14])
    > wine.lda.values$x[,1] # contains the values for the first discriminant function
              1           2           3           4           5           6
    -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934
              7           8           9          10          11          12
    -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646
             13          14          15          16          17          18
    -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262
             19          20          21          22          23          24
    -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961
             25          26          27          28          29          30
    -3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573
             31          32          33          34          35          36
    -2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996
    ...

We see that they do agree.

It doesn't matter whether the input variables for linear discriminant analysis are standardised or not, unlike
for principal components analysis in which it is often necessary to standardise the input variables.
However, using standardised variables in linear discriminant analysis makes it easier to interpret the loadings in
a linear discriminant function.

In linear discriminant analysis, the standardised version of an input variable is defined so that it
has mean zero and within-groups variance of 1. Thus, we can calculate the "group-standardised" variable
by subtracting the mean from each value of the variable, and dividing by the within-groups standard deviation.
To calculate the group-standardised version of a set of variables, we can use the function "groupStandardise()" below:

::

    > groupStandardise <- function(variables, groupvariable)
      {
         # make sure the variables are stored as a data frame
         variables <- as.data.frame(variables)
         # find out how many variables we have
         numvariables <- length(variables)
         # find the variable names
         variablenames <- colnames(variables)
         # calculate the group-standardised version of each variable
         for (i in 1:numvariables)
         {
            variablei <- variables[i]
            variablei_name <- variablenames[i]
            variablei_Vw <- calcWithinGroupsVariance(variablei, groupvariable)
            variablei_mean <- mean(variablei[[1]])
            variablei_new <- (variablei - variablei_mean)/(sqrt(variablei_Vw))
            data_length <- nrow(variablei)
            if (i == 1) { variables_new <- data.frame(row.names=seq(1,data_length)) }
            variables_new[variablei_name] <- variablei_new
         }
         return(variables_new)
      }

For example, we can use the "groupStandardise()" function to calculate the group-standardised versions of the
chemical concentrations in wine samples:

::

    > groupstandardisedconcentrations <- groupStandardise(wine[2:14], wine[1])

We can then use the lda() function to perform linear discriminant analysis on the group-standardised variables:

::

    > wine.lda2 <- lda(wine$V1 ~ groupstandardisedconcentrations$V2 + groupstandardisedconcentrations$V3 +
                       groupstandardisedconcentrations$V4 + groupstandardisedconcentrations$V5 +
                       groupstandardisedconcentrations$V6 + groupstandardisedconcentrations$V7 +
                       groupstandardisedconcentrations$V8 + groupstandardisedconcentrations$V9 +
                       groupstandardisedconcentrations$V10 + groupstandardisedconcentrations$V11 +
                       groupstandardisedconcentrations$V12 + groupstandardisedconcentrations$V13 +
                       groupstandardisedconcentrations$V14)
    > wine.lda2
    Coefficients of linear discriminants:
                                                LD1          LD2
    groupstandardisedconcentrations$V2  -0.20650463  0.446280119
    groupstandardisedconcentrations$V3   0.15568586  0.287697336
    groupstandardisedconcentrations$V4  -0.09486893  0.602988809
    groupstandardisedconcentrations$V5   0.43802089 -0.414203541
    groupstandardisedconcentrations$V6  -0.02907934 -0.006219863
    groupstandardisedconcentrations$V7   0.27030186 -0.014088108
    groupstandardisedconcentrations$V8  -0.87067265 -0.257868714
    groupstandardisedconcentrations$V9  -0.16325474 -0.178003512
    groupstandardisedconcentrations$V10  0.06653116 -0.152364015
    groupstandardisedconcentrations$V11  0.53670086  0.382782544
    groupstandardisedconcentrations$V12 -0.12801061 -0.237174509
    groupstandardisedconcentrations$V13 -0.46414916  0.020523349
    groupstandardisedconcentrations$V14 -0.46385409  0.491738050

It makes sense to interpret the loadings calculated using the group-standardised variables rather than the loadings for
the original (unstandardised) variables.

In the first discriminant function calculated for the group-standardised variables, the largest loadings (in absolute value)
are given to V8 (-0.871), V11 (0.537), V13 (-0.464), V14 (-0.464), and V5 (0.438). The loadings for V8, V13 and V14 are negative, while
those for V11 and V5 are positive. Therefore, the discriminant function seems to represent a contrast between the concentrations of
V8, V13 and V14, and the concentrations of V11 and V5.

We saw above that the individual variables which gave the greatest separations between the groups were V8 (separation 233.93), V14 (207.92),
V13 (189.97), V2 (135.08) and V11 (120.66). These were mostly the same variables that had the largest loadings in the linear discriminant
function (loading for V8: -0.871, for V14: -0.464, for V13: -0.464, for V11: 0.537).

We found above that variables V8 and V11 have a negative between-groups covariance (-60.41) and a positive within-groups covariance (0.29).
When the between-groups covariance and within-groups covariance for two variables have opposite signs, it indicates that a better separation
between groups can be obtained by using a linear combination of those two variables than by using either variable on its own.

Thus, given that the two variables V8 and V11 have between-groups and within-groups covariances of opposite signs, and that these are two
of the variables that gave the greatest separations between groups when used individually, it is not surprising that these are the two
variables that have the largest loadings in the first discriminant function.

Note that although the loadings for the group-standardised variables are easier to interpret than the loadings for the
unstandardised variables, the values of the discriminant function are the same regardless of whether we standardise
the input variables or not. For example, for the wine data, we can calculate the value of the first discriminant function calculated
using the unstandardised and group-standardised variables by typing:

::

    > wine.lda.values <- predict(wine.lda, wine[2:14])
    > wine.lda.values$x[,1] # values for the first discriminant function, using the unstandardised data
              1           2           3           4           5           6
    -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934
              7           8           9          10          11          12
    -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646
             13          14          15          16          17          18
    -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262
             19          20          21          22          23          24
    -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961
    ...
    > wine.lda.values2 <- predict(wine.lda2, groupstandardisedconcentrations)
    > wine.lda.values2$x[,1] # values for the first discriminant function, using the standardised data
              1           2           3           4           5           6
    -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934
              7           8           9          10          11          12
    -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646
             13          14          15          16          17          18
    -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262
             19          20          21          22          23          24
    -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961
    ...


We can see that although the loadings are different for the first discriminant functions calculated using
unstandardised and group-standardised data, the actual values of the first discriminant function are the same.

Separation Achieved by the Discriminant Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To calculate the separation achieved by each discriminant function, we first need to calculate the
value of each discriminant function, by substituting the variables' values into the linear combination for
the discriminant function (eg. -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8
- 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14 for the first discriminant function),
and then scaling the values of the discriminant function so that their mean is zero.

As mentioned above, we can do this using the "predict()" function in R. For example,
to calculate the value of the discriminant functions for the wine data, we type:

::

    > wine.lda.values <- predict(wine.lda, wine[2:14])

The returned variable has a named element "x" which is a matrix containing the linear discriminant functions:
the first column of x contains the first discriminant function, the second column of x contains the second
discriminant function, and so on (if there are more discriminant functions).

We can therefore calculate the separations achieved by the two linear discriminant functions for the wine data by using the
"calcSeparations()" function (see above), which calculates the separation as the ratio of the between-groups
variance to the within-groups variance:

::

    > calcSeparations(wine.lda.values$x,wine[1])
    [1] "variable LD1 Vw= 1 Vb= 794.652200566216 separation= 794.652200566216"
    [1] "variable LD2 Vw= 1 Vb= 361.241041493455 separation= 361.241041493455"

As mentioned above, the loadings for each discriminant function are calculated in such a way that
the within-group variance (Vw) for each group (wine cultivar here) is equal to 1, as we see in the
output from calcSeparations() above.

The output from calcSeparations() tells us that the separation achieved by the first (best) discriminant
function is 794.7, and the separation achieved by the second (second best) discriminant function is 361.2.

Therefore, the total separation is the sum of these, which is (794.652200566216+361.241041493455=1155.893)
1155.89, rounded to two decimal places. Therefore, the "percentage separation" achieved by the
first discriminant function is (794.652200566216*100/1155.893=) 68.75%, and the percentage separation achieved by the
second discriminant function is (361.241041493455*100/1155.893=) 31.25%.
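
Rather than doing this arithmetic by hand, we can compute the percentage separations directly in R
(a short sketch, using the "wine.lda" object from above; the "svd" element holds the square roots of the
separations, as discussed further below):

::

    > separations <- (wine.lda$svd)^2    # separations achieved by each discriminant function
    > 100 * separations/sum(separations) # percentage separations

This should print the same percentages (68.75% and 31.25%) as calculated by hand above.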

The "proportion of trace" that is printed when you type "wine.lda" (the variable returned by the lda() function)
is the percentage separation achieved by each discriminant function. For example, for the wine data we get the
same values as just calculated (68.75% and 31.25%):

::

    > wine.lda
    Proportion of trace:
       LD1    LD2
    0.6875 0.3125

Therefore, the first discriminant function does achieve a good separation between the three groups (three cultivars), but the second
discriminant function does improve the separation of the groups by quite a large amount, so it is worth using the
second discriminant function as well. Therefore, to achieve a good separation of the groups (cultivars),
it is necessary to use both of the first two discriminant functions.

We found above that the largest separation achieved for any of the individual variables (individual chemical concentrations)
was 233.9 for V8, which is quite a lot less than 794.7, the separation achieved by the first discriminant function. Therefore,
the effect of using more than one variable to calculate the discriminant function is that we can find a discriminant function
that achieves a far greater separation between groups than achieved by any one variable alone.

The variable returned by the lda() function also has a named element "svd", which contains the ratio of
between- and within-group standard deviations for the linear discriminant variables, that is, the square
root of the "separation" value that we calculated using calcSeparations() above. When we calculate the
square of the value stored in "svd", we should get the same value as found using calcSeparations():

::

    > (wine.lda$svd)^2
    [1] 794.6522 361.2410

A Stacked Histogram of the LDA Values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A nice way of displaying the results of a linear discriminant analysis (LDA) is to make a stacked histogram of the
values of the discriminant function for the samples from different groups (different wine cultivars in our example).

We can do this using the "ldahist()" function in R. For example, to make a stacked histogram of the first discriminant
function's values for wine samples of the three different wine cultivars, we type:

::

    > ldahist(data = wine.lda.values$x[,1], g=wine$V1)

|image8|

We can see from the histogram that cultivars 1 and 3 are well separated by the first
discriminant function, since the values for the first cultivar are between -6 and -1,
while the values for cultivar 3 are between 2 and 6, and so there is no overlap in values.

However, the separation achieved by the linear discriminant function on the training
set may be an overestimate. To get a more accurate idea of how well the first discriminant function
separates the groups, we would need to see a stacked histogram of the values for the three
cultivars using some unseen "test set", that is, using
a set of data that was not used to calculate the linear discriminant function.

We see that the first discriminant function separates cultivars 1 and 3 very well, but
does not separate cultivars 1 and 2, or cultivars 2 and 3, so well.

We therefore investigate whether the second discriminant function separates those cultivars,
by making a stacked histogram of the second discriminant function's values:

::

    > ldahist(data = wine.lda.values$x[,2], g=wine$V1)

|image9|

We see that the second discriminant function separates cultivars 1 and 2 quite well, although
there is a little overlap in their values. Furthermore, the second discriminant function also
separates cultivars 2 and 3 quite well, although again there is a little overlap in their values, so
it is not perfect.

Thus, we see that two discriminant functions are necessary to separate the cultivars, as was
discussed above (see the discussion of percentage separation above).

Scatterplots of the Discriminant Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can obtain a scatterplot of the best two discriminant functions, with the data points labelled by cultivar, by typing:

::

    > plot(wine.lda.values$x[,1],wine.lda.values$x[,2])                                # make a scatterplot
    > text(wine.lda.values$x[,1],wine.lda.values$x[,2],wine$V1,cex=0.7,pos=4,col="red") # add labels

|image10|

From the scatterplot of the first two discriminant functions, we can see that the wines from the three
cultivars are well separated in the scatterplot. The first discriminant function (x-axis)
separates cultivars 1 and 3 very well, but does not perfectly separate cultivars
1 and 2, or cultivars 2 and 3.

The second discriminant function (y-axis) achieves a fairly good separation of cultivars
1 and 2, and cultivars 2 and 3, although it is not totally perfect.

To achieve a very good separation of the three cultivars, it would be best to use both the first and second
discriminant functions together, since the first discriminant function can separate cultivars 1 and 3 very well,
and the second discriminant function can separate cultivars 1 and 2, and cultivars 2 and 3, reasonably well.

Allocation Rules and Misclassification Rate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We can calculate the mean values of the discriminant functions for each of the three cultivars using the
"printMeanAndSdByGroup()" function (see above):

::

    > printMeanAndSdByGroup(wine.lda.values$x,wine[1])
    [1] "Means:"
      V1         LD1       LD2
    1  1 -3.42248851  1.691674
    2  2 -0.07972623 -2.472656
    3  3  4.32473717  1.578120

We find that the mean value of the first discriminant function is -3.42248851 for cultivar 1, -0.07972623 for cultivar 2,
and 4.32473717 for cultivar 3. The mid-way point between the mean values for cultivars 1 and 2 is (-3.42248851-0.07972623)/2=-1.751107,
and the mid-way point between the mean values for cultivars 2 and 3 is (-0.07972623+4.32473717)/2 = 2.122505.

Therefore, we can use the following allocation rule:

* if the first discriminant function is <= -1.751107, predict the sample to be from cultivar 1
* if the first discriminant function is > -1.751107 and <= 2.122505, predict the sample to be from cultivar 2
* if the first discriminant function is > 2.122505, predict the sample to be from cultivar 3
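
The mid-way points used in this allocation rule can also be computed in R rather than by hand (a small
sketch, using the "wine.lda.values" object calculated above and R's built-in "tapply()" function):

::

    > ld1 <- wine.lda.values$x[,1]           # values of the first discriminant function
    > ld1means <- tapply(ld1, wine$V1, mean) # mean of the first discriminant function for each cultivar
    > (ld1means[1] + ld1means[2])/2          # cutoff between cultivars 1 and 2: -1.751107
    > (ld1means[2] + ld1means[3])/2          # cutoff between cultivars 2 and 3: 2.122505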

We can examine the accuracy of this allocation rule by using the "calcAllocationRuleAccuracy()" function below:

::

    > calcAllocationRuleAccuracy <- function(ldavalue, groupvariable, cutoffpoints)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # see how many of the samples from each group are classified in each group
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata <- ldavalue[groupvariable==leveli]
            for (j in 1:numlevels)
            {
               levelj <- levels[j]
               if (j == 1)
               {
                  cutoff1 <- cutoffpoints[1]
                  cutoff2 <- NA
                  results <- summary(levelidata <= cutoff1)
               }
               else if (j == numlevels)
               {
                  cutoff1 <- cutoffpoints[(numlevels-1)]
                  cutoff2 <- NA
                  results <- summary(levelidata > cutoff1)
               }
               else
               {
                  cutoff1 <- cutoffpoints[(j-1)]
                  cutoff2 <- cutoffpoints[(j)]
                  results <- summary(levelidata > cutoff1 & levelidata <= cutoff2)
               }
               trues <- results["TRUE"]
               trues <- trues[[1]]
               print(paste("Number of samples of group",leveli,"classified as group",levelj," : ",
                           trues,"(cutoffs:",cutoff1,",",cutoff2,")"))
            }
         }
      }

For example, to calculate the accuracy for the wine data based on the allocation
rule for the first discriminant function, we type:

::

    > calcAllocationRuleAccuracy(wine.lda.values$x[,1], wine[1], c(-1.751107, 2.122505))
    [1] "Number of samples of group 1 classified as group 1  :  56 (cutoffs: -1.751107 , NA )"
    [1] "Number of samples of group 1 classified as group 2  :  3 (cutoffs: -1.751107 , 2.122505 )"
    [1] "Number of samples of group 1 classified as group 3  :  NA (cutoffs: 2.122505 , NA )"
    [1] "Number of samples of group 2 classified as group 1  :  5 (cutoffs: -1.751107 , NA )"
    [1] "Number of samples of group 2 classified as group 2  :  65 (cutoffs: -1.751107 , 2.122505 )"
    [1] "Number of samples of group 2 classified as group 3  :  1 (cutoffs: 2.122505 , NA )"
    [1] "Number of samples of group 3 classified as group 1  :  NA (cutoffs: -1.751107 , NA )"
    [1] "Number of samples of group 3 classified as group 2  :  NA (cutoffs: -1.751107 , 2.122505 )"
    [1] "Number of samples of group 3 classified as group 3  :  48 (cutoffs: 2.122505 , NA )"

1701 | This can be displayed in a "confusion matrix" (an "NA" in the output above means that no
1702 | samples of that group fell into that category, that is, a count of zero):
1702 |
1703 | +------------+----------------------+----------------------+----------------------+
1704 | | | Allocated to group 1 | Allocated to group 2 | Allocated to group 3 |
1705 | +============+======================+======================+======================+
1706 | | Is group 1 | 56 | 3 | 0 |
1707 | +------------+----------------------+----------------------+----------------------+
1708 | | Is group 2 | 5 | 65 | 1 |
1709 | +------------+----------------------+----------------------+----------------------+
1710 | | Is group 3 | 0 | 0 | 48 |
1711 | +------------+----------------------+----------------------+----------------------+
1712 |
1713 | There are 3+5+1=9 wine samples that are misclassified, out of (56+3+5+65+1+48=) 178 wine samples:
1714 | 3 samples from cultivar 1 are predicted to be from cultivar 2, 5 samples from cultivar 2 are predicted
1715 | to be from cultivar 1, and 1 sample from cultivar 2 is predicted to be from cultivar 3.
1716 | Therefore, the misclassification rate is 9/178, or 5.1%. The misclassification rate is quite low,
1717 | and therefore the accuracy of the allocation rule appears to be relatively high.
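The arithmetic above can be checked directly in R. The sketch below rebuilds the confusion matrix from the counts in the table and computes the misclassification rate as the off-diagonal fraction (the object names `conf`, `misclassified` and `rate` are just illustrative):

```r
# Rebuild the confusion matrix above (rows = true cultivar,
# columns = cultivar the sample was allocated to):
conf <- matrix(c(56,  3, 0,
                  5, 65, 1,
                  0,  0, 48),
               nrow = 3, byrow = TRUE,
               dimnames = list(truth = 1:3, allocated = 1:3))
# Misclassified samples are the off-diagonal entries:
misclassified <- sum(conf) - sum(diag(conf))
misclassified          # 9
rate <- misclassified / sum(conf)
round(100 * rate, 1)   # 5.1 (per cent)
```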
1718 |
1719 | However, this is probably an underestimate of the misclassification rate, as the allocation rule was based on this data (this is
1720 | the "training set"). If we calculated the misclassification rate for a separate "test set" consisting of data other than that
1721 | used to make the allocation rule, we would probably get a higher estimate of the misclassification rate.
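When no separate test set is available, a common way to get a less optimistic estimate is leave-one-out cross-validation, which the lda() function in the MASS package supports through its "CV" argument: each sample is classified by a discriminant analysis fitted to all the other samples. The sketch below assumes the "wine" data frame loaded earlier in the booklet, with the cultivar in the first column (V1) and the thirteen chemical concentrations in columns V2-V14:

```r
library("MASS")   # for lda()

# Leave-one-out cross-validation: CV=TRUE makes lda() classify each
# sample using an LDA fitted to the remaining samples.
wine.lda.cv <- lda(wine[2:14], grouping = wine$V1, CV = TRUE)

# Cross-validated confusion matrix and misclassification rate:
cv.confusion <- table(truth = wine$V1, predicted = wine.lda.cv$class)
cv.confusion
1 - sum(diag(cv.confusion)) / sum(cv.confusion)
```

The resulting rate is usually a little higher than the training-set rate, and is a fairer guide to how the rule would perform on new samples.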
1722 |
1723 | Links and Further Reading
1724 | -------------------------
1725 |
1726 | Here are some links for further reading.
1727 |
1728 | For a more in-depth introduction to R, a good online tutorial is
1729 | available on the "Kickstarting R" website,
1730 | `cran.r-project.org/doc/contrib/Lemon-kickstart <http://cran.r-project.org/doc/contrib/Lemon-kickstart/>`_.
1731 |
1732 | There is another nice (slightly more in-depth) tutorial to R
1733 | available on the "Introduction to R" website,
1734 | `cran.r-project.org/doc/manuals/R-intro.html <http://cran.r-project.org/doc/manuals/R-intro.html>`_.
1735 |
1736 | To learn about multivariate analysis, I would highly recommend the book "Multivariate
1737 | analysis" (product code M249/03) by the Open University, available from the Open University Shop.
1739 |
1740 | There is a book available in the "Use R!" series on using R for multivariate analyses,
1741 | "An Introduction to Applied Multivariate Analysis with R"
1742 | by Everitt and Hothorn.
1743 |
1744 | Acknowledgements
1745 | ----------------
1746 |
1747 | Many of the examples in this booklet are inspired by examples in the excellent Open University book,
1748 | "Multivariate Analysis" (product code M249/03),
1749 | available from the Open University Shop.
1750 |
1751 | I am grateful to the UCI Machine Learning Repository,
1752 | http://archive.ics.uci.edu/ml, for making data sets available
1753 | which I have used in the examples in this booklet.
1754 |
1755 | Thank you to the following users for very helpful comments: to Rich O'Hara and Patrick Hausmann for pointing
1756 | out that using sd() and mean() on data frames is deprecated; to Arnau Serra-Cayuela for pointing out a typo
1757 | in the LDA section; to John Christie for suggesting a more compact form for my printMeanAndSdByGroup() function,
1758 | and to Rama Ramakrishnan for suggesting a more compact form for my mosthighlycorrelated() function.
1759 |
1760 | Contact
1761 | -------
1762 |
1763 | I will be grateful if you send me (Avril Coghlan) corrections or suggestions for improvements at
1764 | my email address alc@sanger.ac.uk.
1765 |
1766 | License
1767 | -------
1768 |
1769 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License
1770 | <http://creativecommons.org/licenses/by/3.0/>`_.
1771 |
1772 | .. |image1| image:: ../_static/image1.png
1773 | :width: 500
1774 | .. |image2| image:: ../_static/image2.png
1775 | :width: 400
1776 | .. |image4| image:: ../_static/image4.png
1777 | :width: 400
1778 | .. |image5| image:: ../_static/image5.png
1779 | :width: 400
1780 | .. |image6| image:: ../_static/image6.png
1781 | :width: 400
1782 | .. |image7| image:: ../_static/image7.png
1783 | :width: 400
1784 | .. |image8| image:: ../_static/image8.png
1785 | :width: 400
1786 | .. |image9| image:: ../_static/image9.png
1787 | :width: 400
1788 | .. |image10| image:: ../_static/image10.png
1789 | :width: 400
1790 |
--------------------------------------------------------------------------------