├── DESCRIPTION
├── LICENSE.md
├── NAMESPACE
├── NEWS.md
├── R
    ├── classifySex.R
    ├── estimateBetaParam.R
    ├── estimateBetaParamsFromCounts.R
    ├── getTransformedProps.R
    ├── ggplotColors.R
    ├── normCounts.R
    ├── normSca.R
    ├── plotCellTypeMeanVar.R
    ├── plotCellTypeProps.R
    ├── plotCellTypePropsMeanVar.R
    ├── preprocess.R
    ├── propeller.R
    ├── propeller.anova.R
    ├── propeller.ttest.R
    ├── speckle-package.R
    ├── speckle_example_data.R
    └── sysdata.rda
├── README.md
├── man
    ├── classifySex.Rd
    ├── dot-extractSCE.Rd
    ├── dot-extractSeurat.Rd
    ├── estimateBetaParam.Rd
    ├── estimateBetaParamsFromCounts.Rd
    ├── getTransformedProps.Rd
    ├── ggplotColors.Rd
    ├── normCounts.Rd
    ├── normSca.Rd
    ├── plotCellTypeMeanVar.Rd
    ├── plotCellTypeProps.Rd
    ├── plotCellTypePropsMeanVar.Rd
    ├── preprocess.Rd
    ├── propeller.Rd
    ├── propeller.anova.Rd
    ├── propeller.ttest.Rd
    ├── speckle-package.Rd
    └── speckle_example_data.Rd
└── vignettes
    └── speckle.Rmd


/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: speckle
 2 | Type: Package
 3 | Title: Statistical methods for analysing single cell RNA-seq data
 4 | Version: 0.0.3
 5 | Date: 2020-12-02
 6 | Author: Belinda Phipson
 7 | Maintainer: Belinda Phipson <Belinda.Phipson@petermac.org>
 8 | Depends: R (>= 3.6.0)
 9 | Imports: limma, edgeR, SingleCellExperiment, Seurat, statmod, ggplot2, methods, caret, scuttle, stringr, AnnotationDbi, org.Hs.eg.db, org.Mm.eg.db
10 | VignetteBuilder: knitr
11 | Suggests: BiocStyle, knitr, rmarkdown, CellBench, scater, patchwork
12 | Description: speckle contains functions for the analysis of single cell RNA-seq data.
13 | License: GPL-3
14 | biocViews: SingleCell, RNASeq, Regression, GeneExpression
15 | RoxygenNote: 7.1.1
16 | 


--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
  1 | GNU General Public License
  2 | ==========================
  3 | 
  4 | _Version 3, 29 June 2007_  
  5 | _Copyright © 2007 Free Software Foundation, Inc. &lt;<http://fsf.org/>&gt;_
  6 | 
  7 | Everyone is permitted to copy and distribute verbatim copies of this license
  8 | document, but changing it is not allowed.
  9 | 
 10 | ## Preamble
 11 | 
 12 | The GNU General Public License is a free, copyleft license for software and other
 13 | kinds of works.
 14 | 
 15 | The licenses for most software and other practical works are designed to take away
 16 | your freedom to share and change the works. By contrast, the GNU General Public
 17 | License is intended to guarantee your freedom to share and change all versions of a
 18 | program--to make sure it remains free software for all its users. We, the Free
 19 | Software Foundation, use the GNU General Public License for most of our software; it
 20 | applies also to any other work released this way by its authors. You can apply it to
 21 | your programs, too.
 22 | 
 23 | When we speak of free software, we are referring to freedom, not price. Our General
 24 | Public Licenses are designed to make sure that you have the freedom to distribute
 25 | copies of free software (and charge for them if you wish), that you receive source
 26 | code or can get it if you want it, that you can change the software or use pieces of
 27 | it in new free programs, and that you know you can do these things.
 28 | 
 29 | To protect your rights, we need to prevent others from denying you these rights or
 30 | asking you to surrender the rights. Therefore, you have certain responsibilities if
 31 | you distribute copies of the software, or if you modify it: responsibilities to
 32 | respect the freedom of others.
 33 | 
 34 | For example, if you distribute copies of such a program, whether gratis or for a fee,
 35 | you must pass on to the recipients the same freedoms that you received. You must make
 36 | sure that they, too, receive or can get the source code. And you must show them these
 37 | terms so they know their rights.
 38 | 
 39 | Developers that use the GNU GPL protect your rights with two steps: **(1)** assert
 40 | copyright on the software, and **(2)** offer you this License giving you legal permission
 41 | to copy, distribute and/or modify it.
 42 | 
 43 | For the developers' and authors' protection, the GPL clearly explains that there is
 44 | no warranty for this free software. For both users' and authors' sake, the GPL
 45 | requires that modified versions be marked as changed, so that their problems will not
 46 | be attributed erroneously to authors of previous versions.
 47 | 
 48 | Some devices are designed to deny users access to install or run modified versions of
 49 | the software inside them, although the manufacturer can do so. This is fundamentally
 50 | incompatible with the aim of protecting users' freedom to change the software. The
 51 | systematic pattern of such abuse occurs in the area of products for individuals to
 52 | use, which is precisely where it is most unacceptable. Therefore, we have designed
 53 | this version of the GPL to prohibit the practice for those products. If such problems
 54 | arise substantially in other domains, we stand ready to extend this provision to
 55 | those domains in future versions of the GPL, as needed to protect the freedom of
 56 | users.
 57 | 
 58 | Finally, every program is threatened constantly by software patents. States should
 59 | not allow patents to restrict development and use of software on general-purpose
 60 | computers, but in those that do, we wish to avoid the special danger that patents
 61 | applied to a free program could make it effectively proprietary. To prevent this, the
 62 | GPL assures that patents cannot be used to render the program non-free.
 63 | 
 64 | The precise terms and conditions for copying, distribution and modification follow.
 65 | 
 66 | ## TERMS AND CONDITIONS
 67 | 
 68 | ### 0. Definitions
 69 | 
 70 | “This License” refers to version 3 of the GNU General Public License.
 71 | 
 72 | “Copyright” also means copyright-like laws that apply to other kinds of
 73 | works, such as semiconductor masks.
 74 | 
 75 | “The Program” refers to any copyrightable work licensed under this
 76 | License. Each licensee is addressed as “you”. “Licensees” and
 77 | “recipients” may be individuals or organizations.
 78 | 
 79 | To “modify” a work means to copy from or adapt all or part of the work in
 80 | a fashion requiring copyright permission, other than the making of an exact copy. The
 81 | resulting work is called a “modified version” of the earlier work or a
 82 | work “based on” the earlier work.
 83 | 
 84 | A “covered work” means either the unmodified Program or a work based on
 85 | the Program.
 86 | 
 87 | To “propagate” a work means to do anything with it that, without
 88 | permission, would make you directly or secondarily liable for infringement under
 89 | applicable copyright law, except executing it on a computer or modifying a private
 90 | copy. Propagation includes copying, distribution (with or without modification),
 91 | making available to the public, and in some countries other activities as well.
 92 | 
 93 | To “convey” a work means any kind of propagation that enables other
 94 | parties to make or receive copies. Mere interaction with a user through a computer
 95 | network, with no transfer of a copy, is not conveying.
 96 | 
 97 | An interactive user interface displays “Appropriate Legal Notices” to the
 98 | extent that it includes a convenient and prominently visible feature that **(1)**
 99 | displays an appropriate copyright notice, and **(2)** tells the user that there is no
100 | warranty for the work (except to the extent that warranties are provided), that
101 | licensees may convey the work under this License, and how to view a copy of this
102 | License. If the interface presents a list of user commands or options, such as a
103 | menu, a prominent item in the list meets this criterion.
104 | 
105 | ### 1. Source Code
106 | 
107 | The “source code” for a work means the preferred form of the work for
108 | making modifications to it. “Object code” means any non-source form of a
109 | work.
110 | 
111 | A “Standard Interface” means an interface that either is an official
112 | standard defined by a recognized standards body, or, in the case of interfaces
113 | specified for a particular programming language, one that is widely used among
114 | developers working in that language.
115 | 
116 | The “System Libraries” of an executable work include anything, other than
117 | the work as a whole, that **(a)** is included in the normal form of packaging a Major
118 | Component, but which is not part of that Major Component, and **(b)** serves only to
119 | enable use of the work with that Major Component, or to implement a Standard
120 | Interface for which an implementation is available to the public in source code form.
121 | A “Major Component”, in this context, means a major essential component
122 | (kernel, window system, and so on) of the specific operating system (if any) on which
123 | the executable work runs, or a compiler used to produce the work, or an object code
124 | interpreter used to run it.
125 | 
126 | The “Corresponding Source” for a work in object code form means all the
127 | source code needed to generate, install, and (for an executable work) run the object
128 | code and to modify the work, including scripts to control those activities. However,
129 | it does not include the work's System Libraries, or general-purpose tools or
130 | generally available free programs which are used unmodified in performing those
131 | activities but which are not part of the work. For example, Corresponding Source
132 | includes interface definition files associated with source files for the work, and
133 | the source code for shared libraries and dynamically linked subprograms that the work
134 | is specifically designed to require, such as by intimate data communication or
135 | control flow between those subprograms and other parts of the work.
136 | 
137 | The Corresponding Source need not include anything that users can regenerate
138 | automatically from other parts of the Corresponding Source.
139 | 
140 | The Corresponding Source for a work in source code form is that same work.
141 | 
142 | ### 2. Basic Permissions
143 | 
144 | All rights granted under this License are granted for the term of copyright on the
145 | Program, and are irrevocable provided the stated conditions are met. This License
146 | explicitly affirms your unlimited permission to run the unmodified Program. The
147 | output from running a covered work is covered by this License only if the output,
148 | given its content, constitutes a covered work. This License acknowledges your rights
149 | of fair use or other equivalent, as provided by copyright law.
150 | 
151 | You may make, run and propagate covered works that you do not convey, without
152 | conditions so long as your license otherwise remains in force. You may convey covered
153 | works to others for the sole purpose of having them make modifications exclusively
154 | for you, or provide you with facilities for running those works, provided that you
155 | comply with the terms of this License in conveying all material for which you do not
156 | control copyright. Those thus making or running the covered works for you must do so
157 | exclusively on your behalf, under your direction and control, on terms that prohibit
158 | them from making any copies of your copyrighted material outside their relationship
159 | with you.
160 | 
161 | Conveying under any other circumstances is permitted solely under the conditions
162 | stated below. Sublicensing is not allowed; section 10 makes it unnecessary.
163 | 
164 | ### 3. Protecting Users' Legal Rights From Anti-Circumvention Law
165 | 
166 | No covered work shall be deemed part of an effective technological measure under any
167 | applicable law fulfilling obligations under article 11 of the WIPO copyright treaty
168 | adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention
169 | of such measures.
170 | 
171 | When you convey a covered work, you waive any legal power to forbid circumvention of
172 | technological measures to the extent such circumvention is effected by exercising
173 | rights under this License with respect to the covered work, and you disclaim any
174 | intention to limit operation or modification of the work as a means of enforcing,
175 | against the work's users, your or third parties' legal rights to forbid circumvention
176 | of technological measures.
177 | 
178 | ### 4. Conveying Verbatim Copies
179 | 
180 | You may convey verbatim copies of the Program's source code as you receive it, in any
181 | medium, provided that you conspicuously and appropriately publish on each copy an
182 | appropriate copyright notice; keep intact all notices stating that this License and
183 | any non-permissive terms added in accord with section 7 apply to the code; keep
184 | intact all notices of the absence of any warranty; and give all recipients a copy of
185 | this License along with the Program.
186 | 
187 | You may charge any price or no price for each copy that you convey, and you may offer
188 | support or warranty protection for a fee.
189 | 
190 | ### 5. Conveying Modified Source Versions
191 | 
192 | You may convey a work based on the Program, or the modifications to produce it from
193 | the Program, in the form of source code under the terms of section 4, provided that
194 | you also meet all of these conditions:
195 | 
196 | * **a)** The work must carry prominent notices stating that you modified it, and giving a
197 | relevant date.
198 | * **b)** The work must carry prominent notices stating that it is released under this
199 | License and any conditions added under section 7. This requirement modifies the
200 | requirement in section 4 to “keep intact all notices”.
201 | * **c)** You must license the entire work, as a whole, under this License to anyone who
202 | comes into possession of a copy. This License will therefore apply, along with any
203 | applicable section 7 additional terms, to the whole of the work, and all its parts,
204 | regardless of how they are packaged. This License gives no permission to license the
205 | work in any other way, but it does not invalidate such permission if you have
206 | separately received it.
207 | * **d)** If the work has interactive user interfaces, each must display Appropriate Legal
208 | Notices; however, if the Program has interactive interfaces that do not display
209 | Appropriate Legal Notices, your work need not make them do so.
210 | 
211 | A compilation of a covered work with other separate and independent works, which are
212 | not by their nature extensions of the covered work, and which are not combined with
213 | it such as to form a larger program, in or on a volume of a storage or distribution
214 | medium, is called an “aggregate” if the compilation and its resulting
215 | copyright are not used to limit the access or legal rights of the compilation's users
216 | beyond what the individual works permit. Inclusion of a covered work in an aggregate
217 | does not cause this License to apply to the other parts of the aggregate.
218 | 
219 | ### 6. Conveying Non-Source Forms
220 | 
221 | You may convey a covered work in object code form under the terms of sections 4 and
222 | 5, provided that you also convey the machine-readable Corresponding Source under the
223 | terms of this License, in one of these ways:
224 | 
225 | * **a)** Convey the object code in, or embodied in, a physical product (including a
226 | physical distribution medium), accompanied by the Corresponding Source fixed on a
227 | durable physical medium customarily used for software interchange.
228 | * **b)** Convey the object code in, or embodied in, a physical product (including a
229 | physical distribution medium), accompanied by a written offer, valid for at least
230 | three years and valid for as long as you offer spare parts or customer support for
231 | that product model, to give anyone who possesses the object code either **(1)** a copy of
232 | the Corresponding Source for all the software in the product that is covered by this
233 | License, on a durable physical medium customarily used for software interchange, for
234 | a price no more than your reasonable cost of physically performing this conveying of
235 | source, or **(2)** access to copy the Corresponding Source from a network server at no
236 | charge.
237 | * **c)** Convey individual copies of the object code with a copy of the written offer to
238 | provide the Corresponding Source. This alternative is allowed only occasionally and
239 | noncommercially, and only if you received the object code with such an offer, in
240 | accord with subsection 6b.
241 | * **d)** Convey the object code by offering access from a designated place (gratis or for
242 | a charge), and offer equivalent access to the Corresponding Source in the same way
243 | through the same place at no further charge. You need not require recipients to copy
244 | the Corresponding Source along with the object code. If the place to copy the object
245 | code is a network server, the Corresponding Source may be on a different server
246 | (operated by you or a third party) that supports equivalent copying facilities,
247 | provided you maintain clear directions next to the object code saying where to find
248 | the Corresponding Source. Regardless of what server hosts the Corresponding Source,
249 | you remain obligated to ensure that it is available for as long as needed to satisfy
250 | these requirements.
251 | * **e)** Convey the object code using peer-to-peer transmission, provided you inform
252 | other peers where the object code and Corresponding Source of the work are being
253 | offered to the general public at no charge under subsection 6d.
254 | 
255 | A separable portion of the object code, whose source code is excluded from the
256 | Corresponding Source as a System Library, need not be included in conveying the
257 | object code work.
258 | 
259 | A “User Product” is either **(1)** a “consumer product”, which
260 | means any tangible personal property which is normally used for personal, family, or
261 | household purposes, or **(2)** anything designed or sold for incorporation into a
262 | dwelling. In determining whether a product is a consumer product, doubtful cases
263 | shall be resolved in favor of coverage. For a particular product received by a
264 | particular user, “normally used” refers to a typical or common use of
265 | that class of product, regardless of the status of the particular user or of the way
266 | in which the particular user actually uses, or expects or is expected to use, the
267 | product. A product is a consumer product regardless of whether the product has
268 | substantial commercial, industrial or non-consumer uses, unless such uses represent
269 | the only significant mode of use of the product.
270 | 
271 | “Installation Information” for a User Product means any methods,
272 | procedures, authorization keys, or other information required to install and execute
273 | modified versions of a covered work in that User Product from a modified version of
274 | its Corresponding Source. The information must suffice to ensure that the continued
275 | functioning of the modified object code is in no case prevented or interfered with
276 | solely because modification has been made.
277 | 
278 | If you convey an object code work under this section in, or with, or specifically for
279 | use in, a User Product, and the conveying occurs as part of a transaction in which
280 | the right of possession and use of the User Product is transferred to the recipient
281 | in perpetuity or for a fixed term (regardless of how the transaction is
282 | characterized), the Corresponding Source conveyed under this section must be
283 | accompanied by the Installation Information. But this requirement does not apply if
284 | neither you nor any third party retains the ability to install modified object code
285 | on the User Product (for example, the work has been installed in ROM).
286 | 
287 | The requirement to provide Installation Information does not include a requirement to
288 | continue to provide support service, warranty, or updates for a work that has been
289 | modified or installed by the recipient, or for the User Product in which it has been
290 | modified or installed. Access to a network may be denied when the modification itself
291 | materially and adversely affects the operation of the network or violates the rules
292 | and protocols for communication across the network.
293 | 
294 | Corresponding Source conveyed, and Installation Information provided, in accord with
295 | this section must be in a format that is publicly documented (and with an
296 | implementation available to the public in source code form), and must require no
297 | special password or key for unpacking, reading or copying.
298 | 
299 | ### 7. Additional Terms
300 | 
301 | “Additional permissions” are terms that supplement the terms of this
302 | License by making exceptions from one or more of its conditions. Additional
303 | permissions that are applicable to the entire Program shall be treated as though they
304 | were included in this License, to the extent that they are valid under applicable
305 | law. If additional permissions apply only to part of the Program, that part may be
306 | used separately under those permissions, but the entire Program remains governed by
307 | this License without regard to the additional permissions.
308 | 
309 | When you convey a copy of a covered work, you may at your option remove any
310 | additional permissions from that copy, or from any part of it. (Additional
311 | permissions may be written to require their own removal in certain cases when you
312 | modify the work.) You may place additional permissions on material, added by you to a
313 | covered work, for which you have or can give appropriate copyright permission.
314 | 
315 | Notwithstanding any other provision of this License, for material you add to a
316 | covered work, you may (if authorized by the copyright holders of that material)
317 | supplement the terms of this License with terms:
318 | 
319 | * **a)** Disclaiming warranty or limiting liability differently from the terms of
320 | sections 15 and 16 of this License; or
321 | * **b)** Requiring preservation of specified reasonable legal notices or author
322 | attributions in that material or in the Appropriate Legal Notices displayed by works
323 | containing it; or
324 | * **c)** Prohibiting misrepresentation of the origin of that material, or requiring that
325 | modified versions of such material be marked in reasonable ways as different from the
326 | original version; or
327 | * **d)** Limiting the use for publicity purposes of names of licensors or authors of the
328 | material; or
329 | * **e)** Declining to grant rights under trademark law for use of some trade names,
330 | trademarks, or service marks; or
331 | * **f)** Requiring indemnification of licensors and authors of that material by anyone
332 | who conveys the material (or modified versions of it) with contractual assumptions of
333 | liability to the recipient, for any liability that these contractual assumptions
334 | directly impose on those licensors and authors.
335 | 
336 | All other non-permissive additional terms are considered “further
337 | restrictions” within the meaning of section 10. If the Program as you received
338 | it, or any part of it, contains a notice stating that it is governed by this License
339 | along with a term that is a further restriction, you may remove that term. If a
340 | license document contains a further restriction but permits relicensing or conveying
341 | under this License, you may add to a covered work material governed by the terms of
342 | that license document, provided that the further restriction does not survive such
343 | relicensing or conveying.
344 | 
345 | If you add terms to a covered work in accord with this section, you must place, in
346 | the relevant source files, a statement of the additional terms that apply to those
347 | files, or a notice indicating where to find the applicable terms.
348 | 
349 | Additional terms, permissive or non-permissive, may be stated in the form of a
350 | separately written license, or stated as exceptions; the above requirements apply
351 | either way.
352 | 
353 | ### 8. Termination
354 | 
355 | You may not propagate or modify a covered work except as expressly provided under
356 | this License. Any attempt otherwise to propagate or modify it is void, and will
357 | automatically terminate your rights under this License (including any patent licenses
358 | granted under the third paragraph of section 11).
359 | 
360 | However, if you cease all violation of this License, then your license from a
361 | particular copyright holder is reinstated **(a)** provisionally, unless and until the
362 | copyright holder explicitly and finally terminates your license, and **(b)** permanently,
363 | if the copyright holder fails to notify you of the violation by some reasonable means
364 | prior to 60 days after the cessation.
365 | 
366 | Moreover, your license from a particular copyright holder is reinstated permanently
367 | if the copyright holder notifies you of the violation by some reasonable means, this
368 | is the first time you have received notice of violation of this License (for any
369 | work) from that copyright holder, and you cure the violation prior to 30 days after
370 | your receipt of the notice.
371 | 
372 | Termination of your rights under this section does not terminate the licenses of
373 | parties who have received copies or rights from you under this License. If your
374 | rights have been terminated and not permanently reinstated, you do not qualify to
375 | receive new licenses for the same material under section 10.
376 | 
377 | ### 9. Acceptance Not Required for Having Copies
378 | 
379 | You are not required to accept this License in order to receive or run a copy of the
380 | Program. Ancillary propagation of a covered work occurring solely as a consequence of
381 | using peer-to-peer transmission to receive a copy likewise does not require
382 | acceptance. However, nothing other than this License grants you permission to
383 | propagate or modify any covered work. These actions infringe copyright if you do not
384 | accept this License. Therefore, by modifying or propagating a covered work, you
385 | indicate your acceptance of this License to do so.
386 | 
387 | ### 10. Automatic Licensing of Downstream Recipients
388 | 
389 | Each time you convey a covered work, the recipient automatically receives a license
390 | from the original licensors, to run, modify and propagate that work, subject to this
391 | License. You are not responsible for enforcing compliance by third parties with this
392 | License.
393 | 
394 | An “entity transaction” is a transaction transferring control of an
395 | organization, or substantially all assets of one, or subdividing an organization, or
396 | merging organizations. If propagation of a covered work results from an entity
397 | transaction, each party to that transaction who receives a copy of the work also
398 | receives whatever licenses to the work the party's predecessor in interest had or
399 | could give under the previous paragraph, plus a right to possession of the
400 | Corresponding Source of the work from the predecessor in interest, if the predecessor
401 | has it or can get it with reasonable efforts.
402 | 
403 | You may not impose any further restrictions on the exercise of the rights granted or
404 | affirmed under this License. For example, you may not impose a license fee, royalty,
405 | or other charge for exercise of rights granted under this License, and you may not
406 | initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging
407 | that any patent claim is infringed by making, using, selling, offering for sale, or
408 | importing the Program or any portion of it.
409 | 
410 | ### 11. Patents
411 | 
412 | A “contributor” is a copyright holder who authorizes use under this
413 | License of the Program or a work on which the Program is based. The work thus
414 | licensed is called the contributor's “contributor version”.
415 | 
416 | A contributor's “essential patent claims” are all patent claims owned or
417 | controlled by the contributor, whether already acquired or hereafter acquired, that
418 | would be infringed by some manner, permitted by this License, of making, using, or
419 | selling its contributor version, but do not include claims that would be infringed
420 | only as a consequence of further modification of the contributor version. For
421 | purposes of this definition, “control” includes the right to grant patent
422 | sublicenses in a manner consistent with the requirements of this License.
423 | 
424 | Each contributor grants you a non-exclusive, worldwide, royalty-free patent license
425 | under the contributor's essential patent claims, to make, use, sell, offer for sale,
426 | import and otherwise run, modify and propagate the contents of its contributor
427 | version.
428 | 
429 | In the following three paragraphs, a “patent license” is any express
430 | agreement or commitment, however denominated, not to enforce a patent (such as an
431 | express permission to practice a patent or covenant not to sue for patent
432 | infringement). To “grant” such a patent license to a party means to make
433 | such an agreement or commitment not to enforce a patent against the party.
434 | 
435 | If you convey a covered work, knowingly relying on a patent license, and the
436 | Corresponding Source of the work is not available for anyone to copy, free of charge
437 | and under the terms of this License, through a publicly available network server or
438 | other readily accessible means, then you must either **(1)** cause the Corresponding
439 | Source to be so available, or **(2)** arrange to deprive yourself of the benefit of the
440 | patent license for this particular work, or **(3)** arrange, in a manner consistent with
441 | the requirements of this License, to extend the patent license to downstream
442 | recipients. “Knowingly relying” means you have actual knowledge that, but
443 | for the patent license, your conveying the covered work in a country, or your
444 | recipient's use of the covered work in a country, would infringe one or more
445 | identifiable patents in that country that you have reason to believe are valid.
446 | 
447 | If, pursuant to or in connection with a single transaction or arrangement, you
448 | convey, or propagate by procuring conveyance of, a covered work, and grant a patent
449 | license to some of the parties receiving the covered work authorizing them to use,
450 | propagate, modify or convey a specific copy of the covered work, then the patent
451 | license you grant is automatically extended to all recipients of the covered work and
452 | works based on it.
453 | 
454 | A patent license is “discriminatory” if it does not include within the
455 | scope of its coverage, prohibits the exercise of, or is conditioned on the
456 | non-exercise of one or more of the rights that are specifically granted under this
457 | License. You may not convey a covered work if you are a party to an arrangement with
458 | a third party that is in the business of distributing software, under which you make
459 | payment to the third party based on the extent of your activity of conveying the
460 | work, and under which the third party grants, to any of the parties who would receive
461 | the covered work from you, a discriminatory patent license **(a)** in connection with
462 | copies of the covered work conveyed by you (or copies made from those copies), or **(b)**
463 | primarily for and in connection with specific products or compilations that contain
464 | the covered work, unless you entered into that arrangement, or that patent license
465 | was granted, prior to 28 March 2007.
466 | 
467 | Nothing in this License shall be construed as excluding or limiting any implied
468 | license or other defenses to infringement that may otherwise be available to you
469 | under applicable patent law.
470 | 
471 | ### 12. No Surrender of Others' Freedom
472 | 
473 | If conditions are imposed on you (whether by court order, agreement or otherwise)
474 | that contradict the conditions of this License, they do not excuse you from the
475 | conditions of this License. If you cannot convey a covered work so as to satisfy
476 | simultaneously your obligations under this License and any other pertinent
477 | obligations, then as a consequence you may not convey it at all. For example, if you
478 | agree to terms that obligate you to collect a royalty for further conveying from
479 | those to whom you convey the Program, the only way you could satisfy both those terms
480 | and this License would be to refrain entirely from conveying the Program.
481 | 
482 | ### 13. Use with the GNU Affero General Public License
483 | 
484 | Notwithstanding any other provision of this License, you have permission to link or
485 | combine any covered work with a work licensed under version 3 of the GNU Affero
486 | General Public License into a single combined work, and to convey the resulting work.
487 | The terms of this License will continue to apply to the part which is the covered
488 | work, but the special requirements of the GNU Affero General Public License, section
489 | 13, concerning interaction through a network will apply to the combination as such.
490 | 
491 | ### 14. Revised Versions of this License
492 | 
493 | The Free Software Foundation may publish revised and/or new versions of the GNU
494 | General Public License from time to time. Such new versions will be similar in spirit
495 | to the present version, but may differ in detail to address new problems or concerns.
496 | 
497 | Each version is given a distinguishing version number. If the Program specifies that
498 | a certain numbered version of the GNU General Public License “or any later
499 | version” applies to it, you have the option of following the terms and
500 | conditions either of that numbered version or of any later version published by the
501 | Free Software Foundation. If the Program does not specify a version number of the GNU
502 | General Public License, you may choose any version ever published by the Free
503 | Software Foundation.
504 | 
505 | If the Program specifies that a proxy can decide which future versions of the GNU
506 | General Public License can be used, that proxy's public statement of acceptance of a
507 | version permanently authorizes you to choose that version for the Program.
508 | 
509 | Later license versions may give you additional or different permissions. However, no
510 | additional obligations are imposed on any author or copyright holder as a result of
511 | your choosing to follow a later version.
512 | 
513 | ### 15. Disclaimer of Warranty
514 | 
515 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
516 | EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
517 | PROVIDE THE PROGRAM “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER
518 | EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
519 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE
520 | QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
521 | DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
522 | 
523 | ### 16. Limitation of Liability
524 | 
525 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY
526 | COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS
527 | PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL,
528 | INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
529 | PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE
530 | OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE
531 | WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
532 | POSSIBILITY OF SUCH DAMAGES.
533 | 
534 | ### 17. Interpretation of Sections 15 and 16
535 | 
536 | If the disclaimer of warranty and limitation of liability provided above cannot be
537 | given local legal effect according to their terms, reviewing courts shall apply local
538 | law that most closely approximates an absolute waiver of all civil liability in
539 | connection with the Program, unless a warranty or assumption of liability accompanies
540 | a copy of the Program in return for a fee.
541 | 
542 | _END OF TERMS AND CONDITIONS_
543 | 
544 | ## How to Apply These Terms to Your New Programs
545 | 
546 | If you develop a new program, and you want it to be of the greatest possible use to
547 | the public, the best way to achieve this is to make it free software which everyone
548 | can redistribute and change under these terms.
549 | 
550 | To do so, attach the following notices to the program. It is safest to attach them
551 | to the start of each source file to most effectively state the exclusion of warranty;
552 | and each file should have at least the “copyright” line and a pointer to
553 | where the full notice is found.
554 | 
555 |     <one line to give the program's name and a brief idea of what it does.>
556 |     Copyright (C) 2020 Belinda Phipson
557 | 
558 |     This program is free software: you can redistribute it and/or modify
559 |     it under the terms of the GNU General Public License as published by
560 |     the Free Software Foundation, either version 3 of the License, or
561 |     (at your option) any later version.
562 | 
563 |     This program is distributed in the hope that it will be useful,
564 |     but WITHOUT ANY WARRANTY; without even the implied warranty of
565 |     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
566 |     GNU General Public License for more details.
567 | 
568 |     You should have received a copy of the GNU General Public License
569 |     along with this program.  If not, see <http://www.gnu.org/licenses/>.
570 | 
571 | Also add information on how to contact you by electronic and paper mail.
572 | 
573 | If the program does terminal interaction, make it output a short notice like this
574 | when it starts in an interactive mode:
575 | 
576 |     speckle Copyright (C) 2020 Belinda Phipson
577 |     This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'.
578 |     This is free software, and you are welcome to redistribute it
579 |     under certain conditions; type 'show c' for details.
580 | 
581 | The hypothetical commands `show w` and `show c` should show the appropriate parts of
582 | the General Public License. Of course, your program's commands might be different;
583 | for a GUI interface, you would use an “about box”.
584 | 
585 | You should also get your employer (if you work as a programmer) or school, if any, to
586 | sign a “copyright disclaimer” for the program, if necessary. For more
587 | information on this, and how to apply and follow the GNU GPL, see
588 | &lt;<http://www.gnu.org/licenses/>&gt;.
589 | 
590 | The GNU General Public License does not permit incorporating your program into
591 | proprietary programs. If your program is a subroutine library, you may consider it
592 | more useful to permit linking proprietary applications with the library. If this is
593 | what you want to do, use the GNU Lesser General Public License instead of this
594 | License. But first, please read
595 | &lt;<http://www.gnu.org/philosophy/why-not-lgpl.html>&gt;.
596 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | export(classifySex)
 4 | export(estimateBetaParam)
 5 | export(estimateBetaParamsFromCounts)
 6 | export(getTransformedProps)
 7 | export(ggplotColors)
 8 | export(normCounts)
 9 | export(normSca)
10 | export(plotCellTypeMeanVar)
11 | export(plotCellTypeProps)
12 | export(plotCellTypePropsMeanVar)
13 | export(preprocess)
14 | export(propeller)
15 | export(propeller.anova)
16 | export(propeller.ttest)
17 | export(speckle_example_data)
18 | importFrom(AnnotationDbi,select)
19 | importFrom(Seurat,Idents)
20 | importFrom(SingleCellExperiment,colData)
21 | importFrom(edgeR,DGEList)
22 | importFrom(edgeR,estimateDisp)
23 | importFrom(ggplot2,aes)
24 | importFrom(ggplot2,element_text)
25 | importFrom(ggplot2,geom_bar)
26 | importFrom(ggplot2,ggplot)
27 | importFrom(ggplot2,theme)
28 | importFrom(grDevices,hcl)
29 | importFrom(graphics,legend)
30 | importFrom(graphics,lines)
31 | importFrom(graphics,par)
32 | importFrom(graphics,title)
33 | importFrom(limma,contrasts.fit)
34 | importFrom(limma,eBayes)
35 | importFrom(limma,lmFit)
36 | importFrom(methods,is)
37 | importFrom(org.Hs.eg.db,org.Hs.eg.db)
38 | importFrom(org.Mm.eg.db,org.Mm.eg.db)
39 | importFrom(scuttle,perCellQCMetrics)
40 | importFrom(scuttle,quickPerCellQC)
41 | importFrom(stats,lowess)
42 | importFrom(stats,median)
43 | importFrom(stats,p.adjust)
44 | importFrom(stats,predict)
45 | importFrom(stats,var)
46 | importFrom(stringr,str_to_title)
47 | 


--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
 1 | # speckle 0.0.3
 2 | * Added functions to classify cells as male or female
 3 | * Changed the propeller transform default to logit
 4 | 
 5 | # speckle 0.0.2
 6 | * Added functions to plot the mean variance relationship of the cell type
 7 | counts and proportions
 8 | * Added functions to estimate parameters of a Beta distribution 
 9 | * Added logit transformation option to propeller
10 | 
11 | # speckle 0.0.1
12 | 
13 | * First version of the speckle package contains propeller functions to test for
14 | differences in cell type composition between groups of samples in single cell
15 | RNA-Seq data
16 | 


--------------------------------------------------------------------------------
/R/classifySex.R:
--------------------------------------------------------------------------------
  1 | #' Predict sex of cells in scRNA-seq data
  2 | #' 
  3 | #' This function will predict the sex for each cell in scRNA-seq data. The 
  4 | #' classifier is based on logistic regression models that have been trained
  5 | #' on mouse and human single cell RNA-seq data.
  6 | #'
  7 | #' For bulk RNA-seq, checking the sex of the samples for mouse and human 
  8 | #' experiments is trivial as we can simply check the expression of *Xist/XIST*. 
  9 | #' It is not as simple for single cell RNA-seq data as the number of counts 
 10 | #' measured per gene and per cell is often quite low. Simply relying on cut-offs
 11 | #' on the expression of genes like *Xist* means that many cells are unable to 
 12 | #' be classified. Hence we have developed a classifier based on a combination of
 13 | #' X- and Y-linked genes in order to accurately predict the sex of each cell.
 14 | #' 
 15 | #' Cells with zero counts on Xist and the sum of the Y chromosome genes will 
 16 | #' not be classified as there is simply not enough information to accurately 
 17 | #' classify as Male/Female, and NAs will be returned. In addition, the user has 
 18 | #' the option to perform quality control on the data first, by specifying 
 19 | #' \code{qc=TRUE}, which will not classify cells that are deemed low-quality.
 20 | #' 
 21 | #' @aliases classifySex
 22 | #' @param x counts matrix, rows correspond to genes and columns correspond to 
 23 | #' cells. Row names must be gene symbols. 
 24 | #' @param genome the genome the data arises from. Current options are 
 25 | #' human: genome = "Hs" or mouse: genome = "Mm".
 26 | #' @param qc logical, indicates whether to perform quality control or not. 
 27 | #' qc = TRUE will predict cells that pass quality control only and the filtered 
 28 | #' cells will not be classified. qc = FALSE will predict every cell except the 
 29 | #' cells with zero counts on *XIST/Xist* and the sum of the Y genes. Default is TRUE.
 30 | #' 
 31 | #' @return a dataframe with predicted labels for each cell
 32 | #' 
 33 | #' @importFrom stats predict  
 34 | #' @export classifySex
 35 | #' 
 36 | #' @author Xinyi Jin
 37 | #' 
 38 | #' @examples 
 39 | #' 
 40 | #' library(speckle)
 41 | #' library(SingleCellExperiment)
 42 | #' library(CellBench)
 43 | #' library(org.Hs.eg.db)
 44 | #'
 45 | #' sc_data <- load_sc_data()
 46 | #' sc_10x <- sc_data$sc_10x
 47 | #'
 48 | #' counts <- counts(sc_10x)
 49 | #' ann <- select(org.Hs.eg.db, keys=rownames(sc_10x),
 50 | #'              columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL")
 51 | #' m <- match(rownames(counts), ann$ENSEMBL)
 52 | #' rownames(counts) <- ann$SYMBOL[m]
 53 | #'
 54 | #' sex <- classifySex(counts, genome="Hs")
 55 | #' 
 56 | #' table(sex$prediction)
 57 | #' boxplot(counts["XIST",]~sex$prediction)
 58 | #'    
 59 | classifySex<-function(x, genome=NULL, qc = TRUE)
 60 | #    Classify cells as male or female
 61 | #    Xinyi Jin and Belinda Phipson
 62 | #    11 February 2021
 63 | #    Modified 11 February 2021
 64 | {
 65 |     # Perform some checks on the data
 66 |     if(is.null(x)) stop("Counts matrix missing")
 67 |     x <- as.matrix(x)
 68 |     if(is.null(genome)){
 69 |         message("Genome not specified. Human genome used. Options are 'Hs' for ",
 70 |         "human and 'Mm' for mouse. We currently don't support other genomes.")
 71 |     }
 72 |     # Default is Hs
 73 |     genome <- match.arg(genome,c("Hs","Mm"))
 74 |     
 75 |     # pre-process 
 76 |     processed.data<-preprocess(x, genome = genome, qc = qc)
 77 |   
 78 |     # the processed transposed count matrix 
 79 |     tcm <-processed.data$tcm.final
 80 |   
 81 |     # the normalised, scaled transposed count matrix 
 82 |     data.df <- processed.data$data.df
 83 |   
 84 |     # cells that were filtered out by QC
 85 |     discarded.cells <- processed.data$discarded.cells
 86 |   
 87 |     # cells with zero count on XIST and superY.all
 88 |     zero.cells <- processed.data$zero.cells
 89 |   
 90 |     # store the final predictions 
 91 |     final.pred<-data.frame(prediction=rep("NA", ncol(x)), stringsAsFactors=FALSE)
 92 |     row.names(final.pred)<- colnames(x)
 93 |   
 94 |     # load trained models 
 95 |     if(genome == "Mm"){
 96 |       model <- Mm.model
 97 |     }
 98 |     else{
 99 |       model <- Hs.model
100 |     }
101 | 
102 |     preds <- predict(model, newdata = data.df)
103 |     final.pred[row.names(data.df), "prediction"]<- as.character(preds)
104 |   
105 |     final.pred
106 | }
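
The way `classifySex` fills in its result can be sketched in base R. This is a minimal illustration rather than the package code path: the cell names and the `preds` vector are made up, standing in for the output of `predict()` on the cells that survive preprocessing. As in the source above, the placeholder is the character string "NA", not a missing value.

```r
# Hypothetical cell names; in classifySex these come from colnames(x)
cells <- paste0("Cell", 1:5)

# Every cell starts with the character placeholder "NA"
final.pred <- data.frame(prediction = rep("NA", length(cells)),
                         stringsAsFactors = FALSE)
row.names(final.pred) <- cells

# Made-up predictions for the cells that passed filtering
preds <- c(Cell1 = "Female", Cell3 = "Male", Cell4 = "Female")

# Fill in by row name; unclassified cells keep the placeholder
final.pred[names(preds), "prediction"] <- as.character(preds)
final.pred

# The placeholder is a string, so compare with == "NA" rather than is.na()
unclassified <- row.names(final.pred)[final.pred$prediction == "NA"]
unclassified
```

Under this sketch, downstream code that wants true missing values would need to convert the "NA" strings explicitly.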


--------------------------------------------------------------------------------
/R/estimateBetaParam.R:
--------------------------------------------------------------------------------
 1 | #' Estimate the parameters of a Beta distribution
 2 | #'
 3 | #' This function estimates the two parameters of the Beta distribution, alpha
 4 | #' and beta, given a vector of proportions. It uses the method of moments to
 5 | #' do this.
 6 | #'
 7 | #' @param x a vector of proportions.
 8 | #'
 9 | #' @return a list object with the estimate of alpha in \code{a} and beta in
10 | #' \code{b}.
11 | #'
12 | #' @export estimateBetaParam
13 | #' @importFrom stats var
14 | #'
15 | #' @author Belinda Phipson
16 | #'
17 | #' @examples
18 | #' # Generate proportions from a beta distribution
19 | #' props <- rbeta(1000, shape1=2, shape2=10)
20 | #' estimateBetaParam(props)
21 | #'
22 | estimateBetaParam <- function(x){
23 |   # solve for the hyperparameters of the beta distribution given a vector
24 |   # of proportions
25 |   mu <- mean(x)
26 |   V <- var(x)
27 |   a <- ((1-mu)/V - 1/mu)*mu^2
28 |   b <- ((1-mu)/V - 1/mu)*mu*(1-mu)
29 |   list(a=a,b=b)
30 | }
31 | 
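
As a sanity check on the method-of-moments formulas in `estimateBetaParam`, the sketch below plugs the population mean and variance of a Beta distribution into the same expressions and recovers the original parameters. The values `a = 2` and `b = 10` are arbitrary choices for illustration, and the simulated part depends on the random seed.

```r
# Population moments of a Beta(a, b) distribution
a <- 2
b <- 10
mu <- a / (a + b)
V  <- a * b / ((a + b)^2 * (a + b + 1))

# Method-of-moments estimators, written as in estimateBetaParam
a.hat <- ((1 - mu) / V - 1 / mu) * mu^2
b.hat <- ((1 - mu) / V - 1 / mu) * mu * (1 - mu)
c(a = a.hat, b = b.hat)

# The same estimators applied to simulated proportions
set.seed(1)
x <- rbeta(10000, shape1 = a, shape2 = b)
m <- mean(x)
v <- var(x)
c(a = ((1 - m) / v - 1 / m) * m^2,
  b = ((1 - m) / v - 1 / m) * m * (1 - m))
```

With exact moments the parameters are recovered exactly; with simulated data the estimates should only be close to the true values.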


--------------------------------------------------------------------------------
/R/estimateBetaParamsFromCounts.R:
--------------------------------------------------------------------------------
 1 | #' Estimate parameters of a Beta distribution from counts
 2 | #' 
 3 | #' This function estimates the two parameters of the Beta distribution, alpha
 4 | #' and beta for each cell type. The input is a matrix of cell type counts, 
 5 | #' where the rows are the cell types/clusters and the columns are the samples.
 6 | #' 
 7 | #' This function is called from the plotting function \code{plotCellTypeMeanVar}
 8 | #' in order to estimate the variance for the Beta-Binomial distribution for 
 9 | #' each cell type.
10 | #' 
11 | #' @param x a matrix of counts
12 | #'
13 | #' @return outputs a list object with the following components
14 | #' \item{n }{Normalised library size}
15 | #' \item{alpha }{a vector of alpha parameters for the Beta distribution for 
16 | #' each cell type}
17 | #' \item{beta }{vector of beta parameters for the Beta distribution for 
18 | #' each cell type}
19 | #' \item{pi }{Estimate of the true proportion for each cell type}
20 | #' \item{dispersion }{Dispersion estimates for each cell type}
21 | #' \item{var }{Variance estimates for each cell type}
22 | #' 
23 | #' @export
24 | #' 
25 | #' @author Belinda Phipson
26 | #'
27 | #' @examples
28 | #' data <- speckle_example_data()
29 | #' x <- table(data$clusters, data$samples)
30 | #' estimateBetaParamsFromCounts(x)
31 | #' 
32 | #' 
33 | estimateBetaParamsFromCounts <- function(x){
34 |   # Make sure input is a matrix
35 |   counts <- as.matrix(x)
36 |   # Normalise the counts so that the total number of counts per sample is equal
37 |   nc <- normCounts(counts)
38 |   # Get cell type means
39 |   m1 <- rowMeans(nc)
40 |   # Get the second raw moment (mean of squared normalised counts) per cell type
41 |   m2 <- rowSums(nc^2)/ncol(nc)
42 |   n <- mean(colSums(nc))
43 |   alpha <- (n*m1-m2)/(n*(m2/m1-m1-1)+m1)
44 |   beta <- ((n-m1)*(n-m2/m1))/(n*(m2/m1-m1-1)+m1)
45 |   disp <- 1/(alpha+beta)
46 |   pi <- alpha/(alpha+beta)
47 |   var <- n*pi*(1-pi)*(n*disp+1)/(1+disp)
48 |   output <- list(n=n, alpha=alpha, beta=beta, pi=pi, dispersion=disp, var=var)
49 |   output
50 | }
51 | 
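
The variance returned by `estimateBetaParamsFromCounts` can be compared with the usual closed form of the Beta-Binomial variance. The sketch below uses made-up parameter values, and the names `beta.par` and `prop` are illustrative stand-ins for `beta` and `pi` in the function, chosen to avoid masking base R objects.

```r
# Hypothetical Beta parameters and number of cells per sample
alpha    <- 4
beta.par <- 36
n        <- 1000

# Quantities as defined in estimateBetaParamsFromCounts
disp <- 1 / (alpha + beta.par)
prop <- alpha / (alpha + beta.par)
var.bb <- n * prop * (1 - prop) * (n * disp + 1) / (1 + disp)

# Closed-form Beta-Binomial variance for comparison
var.closed <- n * alpha * beta.par * (alpha + beta.par + n) /
  ((alpha + beta.par)^2 * (alpha + beta.par + 1))

# Binomial variance with the same mean proportion
var.binom <- n * prop * (1 - prop)

c(beta.binomial = var.bb, closed.form = var.closed, binomial = var.binom)
```

The two Beta-Binomial expressions agree, and both exceed the binomial variance whenever `n > 1`, which is the overdispersion that the plotting functions visualise.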


--------------------------------------------------------------------------------
/R/getTransformedProps.R:
--------------------------------------------------------------------------------
 1 | #' Calculates and transforms cell type proportions
 2 | #'
 3 | #' Calculates cell types proportions based on clusters/cell types and sample
 4 | #' information and performs a variance stabilising transformation on the
 5 | #' proportions.
 6 | #'
 7 | #' This function is called by the \code{propeller} function. It calculates cell
 8 | #' type proportions and applies a logit or arcsin-square root transformation.
 9 | #'
10 | #' @param clusters a factor specifying the cluster or cell type for every cell.
11 | #' @param sample a factor specifying the biological replicate for every cell.
12 | #' @param transform a character scalar specifying which transformation of the 
13 | #' proportions to perform. Possible values are "asin" or "logit". Defaults
14 | #' to "logit".
15 | #' 
16 | #' @return outputs a list object with the following components
17 | #' \item{Counts }{A matrix of cell type counts with
18 | #' the rows corresponding to the clusters/cell types and the columns
19 | #' corresponding to the biological replicates/samples.}
20 | #' \item{TransformedProps }{A matrix of transformed cell type proportions with
21 | #' the rows corresponding to the clusters/cell types and the columns
22 | #' corresponding to the biological replicates/samples.} 
23 | #' \item{Proportions }{A  matrix of cell type proportions with the rows 
24 | #' corresponding to the clusters/cell types and the columns corresponding to 
25 | #' the biological replicates/samples.}
26 | #'
27 | #' @export
28 | #'
29 | #' @author Belinda Phipson
30 | #'
31 | #' @seealso \code{\link{propeller}}
32 | #'
33 | #' @examples
34 | #'
35 | #'   library(speckle)
36 | #'   library(ggplot2)
37 | #'   library(limma)
38 | #'
39 | #'   # Make up some data
40 | #'
41 | #'   # True cell type proportions for 4 samples
42 | #'   p_s1 <- c(0.5,0.3,0.2)
43 | #'   p_s2 <- c(0.6,0.3,0.1)
44 | #'   p_s3 <- c(0.3,0.4,0.3)
45 | #'   p_s4 <- c(0.4,0.3,0.3)
46 | #'
47 | #'   # Total numbers of cells per sample
48 | #'   numcells <- c(1000,1500,900,1200)
49 | #'
50 | #'   # Generate cell-level vector for sample info
51 | #'   biorep <- rep(c("s1","s2","s3","s4"),numcells)
52 | #'   length(biorep)
53 | #'
54 | #'   # Numbers of cells for each of 3 clusters per sample
55 | #'   n_s1 <- p_s1*numcells[1]
56 | #'   n_s2 <- p_s2*numcells[2]
57 | #'   n_s3 <- p_s3*numcells[3]
58 | #'   n_s4 <- p_s4*numcells[4]
59 | #'
60 | #'   cl_s1 <- rep(c("c0","c1","c2"),n_s1)
61 | #'   cl_s2 <- rep(c("c0","c1","c2"),n_s2)
62 | #'   cl_s3 <- rep(c("c0","c1","c2"),n_s3)
63 | #'   cl_s4 <- rep(c("c0","c1","c2"),n_s4)
64 | #'
65 | #'   # Generate cell-level vector for cluster info
66 | #'   clust <- c(cl_s1,cl_s2,cl_s3,cl_s4)
67 | #'   length(clust)
68 | #'
69 | #'   getTransformedProps(clusters = clust, sample = biorep)
70 | #'
71 | getTransformedProps <- function(clusters=clusters, sample=sample, 
72 |                                 transform=NULL)
73 | {
74 |   if(is.null(transform)) transform <- "logit"
75 |   
76 |   tab <- table(sample, clusters)
77 |   props <- tab/rowSums(tab)
78 |   if(transform=="asin"){
79 |       message("Performing arcsin square root transformation of proportions")
80 |       prop.trans <- asin(sqrt(props))
81 |   }
82 |   else if(transform=="logit"){
83 |       message("Performing logit transformation of proportions")
84 |       props.pseudo <- (tab+0.5)/rowSums(tab+0.5)
85 |       prop.trans <- log(props.pseudo/(1-props.pseudo))
86 |   }
87 |   list(Counts=t(tab), TransformedProps=t(prop.trans), Proportions=t(props))
88 | }
89 | 
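
The two transformations in `getTransformedProps` can be compared on a small table in base R. The counts below are invented for illustration. The zero count shows why the logit branch adds a pseudo-count of 0.5 before taking proportions.

```r
# Hypothetical counts: 2 samples (rows) by 3 clusters (columns);
# cluster c2 has no cells in sample s2
tab <- matrix(c(500, 300, 200,
                900, 600,   0),
              nrow = 2, byrow = TRUE,
              dimnames = list(sample = c("s1", "s2"),
                              clusters = c("c0", "c1", "c2")))

# Proportions within each sample
props <- tab / rowSums(tab)

# Arcsin square root transformation
prop.asin <- asin(sqrt(props))

# Logit transformation with a 0.5 pseudo-count, as in getTransformedProps
props.pseudo <- (tab + 0.5) / rowSums(tab + 0.5)
prop.logit <- log(props.pseudo / (1 - props.pseudo))

# Without the pseudo-count the zero proportion maps to -Inf
naive.logit <- log(props / (1 - props))

# Transpose so rows are clusters and columns are samples
t(prop.asin)
t(prop.logit)
t(naive.logit)
```

The arcsin square root transformation handles zero proportions directly, whereas the logit needs the pseudo-count to stay finite.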


--------------------------------------------------------------------------------
/R/ggplotColors.R:
--------------------------------------------------------------------------------
 1 | #' Output a vector of colours based on the ggplot colour scheme
 2 | #'
 3 | #' This function takes as input the number of colours the user would like, and
 4 | #' outputs a vector of colours in the ggplot colour scheme.
 5 | #'
 6 | #' @param g the number of colours to be generated.
 7 | #'
 8 | #' @return a vector with the names of the colours.
 9 | #' @export ggplotColors
10 | #' 
11 | #' @importFrom grDevices hcl
12 | #'
13 | #' @author Belinda Phipson
14 | #'
15 | #' @examples
16 | #' # Generate a palette of 6 colours
17 | #' cols <- ggplotColors(6)
18 | #' cols
19 | #'
20 | #' # Generate some count data
21 | #' y <- matrix(rnbinom(600, mu=100, size=1), ncol=6)
22 | #'
23 | #' par(mfrow=c(1,1))
24 | #' boxplot(y, col=cols)
25 | #'
26 | ggplotColors <- function(g){
27 | 
28 |   d <- 360/g
29 | 
30 |   h <- cumsum(c(15, rep(d,g - 1)))
31 | 
32 |   hcl(h = h, c = 100, l = 65)
33 | 
34 | }
35 | 
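
To make the hue spacing in `ggplotColors` concrete, the sketch below repeats the calculation for four colours. The choice of `g = 4` is arbitrary. The hues are spaced evenly around the colour wheel, starting at 15 degrees.

```r
g <- 4
d <- 360 / g                        # spacing between hues, in degrees
h <- cumsum(c(15, rep(d, g - 1)))   # first hue is at 15 degrees
h

# Convert the hues to hex colour strings at fixed chroma and luminance
cols <- grDevices::hcl(h = h, c = 100, l = 65)
cols
```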


--------------------------------------------------------------------------------
/R/normCounts.R:
--------------------------------------------------------------------------------
 1 | #' Normalise a counts matrix to the median library size
 2 | #'
 3 | #' This function takes a \code{DGEList} object or matrix of counts and
 4 | #' normalises the counts to the median library size. This puts the normalised
 5 | #' counts on a similar scale to the original counts.
 6 | #'
 7 | #' If the input is a DGEList object, the normalisation factors in
 8 | #' \code{norm.factors} are taken into account in the normalisation. The prior
 9 | #' counts are added proportionally to the library size.
10 | #'
11 | #' @param x a \code{DGEList} object or matrix of counts.
12 | #' @param log logical, indicates whether the output should be on the log2 scale
13 | #' or counts scale. Default is FALSE.
14 | #' @param prior.count The prior count to add if the data is log2 normalised.
15 | #' Default is a small count of 0.5.
16 | #' @param lib.size a vector of library sizes to be used during the normalisation
17 | #' step. Default is NULL and will be computed from the counts matrix.
18 | #'
19 | #' @return a matrix of normalised counts
20 | #' 
21 | #' @export normCounts
22 | #' @importFrom stats median
23 | #' @author Belinda Phipson
24 | #'
25 | #' @examples
26 | #' # Simulate some data from a negative binomial distribution with mean equal
27 | #' # to 100 and dispersion set to 1. Simulate 1000 genes and 6 samples.
28 | #' y <- matrix(rnbinom(6000, mu = 100, size = 1), ncol = 6)
29 | #'
30 | #' # Normalise the counts
31 | #' norm.y <- normCounts(y)
32 | #'
33 | #' # Return log2 normalised counts
34 | #' lnorm.y <- normCounts(y, log=TRUE)
35 | #'
36 | #' # Return log2 normalised counts with prior.count = 2
37 | #' lnorm.y2 <- normCounts(y, log=TRUE, prior.count=2)
38 | #'
39 | #' par(mfrow=c(1,2))
40 | #' boxplot(norm.y, main="Normalised counts")
41 | #' boxplot(lnorm.y, main="Log2-normalised counts")
42 | #'
43 | normCounts <-function(x, log=FALSE, prior.count=0.5, lib.size=NULL)
44 |   # Function to normalise to median library size instead of counts per million
45 |   # Input is DGEList object or matrix
46 |   # Belinda Phipson
47 |   # 30 November 2015
48 | {
49 |   if(inherits(x, "DGEList")){
50 |     lib.size <- x$samples$lib.size*x$samples$norm.factors
51 |     counts <- x$counts
52 |   }
53 |   else{
54 |     counts <- as.matrix(x)
55 |     if(is.null(lib.size)){
56 |       lib.size <- colSums(counts)
57 |     }
58 |     else{
59 |       if(length(lib.size)==ncol(x))
60 |         lib.size <- as.vector(lib.size)
61 |       else{
62 |         message("Vector of library sizes does not match dimensions of input ",
63 |                 "data. Calculating library sizes from the counts matrix.")
64 |         lib.size <- colSums(counts)
65 |       }
66 |     }
67 |       
68 |   }
69 | 
70 |   M <- median(lib.size)
71 |   if(log){
72 |     prior.count.scaled <- lib.size/mean(lib.size)*prior.count
73 |     lib.size <- lib.size + 2*prior.count.scaled
74 |     log2(t((t(counts)+prior.count.scaled)/lib.size*M))
75 |   }
76 |   else t(t(counts)/lib.size*M)
77 | }
78 | 
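
The arithmetic in `normCounts` can be followed on a tiny matrix. The sketch below reproduces the matrix-input branch in base R, using invented counts. After normalisation on the counts scale, every column sums to the median library size.

```r
# Hypothetical counts: 3 features (rows) by 3 samples (columns),
# filled column by column
counts <- matrix(c(10, 20, 70,
                   40, 40, 120,
                   5,  15, 30),
                 nrow = 3)
lib.size <- colSums(counts)
M <- median(lib.size)

# Counts scale: divide each column by its library size, multiply by M
norm <- t(t(counts) / lib.size * M)
colSums(norm)

# log2 scale: the prior count is scaled by relative library size and
# the library sizes are inflated by twice the scaled prior
prior.count <- 0.5
prior.count.scaled <- lib.size / mean(lib.size) * prior.count
lnorm <- log2(t((t(counts) + prior.count.scaled) /
                  (lib.size + 2 * prior.count.scaled) * M))
lnorm
```

Scaling to the median library size rather than to one million keeps the normalised values on roughly the same scale as the raw counts.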


--------------------------------------------------------------------------------
/R/normSca.R:
--------------------------------------------------------------------------------
 1 | #' Normalise and scale counts matrix
 2 | #' 
 3 | #' This function is called by the \code{preprocess} function and performs
 4 | #' log-normalisation and scaling of a counts matrix.
 5 | #' 
 6 | #' @param x a matrix of counts
 7 | #' @param lib.size the library size 
 8 | #' @param log logical, indicates whether the output should be on the log2 scale
 9 | #' or counts scale. Default is TRUE.
10 | #' @param prior.count The prior count to add if the data is log2 normalised.
11 | #' Default is a small count of 0.5.
12 | #' 
13 | #' @return a matrix of log-normalised counts (scaling is currently disabled)
14 | #' 
15 | #' @export normSca
16 | #'
17 | #' @examples
18 | #' 
19 | #' y <- matrix(rnbinom(1000, size=2, mu= 20),ncol=10)
20 | #' colnames(y)<- paste("Cell",1:10, sep="")
21 | #' row.names(y)<-paste("Gene",1:100,sep="")
22 | #' norm.data <- normSca(y,lib.size=colSums(y))
23 | #' 
24 | #' #Visualise the counts vs scaled data
25 | #' boxplot(y)
26 | #' boxplot(norm.data)
27 | #' 
28 | normSca<-function(x, lib.size=lib.size, log = TRUE, prior.count = 0.5)
29 | #    Normalise and scale counts matrix
30 | #    Xinyi Jin and Belinda Phipson
31 | #    17 February 2021
32 | #    Modified 17 February 2021    
33 | {
34 |     x <- as.matrix(x)
35 |     # log normalise
36 |     normalisedVal<- normCounts(x, log = log, prior.count = prior.count, 
37 |                                lib.size=lib.size)
38 |     # scale
39 |     #scaledVal<-t(scale(t(normalisedVal)))
40 |     #scaledVal
41 |     
42 |     normalisedVal
43 | }
44 | 
45 | 
46 | 


--------------------------------------------------------------------------------
/R/plotCellTypeMeanVar.R:
--------------------------------------------------------------------------------
 1 | #' Plot cell type counts means versus variances
 2 | #' 
 3 | #' This function returns a plot of the log10(mean) versus log10(variance) of 
 4 | #' the cell type counts. The function takes a matrix of cell type counts as 
 5 | #' input. The rows are the clusters/cell types and the columns are the samples.
 6 | #' 
 7 | #' The expected variance under a binomial distribution is shown as the solid 
 8 | #' black line, and the points represent the observed variance for each cell 
 9 | #' type/row in the counts table. The expected variances under the other models
10 | #' (beta-binomial, negative binomial, Poisson) are shown as coloured lines.
11 | #' 
12 | #' The mean and variance for each cell type are calculated across all samples.
13 | #'
14 | #' @param x a matrix or table of counts
15 | #'
16 | #' @return a base R plot
17 | #' 
18 | #' @importFrom edgeR estimateDisp
19 | #' @importFrom edgeR DGEList
20 | #' @importFrom graphics legend lines par title
21 | #' @importFrom stats lowess
22 | #' 
23 | #' @export
24 | #' 
25 | #' @author Belinda Phipson
26 | #' 
27 | #' @examples
28 | #' library(edgeR)
29 | #' # Generate some data
30 | #' # Total number of samples
31 | #' nsamp <- 10
32 | #' # True cell type proportions
33 | #' p <- c(0.05, 0.15, 0.35, 0.45)
34 | #' 
35 | #' # Parameters for beta distribution
36 | #' a <- 40
37 | #' b <- a*(1-p)/p
38 | #' 
39 | #' # Sample total cell counts per sample from negative binomial distribution
40 | #' numcells <- rnbinom(nsamp,size=20,mu=5000)
41 | #' true.p <- matrix(c(rbeta(nsamp,a,b[1]),rbeta(nsamp,a,b[2]),
42 | #'           rbeta(nsamp,a,b[3]),rbeta(nsamp,a,b[4])),byrow=TRUE, ncol=nsamp)
43 | #' 
44 | #' counts <- matrix(NA,ncol=nsamp, nrow=nrow(true.p))
45 | #' rownames(counts) <- paste("c",0:(nrow(true.p)-1), sep="")
46 | #' for(j in seq_along(p)){
47 | #'     counts[j,] <- rbinom(nsamp, size=numcells, prob=true.p[j,])
48 | #' }
49 | #' 
50 | #' plotCellTypeMeanVar(counts)
51 | #' 
52 | plotCellTypeMeanVar <- function(x){
53 |   # Make sure input is a matrix
54 |   x <- as.matrix(x)
55 |   # Normalise the cell type counts 
56 |   nc <- normCounts(x)
57 | 
58 |   # Beta binomial variance
59 |   params <- estimateBetaParamsFromCounts(x)
60 | 
61 |   #Observed variance
62 |   means <- rowMeans(nc)
63 |   means.mat <- matrix(rep(means,ncol(nc)),ncol=ncol(nc))
64 |   vars <- rowSums((nc-means.mat)^2)/(ncol(nc)-1)
65 | 
66 |   # Binomial variance
67 |   ebv <- params$n*params$pi*(1-params$pi)
68 | 
69 |   # Negative binomial variance
70 |   y <- estimateDisp(DGEList(x))
71 |   tagvars <- means + means^2 * y$tagwise.dispersion
72 | 
73 |   ylimits.min <- min(log10(vars),log10(ebv), log10(params$var),
74 |                      log10(tagvars), log10(params$n*params$pi))
75 |   ylimits.max <- max(log10(vars),log10(ebv), log10(params$var),
76 |                      log10(tagvars), log10(params$n*params$pi))
77 | 
78 |   par(mar=c(5,5,2,2))
79 |   #par(mfrow=c(1,1))
80 |   plot(log10(means),log10(vars), pch=16, cex=1.5, cex.lab=1.5, cex.axis=1.5,
81 |        ylim=c(ylimits.min,ylimits.max),
82 |        xlab = "log10(mean)", ylab = "log10(variance)")
83 |   lines(lowess(log10(means), log10(ebv)), col=1, lwd=2)
84 |   lines(lowess(log10(means), log10(params$var)), lwd=2, col=4)
85 |   lines(lowess(log10(means),log10(tagvars)),lwd=2,col="purple", lty=2)
86 |   lines(lowess(log10(means),log10(params$n*params$pi)),lwd=2,col="red", lty=2)
87 |   legend("bottomright", legend=c("Beta-binomial","Negative binomial", "Binomial", 
88 |                              "Poisson"),
89 |          col=c(4,"purple",1,2), lty=c(1,2,1,2), lwd=2)
90 |   title("Mean-variance relationship: cell type counts", cex.main=1.5)
91 | 
92 | }
93 | 
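
The four reference curves drawn by `plotCellTypeMeanVar` correspond to standard mean-variance relationships for counts. The sketch below evaluates the textbook formulas directly in base R. The values of `n`, `phi` and `a` are illustrative assumptions, and the beta-binomial expression is the usual closed form, which is not necessarily how `estimateBetaParamsFromCounts` computes `params$var`.

```r
# Toy inputs (assumptions for illustration only)
n  <- 5000                        # cells per sample
p  <- c(0.05, 0.15, 0.35, 0.45)   # true cell type proportions
mu <- n * p                       # expected count per cell type

# Poisson: variance equals the mean
var.poisson <- mu

# Binomial: n*p*(1-p), always below the Poisson line
var.binomial <- n * p * (1 - p)

# Negative binomial: mean + mean^2 * dispersion (phi is an assumed value)
phi <- 0.05
var.negbin <- mu + mu^2 * phi

# Beta-binomial with shape parameters a and b, where a/(a+b) = p
a <- 40
b <- a * (1 - p) / p
var.betabin <- n * p * (1 - p) * (a + b + n) / (a + b + 1)

round(cbind(mean = mu, var.poisson, var.binomial, var.negbin, var.betabin), 1)
```

Under these assumptions the binomial variance sits below the Poisson variance, while the negative binomial and beta-binomial variances sit above their Poisson and binomial counterparts respectively.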


--------------------------------------------------------------------------------
/R/plotCellTypeProps.R:
--------------------------------------------------------------------------------
 1 | #' Plot cell type proportions for each sample
 2 | #'
 3 | #' This is a plotting function that shows the cell type composition for each
 4 | #' sample as a stacked barplot. \code{plotCellTypeProps} returns a
 5 | #' \code{ggplot2} object enabling the user to make style changes as required.
 6 | #'
 7 | #' @param x object of class \code{SingleCellExperiment} or \code{Seurat}
 8 | #' @param clusters a factor specifying the cluster or cell type for every cell.
 9 | #' For \code{SingleCellExperiment} objects this should correspond to a column
10 | #' called \code{clusters} in the \code{colData} assay. For \code{Seurat}
11 | #' objects this will be extracted by a call to \code{Idents(x)}.
12 | #' @param sample a factor specifying the biological replicate for each cell.
13 | #' For \code{SingleCellExperiment} objects this should correspond to a column
14 | #' called \code{sample} in the \code{colData} assay and for \code{Seurat}
15 | #' objects this should correspond to \code{x$sample}.
16 | #'
17 | #' @return a ggplot2 object
18 | #'
19 | #' @importFrom ggplot2 aes
20 | #' @importFrom ggplot2 element_text
21 | #' @importFrom ggplot2 geom_bar
22 | #' @importFrom ggplot2 ggplot
23 | #' @importFrom ggplot2 theme
24 | #' @export
25 | #'
26 | #' @author Belinda Phipson
27 | #'
28 | #' @examples
29 | #'
30 | #' library(speckle)
31 | #' library(ggplot2)
32 | #' library(limma)
33 | #'
34 | #' # Generate some fake data from a multinomial distribution
35 | #' # Group A, 4 samples, 1000 cells in each sample
36 | #' countsA <- rmultinom(4, size=1000, prob=c(0.1,0.3,0.6))
37 | #' colnames(countsA) <- paste("s",1:4,sep="")
38 | #'
39 | #' # Group B, 3 samples, 800 cells in each sample
40 | #'
41 | #' countsB <- rmultinom(3, size=800, prob=c(0.2,0.05,0.75))
42 | #' colnames(countsB) <- paste("s",5:7,sep="")
43 | #' rownames(countsA) <- rownames(countsB) <- paste("c",0:2,sep="")
44 | #'
45 | #' allcounts <- cbind(countsA, countsB)
46 | #' sample <- c(rep(colnames(allcounts),allcounts[1,]),
47 | #'           rep(colnames(allcounts),allcounts[2,]),
48 | #'           rep(colnames(allcounts),allcounts[3,]))
49 | #' clust <- rep(rownames(allcounts),rowSums(allcounts))
50 | #'
51 | #' plotCellTypeProps(clusters=clust, sample=sample)
52 | #'
53 | plotCellTypeProps <- function(x=NULL, clusters=NULL, sample=NULL)
54 | {
55 |   if(is.null(x) && is.null(sample) && is.null(clusters))
56 |     stop("Please provide either a SingleCellExperiment object or Seurat ",
57 |          "object with required annotation metadata, or explicitly provide ",
58 |          "clusters and sample information")
59 | 
60 |   if((is.null(clusters) | is.null(sample)) & !is.null(x)){
61 |     # Extract cluster, sample and group info from SCE object
62 |     if(is(x,"SingleCellExperiment"))
63 |       y <- .extractSCE(x)
64 | 
65 |     # Extract cluster, sample and group info from Seurat object
66 |     if(is(x,"Seurat"))
67 |       y <- .extractSeurat(x)
68 | 
69 |     clusters <- y$clusters
70 |     sample <- y$sample
71 |   }
72 | 
73 |   prop.list <- getTransformedProps(clusters, sample)
74 | 
75 |   Proportions <- as.vector(t(prop.list$Proportions))
76 |   Samples <- rep(colnames(prop.list$Proportions), nrow(prop.list$Proportions))
77 |   Clusters <- rep(rownames(prop.list$Proportions),
78 |                   each=ncol(prop.list$Proportions))
79 | 
80 |   plotdf <- data.frame(Samples=Samples, Clusters=Clusters,
81 |                        Proportions=Proportions)
82 | 
83 |   ggplot(plotdf,aes(x=Samples,y=Proportions,fill=Clusters)) +
84 |     geom_bar(stat="identity") +
85 |     theme(axis.text.x = element_text(size=12),
86 |           axis.text.y = element_text(size=12),
87 |           axis.title = element_text(size=14),
88 |           legend.text = element_text(size=12),
89 |           legend.title = element_text(size=14))
90 | 
91 | }
92 | 
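
The reshaping step in `plotCellTypeProps` turns a clusters-by-samples matrix of proportions into a long data frame with one row per cluster/sample pair. A minimal base R sketch of the same idea follows. It uses `table` and `prop.table` as stand-ins for `getTransformedProps`, and the toy `clusters` and `sample` vectors are made up for illustration.

```r
# Toy cell-level annotation (hypothetical data)
clusters <- c("c0", "c0", "c1", "c2", "c0", "c1", "c1", "c2")
sample   <- c("s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2")

# Cells per cluster (rows) and sample (columns)
counts <- table(clusters, sample)

# Divide each column by its total so every sample sums to 1
props <- prop.table(counts, margin = 2)

# Long format: one row per cluster/sample combination
plotdf <- as.data.frame(props, responseName = "Proportions")
plotdf
```

A data frame in this shape can be passed to `ggplot` with `geom_bar(stat="identity")` to obtain a stacked barplot.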


--------------------------------------------------------------------------------
/R/plotCellTypePropsMeanVar.R:
--------------------------------------------------------------------------------
 1 | #' Plot cell type proportions versus variances
 2 | #' 
 3 | #' This function returns a plot of the log10(proportion) versus log10(variance)
 4 | #' given a matrix of cell type counts. The rows are the clusters/cell types and
 5 | #' the columns are the samples.
 6 | #' 
 7 | #' The expected variance under a binomial distribution is shown in the solid 
 8 | #' line, and the points represent the observed variance for each cell type/row 
 9 | #' in the counts table. The blue line shows the empirical Bayes variance 
10 | #' that is used in \code{propeller}.
11 | #' 
12 | #' The mean and variance for each cell type are calculated across all samples.
13 | #' 
14 | #' @param x a matrix or table of counts
15 | #'
16 | #' @return a base R plot
17 | #' 
18 | #' @importFrom limma lmFit
19 | #' @importFrom limma eBayes
20 | #' @importFrom graphics legend lines par title
21 | #' @importFrom stats lowess
22 | #' 
23 | #' @export
24 | #' 
25 | #' @author Belinda Phipson
26 | #'
27 | #' @examples
28 | #' library(limma)
29 | #' # Generate some data
30 | #' # Total number of samples
31 | #' nsamp <- 10
32 | #' # True cell type proportions
33 | #' p <- c(0.05, 0.15, 0.35, 0.45)
34 | #' 
35 | #' # Parameters for beta distribution
36 | #' a <- 40
37 | #' b <- a*(1-p)/p
38 | #' 
39 | #' # Sample total cell counts per sample from negative binomial distribution
40 | #' numcells <- rnbinom(nsamp,size=20,mu=5000)
41 | #' true.p <- matrix(c(rbeta(nsamp,a,b[1]),rbeta(nsamp,a,b[2]),
42 | #'           rbeta(nsamp,a,b[3]),rbeta(nsamp,a,b[4])),byrow=TRUE, ncol=nsamp)
43 | #' 
44 | #' counts <- matrix(NA,ncol=nsamp, nrow=nrow(true.p))
45 | #' rownames(counts) <- paste("c",0:(nrow(true.p)-1), sep="")
46 | #' for(j in 1:length(p)){
47 | #'     counts[j,] <- rbinom(nsamp, size=numcells, prob=true.p[j,])
48 | #' }
49 | #' 
50 | #' plotCellTypePropsMeanVar(counts)
51 | #' 
52 | plotCellTypePropsMeanVar <- function(x){
53 |   x <- as.matrix(x)
54 |   params <- estimateBetaParamsFromCounts(x)
55 |   tot.cells <- colSums(x)
56 |   props <- t(t(x)/tot.cells)
57 | 
58 |   varp <- params$pi*(1-params$pi)/params$n
59 |   var.props <- apply(props,1,var)
60 |   fitp <- lmFit(props)
61 |   fitp <- eBayes(fitp, robust=TRUE)
62 | 
63 |   ylimits.min <- min(log10(var.props),log10(varp), log10(fitp$s2.post))
64 |   ylimits.max <- max(log10(var.props),log10(varp), log10(fitp$s2.post))
65 | 
66 |  # par(mfrow=c(1,1))
67 |   plot(log10(rowMeans(props)), log10(var.props), pch=16, cex=2,
68 |        xlab="log10(proportion)", ylab="log10(variance)",
69 |        ylim=c(ylimits.min,ylimits.max), cex.lab=1.5, cex.axis=1.5)
70 |   lines(lowess(log10(rowMeans(props)),log10(varp)), lwd=2)
71 | 
72 |   lines(lowess(log10(rowMeans(props)), log10(fitp$s2.post)), lwd=2, col=4)
73 |   legend("bottomright", legend=c("Empirical Bayes variance","Binomial variance"),
74 |          col=c(4,1), lty=1, lwd=2)
75 | 
76 |   title("Mean-variance relationship: cell type proportions", cex.main=1.5)
77 | }
78 | 
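
The binomial reference line in `plotCellTypePropsMeanVar` uses the variance of a proportion, `pi*(1-pi)/n`. The simulation below is a rough sketch of why observed variances tend to sit above that line when the true proportions vary between samples. The sample size, cell number and beta shape parameter are assumptions chosen for illustration.

```r
set.seed(42)
nsamp <- 2000    # number of simulated samples (assumed)
n     <- 1000    # cells per sample (assumed)
p     <- 0.2     # true proportion of one cell type

# Binomial sampling only: every sample shares the same true proportion
x.bin   <- rbinom(nsamp, size = n, prob = p)
obs.bin <- var(x.bin / n)

# Theoretical binomial variance of a proportion
theo.bin <- p * (1 - p) / n

# Extra biological variability: the true proportion varies between samples
a <- 40
b <- a * (1 - p) / p
true.p <- rbeta(nsamp, a, b)
x.bb   <- rbinom(nsamp, size = n, prob = true.p)
obs.bb <- var(x.bb / n)

c(theoretical = theo.bin, binomial = obs.bin, overdispersed = obs.bb)
```

In this setting the purely binomial samples stay close to the theoretical value, while the overdispersed samples show a clearly larger variance.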


--------------------------------------------------------------------------------
/R/preprocess.R:
--------------------------------------------------------------------------------
  1 | #' Pre-processing function for sex classification 
  2 | #' 
  3 | #' The purpose of this function is to process a single cell counts matrix into
  4 | #' the appropriate format for the \code{classifySex} function. 
  5 | #' 
  6 | #' This function will filter out cells that are unable to be classified due to 
  7 | #' zero counts on *XIST/Xist* and all of the Y chromosome genes. If 
  8 | #' \code{qc=TRUE} additional cells are removed as identified by the 
  9 | #' \code{perCellQCMetrics} and \code{quickPerCellQC} functions from the 
 10 | #' \code{scuttle} package. The resulting counts matrix is then log-normalised 
 11 | #' and scaled. 
 12 | #' 
 13 | #' @param x the counts matrix, rows are genes and columns are cells. Row names 
 14 | #' must be gene symbols. 
 15 | #' @param genome the genome the data arises from. Current options are 
 16 | #' human: genome = "Hs" or mouse: genome = "Mm".
 17 | #' @param qc logical, indicates whether to perform additional quality control on 
 18 | #' the cells. qc = TRUE will predict cells that pass quality control only and 
 19 | #' the filtered cells will not be classified. qc = FALSE will predict every cell 
 20 | #' except the cells with zero counts on *XIST/Xist* and the sum of the Y genes. 
 21 | #' Default is TRUE.
 22 | #' 
 23 | #' @return outputs a list object with the following components
 24 | #' \item{tcm.final }{A transposed count matrix where rows are cells and columns 
 25 | #' are the features used for classification.}
 26 | #' \item{data.df }{The normalised and scaled \code{tcm.final} matrix.} 
 27 | #' \item{discarded.cells }{Character vector of cell IDs for the cells that are
 28 | #' discarded when \code{qc=TRUE}.}
 29 | #' \item{zero.cells }{Character vector of cell IDs for the cells that can not
 30 | #' be classified as male/female due to zero counts on *Xist* and all the 
 31 | #' Y chromosome genes.}
 32 | #' 
 33 | #' @importFrom AnnotationDbi select
 34 | #' @importFrom stringr str_to_title 
 35 | #' @importFrom scuttle perCellQCMetrics
 36 | #' @importFrom scuttle quickPerCellQC
 37 | #' @importFrom org.Hs.eg.db org.Hs.eg.db
 38 | #' @importFrom org.Mm.eg.db org.Mm.eg.db
 39 | #' @export preprocess
 40 | #'
 41 | #' @examples 
 42 | #' 
 43 | #' library(speckle)
 44 | #' library(SingleCellExperiment)
 45 | #' library(CellBench)
 46 | #' library(org.Hs.eg.db)
 47 | #'
 48 | #' # Get data from CellBench library
 49 | #' sc_data <- load_sc_data()
 50 | #' sc_10x <- sc_data$sc_10x
 51 | #'
 52 | #' # Get counts matrix in correct format with gene symbol as rownames 
 53 | #' # rather than ENSEMBL ID.
 54 | #' counts <- counts(sc_10x)
 55 | #' ann <- select(org.Hs.eg.db, keys=rownames(sc_10x),
 56 | #'              columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL")
 57 | #' m <- match(rownames(counts), ann$ENSEMBL)
 58 | #' rownames(counts) <- ann$SYMBOL[m]
 59 | #' 
 60 | #' # Preprocess data
 61 | #' pro.data <- preprocess(counts, genome="Hs", qc = TRUE)
 62 | #' 
 63 | #' # Look at counts on XIST and superY
 64 | #' plot(pro.data$tcm.final$XIST, pro.data$tcm.final$superY)
 65 | #' 
 66 | #' # Cells that are identified as low quality
 67 | #' pro.data$discarded.cells
 68 | #' 
 69 | #' # Cells with zero counts on XIST and all Y genes
 70 | #' pro.data$zero.cells
 71 | #' 
 72 | preprocess <- function(x, genome, qc=TRUE){
 73 |   
 74 |   x <- as.matrix(x)
 75 |   row.names(x)<- toupper(row.names(x))
 76 |   # genes located in the X chromosome that have been reported to escape 
 77 |   # X-inactivation
 78 |   # http://bioinf.wehi.edu.au/software/GenderGenes/index.html
 79 |   Xgenes<- c("ARHGAP4","STS","ARSD", "ARSL", "AVPR2", "BRS3", "S100G", "CHM",
 80 |              "CLCN4", "DDX3X","EIF1AX","EIF2S3", "GPM6B", "GRPR", "HCFC1",
 81 |              "L1CAM", "MAOA", "MYCLP1", "NAP1L3", "GPR143", "CDK16", "PLXNB3",
 82 |              "PRKX", "RBBP7", "RENBP", "RPS4X", "TRAPPC2", "SH3BGRL", "TBL1X",
 83 |              "UBA1", "KDM6A", "XG", "XIST", "ZFX", "PUDP", "PNPLA4", "USP9X",
 84 |              "KDM5C", "SMC1A", "NAA10", "OFD1", "IKBKG", "PIR", "INE2", "INE1",
 85 |              "AP1S2", "GYG2", "MED14", "RAB9A", "ITM2A", "MORF4L2", "CA5B",
 86 |              "SRPX2", "GEMIN8", "CTPS2", "CLTRN", "NLGN4X", "DUSP21", "ALG13",
 87 |              "SYAP1", "SYTL4", "FUNDC1", "GAB3", "RIBC1", "FAM9C","CA5BP1")
 88 |   
 89 |   # genes belonging to the male-specific region of chromosome Y (unique genes)
 90 |   # http://bioinf.wehi.edu.au/software/GenderGenes/index.html
 91 |   Ygenes<-c("AMELY", "DAZ1", "PRKY", "RBMY1A1", "RBMY1HP", "RPS4Y1", "SRY",
 92 |             "TSPY1", "UTY", "ZFY","KDM5D", "USP9Y", "DDX3Y", "PRY", "XKRY",
 93 |             "BPY2", "VCY", "CDY1", "EIF1AY", "TMSB4Y","CDY2A", "NLGN4Y",
 94 |             "PCDH11Y", "HSFY1", "TGIF2LY", "TBL1Y", "RPS4Y2", "HSFY2",
 95 |             "CDY2B", "TXLNGY","CDY1B", "DAZ3", "DAZ2", "DAZ4")
 96 |   
 97 |   # build artificial genes
 98 |   Xgene.set <-Xgenes[Xgenes %in% row.names(x)]
 99 |   Ygene.set <-Ygenes[Ygenes %in% row.names(x)]
100 |   cm.new<-as.data.frame(matrix(rep(0, 3*ncol(x)), ncol = ncol(x),nrow = 3))
101 |   row.names(cm.new) <- c("XIST","superX","superY")
102 |   colnames(cm.new) <- colnames(x)
103 |   cm.new["XIST", ]<- x["XIST", ]
104 |   cm.new["superX", ] <- colSums(x[Xgene.set, , drop=FALSE])
105 |   cm.new["superY", ] <- colSums(x[Ygene.set, , drop=FALSE])
106 |   
107 | #  if (genome == "Mm"){
108 | #    ann <- suppressWarnings(AnnotationDbi::select(org.Mm.eg.db,keys=str_to_title(row.names(x)),
109 | #                                  columns=c("SYMBOL","GENENAME","CHR"),
110 | #                                  keytype="SYMBOL"))
111 | #  }else{
112 | #    ann <- suppressWarnings(AnnotationDbi::select(org.Hs.eg.db,keys=row.names(x),
113 | #                                  columns=c("SYMBOL","GENENAME","CHR"),
114 | #                                  keytype="SYMBOL"))
115 | #  }
116 | #  # create  superY.all
117 | #  Ychr.genes<- toupper(unique(ann[which(ann$CHR=="Y"), "SYMBOL"]))
118 | #  missing <- setdiff(Ychr.genes, row.names(x))
119 | #  if (length(missing) >0){
120 | #    Ychr.genes <- Ychr.genes[-match(missing, Ychr.genes)]
121 | #  }
122 | #  cm.new["superY.all", ] <- colSums(x[Ychr.genes,])
123 |   
124 |   ############################################################################
125 |   # Pre-processing
126 |   # perform simple QC
127 |   # keep a copy of library size
128 |   discarded.cells <- NA
129 |   if (qc == TRUE){
130 |     #data.sce <-SingleCellExperiment(assays = list(counts = x))
131 |     qcstats <- scuttle::perCellQCMetrics(x,subsets=list(Mito=1:100))
132 |     qcfilter <- scuttle::quickPerCellQC(qcstats, 
133 |                                         percent_subsets=c("subsets_Mito_percent"))
134 |     # save the discarded cells
135 |     discarded.cells <- colnames(x)[qcfilter$discard]
136 |     
137 |     # cm.new only contains cells that pass the quality control
138 |     cm.new <-cm.new[,!qcfilter$discard]
139 |   }
140 |   
141 |   tcm.final <- t(cm.new)
142 |   tcm.final <- as.data.frame(tcm.final)
143 |   
144 |   #Do Not Classify
145 |   zero.cells <- NA
146 |   dnc <- tcm.final$superY==0 & tcm.final$superX==0
147 |   if(any(dnc)==TRUE){
148 |       zero.cells <- row.names(tcm.final)[dnc]
149 |       cat(length(zero.cells), "cell/s are unable to be classified due to an",
150 |       "abundance of zeroes on X and Y chromosome genes\n")
151 |   }
152 |   tcm.final <- tcm.final[!dnc, ]
153 | 
154 |   cm.new <- cm.new[,!dnc]
155 |   cm.lib.size<- colSums(x[,colnames(cm.new)], na.rm=TRUE)
156 |   
157 |   # log-normalisation performed for each cell
158 |   # scaling performed for each gene
159 |   normsca.cm <- normSca(cm.new, lib.size=cm.lib.size)
160 |   data.df <- t(normsca.cm)
161 |   data.df <- as.data.frame(data.df)
162 |   
163 |   list(tcm.final=tcm.final, data.df=data.df, discarded.cells=discarded.cells, 
164 |        zero.cells=zero.cells)
165 | }
166 | 
167 | 
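
The "artificial gene" step in `preprocess` sums counts over whichever X and Y genes are present in the matrix, then flags cells with no signal on either. A small sketch on a made-up counts matrix is shown below. The gene lists are shortened for illustration, and `drop = FALSE` is used so the subset stays a matrix even when only one gene matches.

```r
# Toy counts matrix: genes in rows, cells in columns (hypothetical values)
x <- matrix(c(3, 0, 0,
              1, 0, 2,
              0, 0, 5,
              4, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("XIST", "DDX3X", "RPS4Y1", "ACTB"),
                            c("cell1", "cell2", "cell3")))

# Shortened gene lists for illustration; some genes are absent from x
Xgenes <- c("XIST", "DDX3X", "KDM6A")
Ygenes <- c("RPS4Y1", "DDX3Y")

# Keep only the genes that are actually present
Xgene.set <- Xgenes[Xgenes %in% rownames(x)]
Ygene.set <- Ygenes[Ygenes %in% rownames(x)]

# Sum counts per cell; drop = FALSE keeps a one-row subset as a matrix
superX <- colSums(x[Xgene.set, , drop = FALSE])
superY <- colSums(x[Ygene.set, , drop = FALSE])

# Cells with no counts on either gene set cannot be classified
zero.cells <- colnames(x)[superX == 0 & superY == 0]
zero.cells
```

Here `cell2` has zero counts on both gene sets, so it is the only cell flagged.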


--------------------------------------------------------------------------------
/R/propeller.R:
--------------------------------------------------------------------------------
  1 | #' Finding statistically significant differences in cell type proportions
  2 | #'
  3 | #' Calculates cell type proportions, performs a variance stabilising
  4 | #' transformation on the proportions and determines whether the cell type
  5 | #' proportions differ significantly between different groups using
  6 | #' linear modelling.
  7 | #'
  8 | #' This function will take a \code{SingleCellExperiment} or \code{Seurat}
  9 | #' object and extract the \code{group}, \code{sample} and \code{clusters} cell
 10 | #' information. The user can either state these factor vectors explicitly in
 11 | #' the call to the \code{propeller} function, or internal functions will
 12 | #' extract them from the relevant objects. The user must ensure that
 13 | #' \code{group} and \code{sample} are columns in the metadata assays of the
 14 | #' relevant objects (any combination of upper/lower case is acceptable). For
 15 | #' \code{Seurat} objects the clusters are extracted using the \code{Idents}
 16 | #' function. For \code{SingleCellExperiment} objects, \code{clusters} needs to
 17 | #' be a column in the \code{colData} assay.
 18 | #'
 19 | #' The \code{propeller} function calculates cell type proportions for each
 20 | #' biological replicate, performs a variance stabilising transformation on the
 21 | #' matrix of proportions and fits a linear model for each cell type or cluster
 22 | #' using the \code{limma} framework. There are two options for the 
 23 | #' transformation: arcsin square root or logit. Propeller tests whether there 
 24 | #' is a difference in the cell type proportions between multiple groups. 
 25 | #' If there are only 2 groups, a t-test is used to calculate p-values, and if 
 26 | #' there are more than 2 groups, an F-test (ANOVA) is used. Cell type 
 27 | #' proportions of 1 or 0 are accommodated. Benjamini and Hochberg false 
 28 | #' discovery rates are calculated to account for multiple testing of 
 29 | #' cell types/clusters.
 30 | #'
 31 | #' @aliases propeller
 32 | #' @param x object of class \code{SingleCellExperiment} or \code{Seurat}
 33 | #' @param clusters a factor specifying the cluster or cell type for every cell.
 34 | #' For \code{SingleCellExperiment} objects this should correspond to a column
 35 | #' called \code{clusters} in the \code{colData} assay. For \code{Seurat}
 36 | #' objects this will be extracted by a call to \code{Idents(x)}.
 37 | #' @param sample a factor specifying the biological replicate for each cell.
 38 | #' For \code{SingleCellExperiment} objects this should correspond to a column
 39 | #' called \code{sample} in the \code{colData} assay and for \code{Seurat}
 40 | #' objects this should correspond to \code{x$sample}.
 41 | #' @param group a factor specifying the groups of interest for performing the
 42 | #' differential proportions analysis. For \code{SingleCellExperiment} objects
 43 | #' this should correspond to a column called \code{group} in the \code{colData}
 44 | #' assay.  For \code{Seurat} objects this should correspond to \code{x$group}.
 45 | #' @param trend logical, if true fits a mean variance trend on the transformed
 46 | #' proportions
 47 | #' @param robust logical, if true performs robust empirical Bayes shrinkage of
 48 | #' the variances
 49 | #' @param transform a character scalar specifying which transformation of the 
 50 | #' proportions to perform. Possible values include "asin" or "logit". Defaults
 51 | #' to "logit".
 52 | #'
 53 | #' @return produces a dataframe of results
 54 | #'
 55 | #' @importFrom stats p.adjust
 56 | #' @export propeller
 57 | #'
 58 | #' @author Belinda Phipson
 59 | #'
 60 | #' @seealso \code{\link{propeller.ttest}} \code{\link{propeller.anova}} 
 61 | #' \code{\link{lmFit}}, \code{\link{eBayes}},
 62 | #' \code{\link{getTransformedProps}}
 63 | #'
 64 | #' @references Smyth, G.K. (2004). Linear models and empirical Bayes methods
 65 | #' for assessing differential expression in microarray experiments.
 66 | #' \emph{Statistical Applications in Genetics and Molecular Biology}, Volume
 67 | #' \bold{3}, Article 3.
 68 | #'
 69 | #' Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery
 70 | #' rate: a practical and powerful approach to multiple testing. \emph{Journal
 71 | #' of the Royal Statistical Society, Series B}, \bold{57}, 289-300.
 72 | #'
 73 | #' @examples
 74 | #'
 75 | #'   library(speckle)
 76 | #'   library(ggplot2)
 77 | #'   library(limma)
 78 | #'
 79 | #'   # Make up some data
 80 | #'   # True cell type proportions for 4 samples
 81 | #'   p_s1 <- c(0.5,0.3,0.2)
 82 | #'   p_s2 <- c(0.6,0.3,0.1)
 83 | #'   p_s3 <- c(0.3,0.4,0.3)
 84 | #'   p_s4 <- c(0.4,0.3,0.3)
 85 | #'
 86 | #'   # Total numbers of cells per sample
 87 | #'   numcells <- c(1000,1500,900,1200)
 88 | #'
 89 | #'   # Generate cell-level vector for sample info
 90 | #'   biorep <- rep(c("s1","s2","s3","s4"),numcells)
 91 | #'   length(biorep)
 92 | #'
 93 | #'   # Numbers of cells for each of the 3 clusters per sample
 94 | #'   n_s1 <- p_s1*numcells[1]
 95 | #'   n_s2 <- p_s2*numcells[2]
 96 | #'   n_s3 <- p_s3*numcells[3]
 97 | #'   n_s4 <- p_s4*numcells[4]
 98 | #'
 99 | #'   # Assign cluster labels for 4 samples
100 | #'   cl_s1 <- rep(c("c0","c1","c2"),n_s1)
101 | #'   cl_s2 <- rep(c("c0","c1","c2"),n_s2)
102 | #'   cl_s3 <- rep(c("c0","c1","c2"),n_s3)
103 | #'   cl_s4 <- rep(c("c0","c1","c2"),n_s4)
104 | #'
105 | #'   # Generate cell-level vector for cluster info
106 | #'   clust <- c(cl_s1,cl_s2,cl_s3,cl_s4)
107 | #'   length(clust)
108 | #'
109 | #'   # Assume s1 and s2 belong to group 1 and s3 and s4 belong to group 2
110 | #'   grp <- rep(c("grp1","grp2"),c(sum(numcells[1:2]),sum(numcells[3:4])))
111 | #'
112 | #'   propeller(clusters = clust, sample = biorep, group = grp,
113 | #'   robust = FALSE, trend = FALSE, transform="asin")
114 | #'
115 | propeller <- function(x=NULL, clusters=NULL, sample=NULL, group=NULL,
116 |                       trend=FALSE, robust=TRUE, transform="logit")
117 | #    Testing for differences in cell type proportions
118 | #    Belinda Phipson
119 | #    29 July 2019
120 | #    Modified 22 April 2020
121 | {
122 | 
123 |     if(is.null(x) & is.null(sample) & is.null(group) & is.null(clusters))
124 |         stop("Please provide either a SingleCellExperiment object or Seurat ",
125 |         "object with required annotation metadata, or explicitly provide ",
126 |         "clusters, sample and group information")
127 | 
128 |     if((is.null(clusters) | is.null(sample) | is.null(group)) & !is.null(x)){
129 |         # Extract cluster, sample and group info from SCE object
130 |         if(is(x,"SingleCellExperiment"))
131 |             y <- .extractSCE(x)
132 | 
133 |         # Extract cluster, sample and group info from Seurat object
134 |         if(is(x,"Seurat"))
135 |             y <- .extractSeurat(x)
136 | 
137 |         clusters <- y$clusters
138 |         sample <- y$sample
139 |         group <- y$group
140 |     }
141 |     
142 |     if(is.null(transform)) transform <- "logit"
143 | 
144 |     # Get transformed proportions
145 |     prop.list <- getTransformedProps(clusters, sample, transform)
146 | 
147 |     # Calculate baseline proportions for each cluster
148 |     baseline.props <- table(clusters)/sum(table(clusters))
149 | 
150 |     # Collapse group information
151 |     group.coll <- table(sample, group)
152 | 
153 |     design <- matrix(as.integer(group.coll != 0), ncol=ncol(group.coll))
154 |     colnames(design) <- colnames(group.coll)
155 | 
156 |     if(ncol(design)==2){
157 |         message("group variable has 2 levels, t-tests will be performed")
158 |         contrasts <- c(1,-1)
159 |         out <- propeller.ttest(prop.list, design, contrasts=contrasts,
160 |                                robust=robust, trend=trend, sort=FALSE)
161 |         out <- data.frame(BaselineProp=baseline.props,out)
162 |         o <- order(out$P.Value)
163 |         out[o,]
164 |     }
165 |     else if(ncol(design)>2){
166 |         message("group variable has > 2 levels, ANOVA will be performed")
167 |         coef <- seq_len(ncol(design))
168 |         out <- propeller.anova(prop.list, design, coef=coef, robust=robust,
169 |                                trend=trend, sort=FALSE)
170 |         out <- data.frame(BaselineProp=as.vector(baseline.props),out)
171 |         o <- order(out$P.Value)
172 |         out[o,]
173 |   }
174 | 
175 | }
176 | 
177 | #' Extract metadata from \code{SingleCellExperiment} object
178 | #'
179 | #' This is an accessor function that extracts cluster, sample and group
180 | #' information for each cell.
181 | #'
182 | #' @param x object of class \code{SingleCellExperiment}
183 | #'
184 | #' @return a dataframe containing clusters, sample and group
185 | #'
186 | #' @importFrom methods is
187 | #' @importFrom SingleCellExperiment colData
188 | #'
189 | #' @author Belinda Phipson
190 | #'
191 | .extractSCE <- function(x){
192 |     message("extracting sample information from SingleCellExperiment object")
193 |     colnames(colData(x)) <- toupper(colnames(colData(x)))
194 |     clusters <- factor(colData(x)$CLUSTER)
195 |     sample <- factor(colData(x)$SAMPLE)
196 |     group <- factor(colData(x)$GROUP)
197 |     data.frame(clusters=clusters,sample=sample,group=group)
198 | }
199 | 
200 | #' Extract metadata from \code{Seurat} object
201 | #'
202 | #' This is an accessor function that extracts cluster, sample and group
203 | #' information for each cell.
204 | #'
205 | #' @param x object of class \code{Seurat}
206 | #'
207 | #' @return a dataframe containing clusters, sample and group
208 | #'
209 | #' @importFrom Seurat Idents
210 | #'
211 | #' @author Belinda Phipson
212 | #'
213 | .extractSeurat <- function(x){
214 |     message("extracting sample information from Seurat object")
215 |     colnames(x@meta.data) <- toupper(colnames(x@meta.data))
216 |     clusters <- factor(Idents(x))
217 |     sample <- factor(x$SAMPLE)
218 |     group <- factor(x$GROUP)
219 |     data.frame(clusters=clusters,sample=sample,group=group)
220 | }
221 | 
222 | 
223 | 
224 | 
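
Inside `propeller`, the cell-level `sample` and `group` vectors are collapsed into a sample-level design matrix with one indicator column per group. The sketch below reproduces that collapse on toy vectors and also evaluates the two transformations named in the documentation in their plain textbook form. How `getTransformedProps` handles proportions of exactly 0 or 1 is not shown here, and the input values are made up.

```r
# Toy cell-level annotation: 10 cells from 4 samples in 2 groups
sample <- rep(c("s1", "s2", "s3", "s4"), times = c(3, 2, 4, 1))
group  <- rep(c("grp1", "grp2"), times = c(5, 5))

# Cross-tabulate cells by sample and group
group.coll <- table(sample, group)

# A sample belongs to a group if it has at least one cell in that group
design <- matrix(as.integer(group.coll != 0), ncol = ncol(group.coll))
colnames(design) <- colnames(group.coll)
rownames(design) <- rownames(group.coll)
design

# Textbook forms of the two transformations of a proportion
p <- c(0.05, 0.5, 0.95)
asin.sqrt <- asin(sqrt(p))
logit     <- log(p / (1 - p))
cbind(p, asin.sqrt, logit)
```

Each row of `design` corresponds to one biological replicate, which is the unit of replication in the downstream linear model.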


--------------------------------------------------------------------------------
/R/propeller.anova.R:
--------------------------------------------------------------------------------
  1 | #' Performs F-tests for transformed cell type proportions
  2 | #'
  3 | #' This function is called by \code{propeller} and performs F-tests between
  4 | #' multiple experimental groups or conditions (> 2) on transformed cell type 
  5 | #' proportions.
  6 | #'
  7 | #' In order to run this function, the user needs to run the
  8 | #' \code{getTransformedProps} function first. The output from
  9 | #' \code{getTransformedProps} is used as input. The \code{propeller.anova}
 10 | #' function expects that the design matrix is not in the intercept format.
 11 | #' The \code{coef} vector identifies the columns in the design matrix that
 12 | #' correspond to the groups being tested.
 13 | #' Note that additional confounding covariates can be accounted for as extra
 14 | #' columns in the design matrix, but need to come after the group-specific
 15 | #' columns.
 16 | #'
 17 | #' The \code{propeller.anova} function uses the \code{limma} functions
 18 | #' \code{lmFit} and \code{eBayes} to extract F statistics and p-values.
 19 | #' This has the additional advantage that empirical Bayes shrinkage of the
 20 | #' variances is performed.
 21 | #'
 22 | #' @param prop.list a list object containing two matrices:
 23 | #' \code{TransformedProps} and \code{Proportions}
 24 | #' @param design a design matrix with rows corresponding to samples and columns
 25 | #' to coefficients to be estimated
 26 | #' @param coef a vector specifying which columns of the design matrix
 27 | #' correspond to the groups to test
 28 | #' @param robust logical, should robust variance estimation be used. Defaults to
 29 | #' TRUE.
 30 | #' @param trend logical, should a trend between means and variances be accounted
 31 | #' for. Defaults to FALSE.
 32 | #' @param sort logical, should the output be sorted by P-value.
 33 | #'
 34 | #' @return produces a dataframe of results
 35 | #'
 36 | #' @importFrom stats p.adjust
 37 | #' @importFrom limma lmFit
 38 | #' @importFrom limma eBayes
 39 | #' @export propeller.anova
 40 | #'
 41 | #' @author Belinda Phipson
 42 | #'
 43 | #' @seealso \code{\link{propeller}}, \code{\link{getTransformedProps}},
 44 | #' \code{\link{lmFit}}, \code{\link{eBayes}}
 45 | #'
 46 | #' @examples
 47 | #'   library(speckle)
 48 | #'   library(ggplot2)
 49 | #'   library(limma)
 50 | #'
 51 | #'   # Make up some data
 52 | #'
 53 | #'   # True cell type proportions for 6 samples
 54 | #'   p_s1 <- c(0.5,0.3,0.2)
 55 | #'   p_s2 <- c(0.6,0.3,0.1)
 56 | #'   p_s3 <- c(0.3,0.4,0.3)
 57 | #'   p_s4 <- c(0.4,0.3,0.3)
 58 | #'   p_s5 <- c(0.8,0.1,0.1)
 59 | #'   p_s6 <- c(0.75,0.2,0.05)
 60 | #'
 61 | #'   # Total numbers of cells per sample
 62 | #'   numcells <- c(1000,1500,900,1200,1000,800)
 63 | #'
 64 | #'   # Generate cell-level vector for sample info
 65 | #'   biorep <- rep(c("s1","s2","s3","s4","s5","s6"),numcells)
 66 | #'   length(biorep)
 67 | #'
 68 | #'   # Numbers of cells for each of 3 clusters per sample
 69 | #'   n_s1 <- p_s1*numcells[1]
 70 | #'   n_s2 <- p_s2*numcells[2]
 71 | #'   n_s3 <- p_s3*numcells[3]
 72 | #'   n_s4 <- p_s4*numcells[4]
 73 | #'   n_s5 <- p_s5*numcells[5]
 74 | #'   n_s6 <- p_s6*numcells[6]
 75 | #'
 76 | #'   cl_s1 <- rep(c("c0","c1","c2"),n_s1)
 77 | #'   cl_s2 <- rep(c("c0","c1","c2"),n_s2)
 78 | #'   cl_s3 <- rep(c("c0","c1","c2"),n_s3)
 79 | #'   cl_s4 <- rep(c("c0","c1","c2"),n_s4)
 80 | #'   cl_s5 <- rep(c("c0","c1","c2"),n_s5)
 81 | #'   cl_s6 <- rep(c("c0","c1","c2"),n_s6)
 82 | #'
 83 | #'   # Generate cell-level vector for cluster info
 84 | #'   clust <- c(cl_s1,cl_s2,cl_s3,cl_s4,cl_s5,cl_s6)
 85 | #'   length(clust)
 86 | #'
 87 | #'   prop.list <- getTransformedProps(clusters = clust, sample = biorep)
 88 | #'
 89 | #'   # Assume s1 and s2 belong to group A, s3 and s4 belong to group B, s5 and
 90 | #'   # s6 belong to group C
 91 | #'   grp <- rep(c("A","B","C"), each=2)
 92 | #'
 93 | #'   # Make sure design matrix does not have an intercept term
 94 | #'   design <- model.matrix(~0+grp)
 95 | #'   design
 96 | #'
 97 | #'   propeller.anova(prop.list, design=design, coef=c(1,2,3), robust=TRUE,
 98 | #'   trend=FALSE, sort=TRUE)
 99 | #'
100 | propeller.anova <- function(prop.list, design, coef,
101 |                             robust=TRUE, trend=FALSE, sort=TRUE)
102 | {
103 |     prop.trans <- prop.list$TransformedProps
104 |     prop <- prop.list$Proportions
105 | 
106 |     # get cell type mean proportions ignoring other variables
107 |     # this assumes that the design matrix is not in Intercept format
108 |     fit.prop <- lmFit(prop, design[,coef])
109 | 
110 |     # Change design matrix to intercept format
111 |     design[,1] <- 1
112 |     colnames(design)[1] <- "Int"
113 | 
114 |     # Fit linear model taking into account all confounding variables
115 |     fit <- lmFit(prop.trans,design)
116 | 
117 |     # Get F statistics corresponding to group information only
118 |     # You have to remove the intercept term for this to work
119 |     fit <- eBayes(fit[,coef[-1]], robust=robust, trend=trend)
120 | 
121 |     # Extract F p-value
122 |     p.value <- fit$F.p.value
123 |     # and perform FDR adjustment
124 |     fdr <- p.adjust(fit$F.p.value, method="BH")
125 | 
126 |     out <- data.frame(PropMean=fit.prop$coefficients, Fstatistic= fit$F,
127 |                     P.Value=p.value, FDR=fdr)
128 |     if(sort){
129 |         o <- order(out$P.Value)
130 |         out[o,]
131 |     }
132 |     else out
133 | }
134 | 
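
`propeller.anova` overwrites the first column of a no-intercept design with ones before fitting, then tests the remaining group coefficients. The sketch below illustrates why this is a reparameterisation and not a different model: both designs give the same fitted values, and the non-intercept coefficients become differences from the first group. It uses `lm.fit` from `stats` on made-up data in place of `limma::lmFit`.

```r
# Three groups with two samples each
grp <- factor(rep(c("A", "B", "C"), each = 2))

# Group-means parameterisation (no intercept)
design <- model.matrix(~ 0 + grp)

# Convert to intercept format as in propeller.anova
design.int <- design
design.int[, 1] <- 1
colnames(design.int)[1] <- "Int"

# Made-up response, e.g. transformed proportions for one cell type
set.seed(1)
y <- rnorm(6)

fit.means <- lm.fit(design, y)
fit.int   <- lm.fit(design.int, y)

# Same fitted values, different coefficient interpretation
all.equal(fit.means$fitted.values, fit.int$fitted.values)
fit.means$coefficients
fit.int$coefficients
```

Testing that the `grpB` and `grpC` coefficients of the intercept-format fit are both zero is therefore equivalent to testing that all three group means are equal.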


--------------------------------------------------------------------------------
/R/propeller.ttest.R:
--------------------------------------------------------------------------------
  1 | #' Performs t-tests of transformed cell type proportions
  2 | #'
  3 | #' This function is called by \code{propeller} and performs t-tests between two
  4 | #' experimental groups or conditions on the transformed cell type proportions.
  5 | #'
  6 | #' In order to run this function, the user needs to run the
  7 | #' \code{getTransformedProps} function first. The output from
  8 | #' \code{getTransformedProps} is used as input. The \code{propeller.ttest}
  9 | #' function expects that the design matrix is not in the intercept format
 10 | #' and a contrast vector needs to be supplied. This contrast vector will
 11 | #' identify the two groups to be tested. Note that additional confounding
 12 | #' covariates can be accounted for as extra columns in the design matrix after
 13 | #' the group-specific columns.
 14 | #'
 15 | #' The \code{propeller.ttest} function uses the \code{limma} functions
 16 | #' \code{lmFit}, \code{contrasts.fit} and \code{eBayes}, which has the additional
 17 | #' advantage that empirical Bayes shrinkage of the variances is performed.
 18 | #'
 19 | #' @param prop.list a list object containing two matrices:
 20 | #' \code{TransformedProps} and \code{Proportions}
 21 | #' @param design a design matrix with rows corresponding to samples and columns
 22 | #' to coefficients to be estimated
 23 | #' @param contrasts a vector specifying which columns of the design matrix
 24 | #' correspond to the two groups to test
 25 | #' @param robust logical, should robust variance estimation be used. Defaults to
 26 | #' TRUE.
 27 | #' @param trend logical, should a trend between means and variances be accounted
 28 | #' for. Defaults to FALSE.
 29 | #' @param sort logical, should the output be sorted by P-value.
 30 | #'
 31 | #' @return produces a dataframe of results
 32 | #'
 33 | #' @importFrom stats p.adjust
 34 | #' @importFrom limma lmFit
 35 | #' @importFrom limma eBayes
 36 | #' @importFrom limma contrasts.fit
 37 | #' @export propeller.ttest
 38 | #'
 39 | #' @author Belinda Phipson
 40 | #'
 41 | #' @seealso \code{\link{propeller}}, \code{\link{getTransformedProps}},
 42 | #' \code{\link{lmFit}}, \code{\link{contrasts.fit}}, \code{\link{eBayes}}
 43 | #'
 44 | #' @examples
 45 | #'   library(speckle)
 46 | #'   library(ggplot2)
 47 | #'   library(limma)
 48 | #'
 49 | #'   # Make up some data
 50 | #'
 51 | #'   # True cell type proportions for 4 samples
 52 | #'   p_s1 <- c(0.5,0.3,0.2)
 53 | #'   p_s2 <- c(0.6,0.3,0.1)
 54 | #'   p_s3 <- c(0.3,0.4,0.3)
 55 | #'   p_s4 <- c(0.4,0.3,0.3)
 56 | #'
 57 | #'   # Total numbers of cells per sample
 58 | #'   numcells <- c(1000,1500,900,1200)
 59 | #'
 60 | #'   # Generate cell-level vector for sample info
 61 | #'   biorep <- rep(c("s1","s2","s3","s4"),numcells)
 62 | #'   length(biorep)
 63 | #'
 64 | #'   # Numbers of cells for each of 3 clusters per sample
 65 | #'   n_s1 <- p_s1*numcells[1]
 66 | #'   n_s2 <- p_s2*numcells[2]
 67 | #'   n_s3 <- p_s3*numcells[3]
 68 | #'   n_s4 <- p_s4*numcells[4]
 69 | #'
 70 | #'   cl_s1 <- rep(c("c0","c1","c2"),n_s1)
 71 | #'   cl_s2 <- rep(c("c0","c1","c2"),n_s2)
 72 | #'   cl_s3 <- rep(c("c0","c1","c2"),n_s3)
 73 | #'   cl_s4 <- rep(c("c0","c1","c2"),n_s4)
 74 | #'
 75 | #'   # Generate cell-level vector for cluster info
 76 | #'   clust <- c(cl_s1,cl_s2,cl_s3,cl_s4)
 77 | #'   length(clust)
 78 | #'
 79 | #'   prop.list <- getTransformedProps(clusters = clust, sample = biorep)
 80 | #'
 81 | #'   # Assume s1 and s2 belong to group A and s3 and s4 belong to group B
 82 | #'   grp <- rep(c("A","B"), each=2)
 83 | #'
 84 | #'   design <- model.matrix(~0+grp)
 85 | #'   design
 86 | #'
 87 | #'   # Compare Grp A to B
 88 | #'   contrasts <- c(1,-1)
 89 | #'
 90 | #'   propeller.ttest(prop.list, design=design, contrasts=contrasts, robust=TRUE,
 91 | #'   trend=FALSE, sort=TRUE)
 92 | #'
 93 | #'   # Pretend additional sex variable exists and we want to control for it
 94 | #'   # in the linear model
 95 | #'   sex <- rep(c(0,1),2)
 96 | #'   des.sex <- model.matrix(~0+grp+sex)
 97 | #'   des.sex
 98 | #'
 99 | #'   # Compare Grp A to B
100 | #'   cont.sex <- c(1,-1,0)
101 | #'
102 | #'   propeller.ttest(prop.list, design=des.sex, contrasts=cont.sex, robust=TRUE,
103 | #'   trend=FALSE, sort=TRUE)
104 | #'
105 | #'
106 | propeller.ttest <- function(prop.list=prop.list, design=design,
107 |                             contrasts=contrasts, robust=robust, trend=trend,
108 |                             sort=sort)
109 | {
110 |     prop.trans <- prop.list$TransformedProps
111 |     prop <- prop.list$Proportions
112 | 
113 |     fit <- lmFit(prop.trans, design)
114 |     fit.cont <- contrasts.fit(fit, contrasts=contrasts)
115 |     fit.cont <- eBayes(fit.cont, robust=robust, trend=trend)
116 | 
117 |     # Get mean cell type proportions and relative risk for output
118 |     # If no confounding variable included in design matrix
119 |     if(length(contrasts)==2){
120 |         fit.prop <- lmFit(prop, design)
121 |         z <- apply(fit.prop$coefficients, 1, function(x) x^contrasts)
122 |         RR <- apply(z, 2, prod)
123 |     }
124 |     # If confounding variables included in design matrix exclude them
125 |     else{
126 |         new.des <- design[,contrasts!=0]
127 |         fit.prop <- lmFit(prop,new.des)
128 |         new.cont <- contrasts[contrasts!=0]
129 |         z <- apply(fit.prop$coefficients, 1, function(x) x^new.cont)
130 |         RR <- apply(z, 2, prod)
131 |     }
132 | 
133 |     fdr <- p.adjust(fit.cont$p.value, method="BH")
134 | 
135 |     out <- data.frame(PropMean=fit.prop$coefficients, PropRatio=RR,
136 |                     Tstatistic=fit.cont$t[,1], P.Value=fit.cont$p.value[,1],
137 |                     FDR=fdr)
138 |     if(sort){
139 |         o <- order(out$P.Value)
140 |         out[o,]
141 |     }
142 |     else out
143 | }
144 | 


--------------------------------------------------------------------------------
/R/speckle-package.R:
--------------------------------------------------------------------------------
1 | #' @keywords internal
2 | "_PACKAGE"
3 | 
4 | # The following block is used by usethis to automatically manage
5 | # roxygen namespace tags. Modify with care!
6 | ## usethis namespace: start
7 | ## usethis namespace: end
8 | NULL
9 | 


--------------------------------------------------------------------------------
/R/speckle_example_data.R:
--------------------------------------------------------------------------------
 1 | #' Generate example data
 2 | #'
 3 | #' @return a dataframe containing cell-level information for sample, group and
 4 | #' clusters
 5 | #'
 6 | #' @export
 7 | #'
 8 | #' @examples
 9 | #'
10 | #' speckle_example_data()
11 | #'
12 | speckle_example_data <- function(){
13 |     # Make up some data with two groups, two biological replicates in each
14 |     # group and three cell types
15 | 
16 |     # True cell type proportions for 4 samples
17 |     p_s1 <- c(0.5,0.3,0.2)
18 |     p_s2 <- c(0.6,0.3,0.1)
19 |     p_s3 <- c(0.3,0.4,0.3)
20 |     p_s4 <- c(0.4,0.3,0.3)
21 | 
22 |     # Total numbers of cells per sample
23 |     numcells <- c(1000,1500,900,1200)
24 | 
25 |     # Generate cell-level vector for sample info
26 |     biorep <- rep(c("s1","s2","s3","s4"),numcells)
27 |     length(biorep)
28 | 
29 |     # Numbers of cells for each of the 3 clusters per sample
30 |     n_s1 <- p_s1*numcells[1]
31 |     n_s2 <- p_s2*numcells[2]
32 |     n_s3 <- p_s3*numcells[3]
33 |     n_s4 <- p_s4*numcells[4]
34 | 
35 |     # Assign cluster labels for 4 samples
36 |     cl_s1 <- rep(c("c0","c1","c2"),n_s1)
37 |     cl_s2 <- rep(c("c0","c1","c2"),n_s2)
38 |     cl_s3 <- rep(c("c0","c1","c2"),n_s3)
39 |     cl_s4 <- rep(c("c0","c1","c2"),n_s4)
40 | 
41 |     # Generate cell-level vector for cluster info
42 |     clust <- c(cl_s1,cl_s2,cl_s3,cl_s4)
43 |     length(clust)
44 | 
45 |     # Assume s1 and s2 belong to group 1 and s3 and s4 belong to group 2
46 |     grp <- rep(c("grp1","grp2"),c(sum(numcells[1:2]),sum(numcells[3:4])))
47 | 
48 |     data.frame(clusters=clust, samples=biorep, group=grp)
49 | }
50 | 


--------------------------------------------------------------------------------
/R/sysdata.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Oshlack/speckle/9347bf07b5cdc49ecedc0042d3a007742db01691/R/sysdata.rda


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # speckle
  3 | 
  4 | <!-- badges: start -->
  5 | <!-- badges: end -->
  6 | 
  7 | **Please note that the speckle package has moved to the following repository:
  8 | https://github.com/phipsonlab. Version 0.0.3 of
  9 | the package will remain here but will no longer be actively updated/maintained. 
 10 | This version of the package contains the propeller and classifySex functions.** 
 11 | 
 12 | ## Citation
 13 | The propeller method has now been accepted for publication in *Bioinformatics*. 
 14 | Please use the following citation when using propeller: \
 15 | Belinda Phipson, Choon Boon Sim, Enzo R Porrello, Alex W Hewitt, Joseph Powell, Alicia Oshlack, propeller: testing for differences in cell type proportions in single cell data, *Bioinformatics*, 2022, btac582, [https://doi.org/10.1093/bioinformatics/btac582](https://doi.org/10.1093/bioinformatics/btac582)
 16 | 
 17 | The preprint is available on [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.11.28.470236v2.full).
 18 | 
 19 | ## Introduction
 20 | The speckle package currently contains functions to analyse differences in cell 
 21 | type proportions in single cell RNA-seq data, and to classify cells as male or 
 22 | female. As our research into specialised 
 23 | analyses of single cell data continues we anticipate that the package will be 
 24 | updated with new functions.
 25 | 
 26 | The propeller function performs statistical tests for differences in cell
 27 | type composition in single cell data. In order to test for differences in cell
 28 | type proportions between multiple experimental conditions at least one of the 
 29 | groups must have some form of biological replication (i.e. at least two 
 30 | samples). For a two group scenario, the absolute minimum sample size is thus 
 31 | three. Since there are many technical aspects which can affect cell type 
 32 | proportion estimates, having biological replication is essential for a 
 33 | meaningful analysis.
 34 | 
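As a rough check of the replication requirement described above, the sketch below counts samples per group with base R. The data frame and its column names (`samples`, `group`) are made up for illustration and mirror the layout returned by `speckle_example_data()`; the check itself is not part of the speckle API.

```r
# Hypothetical cell-level annotation: one row per cell (illustrative only)
cell_info <- data.frame(
    samples = rep(c("s1", "s2", "s3"), times = c(100, 120, 90)),
    group   = rep(c("grp1", "grp2"), times = c(220, 90))
)

# Collapse to one row per sample, then count the samples in each group
sample_info <- unique(cell_info[, c("samples", "group")])
n_per_group <- table(sample_info$group)
n_per_group

# At least one group needs two or more samples (biological replication)
has_replication <- any(n_per_group >= 2)
has_replication
```

Here one group has two samples and the other has one, which is the three-sample minimum for a two group comparison.
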
 35 | The propeller function takes a SingleCellExperiment or Seurat object as input,
 36 | extracts the relevant cell information, and tests whether the cell type 
 37 | proportions are statistically significantly different between experimental
 38 | conditions/groups. The user can also explicitly pass the cluster, sample and 
 39 | experimental group information to propeller. P-values and false discovery rates 
 40 | are reported for each cell type. 
 41 | 
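The false discovery rates come from a Benjamini-Hochberg adjustment of the per-cell-type p-values (the package source calls `p.adjust(..., method = "BH")`). A minimal base R sketch of that step, with p-values invented purely for illustration:

```r
# Hypothetical p-values, one per cell type (illustrative numbers only)
p.value <- c(c0 = 0.001, c1 = 0.04, c2 = 0.30)

# Benjamini-Hochberg adjustment, as used for the FDR column
fdr <- p.adjust(p.value, method = "BH")

data.frame(P.Value = p.value, FDR = fdr)
```
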
 42 | ## Installation
 43 | 
 44 | If you would like to view the speckle vignette, you can install the released 
 45 | version of speckle from [GitHub](https://github.com/Oshlack/speckle) using the 
 46 | following commands:
 47 | 
 48 | ``` r
 49 | # devtools/remotes won't install Suggested packages from Bioconductor
 50 | BiocManager::install(c("CellBench", "BiocStyle", "scater"))
 51 | 
 52 | remotes::install_github("Oshlack/speckle", build_vignettes = TRUE, 
 53 | dependencies = "Suggests")
 54 | ```
 55 | 
 56 | In order to view the vignette for speckle use the following command:
 57 | 
 58 | ``` r
 59 | browseVignettes("speckle")
 60 | ```
 61 | 
 62 | If you don't care to view the glorious vignette you can also install speckle as 
 63 | follows:
 64 | 
 65 | ``` r
 66 | library(devtools)
 67 | devtools::install_github("Oshlack/speckle")
 68 | ```
 69 | 
 70 | ## Example
 71 | 
 72 | This is a basic example which shows you how to test for differences in cell 
 73 | type proportions in a two group experimental set up.
 74 | 
 75 | ``` r
 76 | library(speckle)
 77 | library(limma)
 78 | library(ggplot2)
 79 | 
 80 | # Get some example data which has two groups, three cell types and two 
 81 | # biological replicates in each group
 82 | cell_info <- speckle_example_data()
 83 | head(cell_info)
 84 | 
 85 | # Run propeller testing for cell type proportion differences between the two 
 86 | # groups
 87 | propeller(clusters = cell_info$clusters, sample = cell_info$samples, 
 88 | group = cell_info$group)
 89 | 
 90 | # Plot cell type proportions
 91 | plotCellTypeProps(clusters=cell_info$clusters, sample=cell_info$samples)
 92 | ```
 93 | Please note that this basic implementation is for when you are only modelling
 94 | group information. When you have additional covariates that you would like to 
 95 | account for, please use the propeller.ttest() and propeller.anova() functions
 96 | directly. Please read the vignette for examples on how to model a continuous 
 97 | variable, account for additional covariates and include a random effect in the 
 98 | analysis. 
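
As a minimal sketch of the covariate case, the design matrix below has no intercept, with the group columns first and an assumed `sex` covariate after them, which is the layout `propeller.ttest()` expects. The sample annotation is hypothetical, and the commented call assumes a `prop.list` created by `getTransformedProps()`.

```r
# Hypothetical sample-level annotation for four samples
grp <- rep(c("A", "B"), each = 2)
sex <- rep(c(0, 1), 2)

# No-intercept design: group columns first, covariate columns after
design <- model.matrix(~ 0 + grp + sex)
design

# Contrast comparing group A with group B; zero weight on the covariate
contr <- c(1, -1, 0)

# With speckle installed and prop.list from getTransformedProps():
# propeller.ttest(prop.list, design = design, contrasts = contr,
#                 robust = TRUE, trend = FALSE, sort = TRUE)
```

The contrast vector needs one entry per design column, so any extra covariate adds a trailing zero.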
 99 | 
100 | 
101 | 


--------------------------------------------------------------------------------
/man/classifySex.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/classifySex.R
 3 | \name{classifySex}
 4 | \alias{classifySex}
 5 | \title{Predict sex of cells in scRNA-seq data}
 6 | \usage{
 7 | classifySex(x, genome = NULL, qc = TRUE)
 8 | }
 9 | \arguments{
10 | \item{x}{counts matrix, rows correspond to genes and columns correspond to 
11 | cells. Row names must be gene symbols.}
12 | 
13 | \item{genome}{the genome the data arises from. Current options are 
14 | human: genome = "Hs" or mouse: genome = "Mm".}
15 | 
16 | \item{qc}{logical, indicates whether to perform quality control or not. 
17 | qc = TRUE will predict cells that pass quality control only and the filtered 
18 | cells will not be classified. qc = FALSE will predict every cell except the 
19 | cells with zero counts on *XIST/Xist* and the sum of the Y genes. Default is TRUE.}
20 | }
21 | \value{
22 | a dataframe with predicted labels for each cell
23 | }
24 | \description{
25 | This function will predict the sex for each cell in scRNA-seq data. The 
26 | classifier is based on logistic regression models that have been trained
27 | on mouse and human single cell RNA-seq data.
28 | }
29 | \details{
30 | For bulk RNA-seq, checking the sex of the samples for mouse and human 
31 | experiments is trivial as we can simply check the expression of *Xist/XIST*. 
32 | It is not as simple for single cell RNA-seq data as the number of counts 
33 | measured per gene and per cell is often quite low. Simply relying on cut-offs
34 | on the expression of genes like *Xist* means that many cells are unable to 
35 | be classified. Hence we have developed a classifier based on a combination of
36 | X- and Y-linked genes in order to accurately predict the sex of each cell.
37 | 
38 | Cells with zero counts on Xist and the sum of the Y chromosome genes will 
39 | not be classified as there is simply not enough information to accurately 
40 | classify as Male/Female, and NAs will be returned. In addition, the user has 
41 | the option to perform quality control on the data first, by specifying 
42 | \code{qc=TRUE}, which will not classify cells that are deemed low-quality.
43 | }
44 | \examples{
45 | 
46 | library(speckle)
47 | library(SingleCellExperiment)
48 | library(CellBench)
49 | library(org.Hs.eg.db)
50 | 
51 | sc_data <- load_sc_data()
52 | sc_10x <- sc_data$sc_10x
53 | 
54 | counts <- counts(sc_10x)
55 | ann <- select(org.Hs.eg.db, keys=rownames(sc_10x),
56 |              columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL")
57 | m <- match(rownames(counts), ann$ENSEMBL)
58 | rownames(counts) <- ann$SYMBOL[m]
59 | 
60 | sex <- classifySex(counts, genome="Hs")
61 | 
62 | table(sex$prediction)
63 | boxplot(counts["XIST",]~sex$prediction)
64 |    
65 | }
66 | \author{
67 | Xinyi Jin
68 | }
69 | 


--------------------------------------------------------------------------------
/man/dot-extractSCE.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/propeller.R
 3 | \name{.extractSCE}
 4 | \alias{.extractSCE}
 5 | \title{Extract metadata from \code{SingleCellExperiment} object}
 6 | \usage{
 7 | .extractSCE(x)
 8 | }
 9 | \arguments{
10 | \item{x}{object of class \code{SingleCellExperiment}}
11 | }
12 | \value{
13 | a dataframe containing clusters, sample and group
14 | }
15 | \description{
16 | This is an accessor function that extracts cluster, sample and group
17 | information for each cell.
18 | }
19 | \author{
20 | Belinda Phipson
21 | }
22 | 


--------------------------------------------------------------------------------
/man/dot-extractSeurat.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/propeller.R
 3 | \name{.extractSeurat}
 4 | \alias{.extractSeurat}
 5 | \title{Extract metadata from \code{Seurat} object}
 6 | \usage{
 7 | .extractSeurat(x)
 8 | }
 9 | \arguments{
10 | \item{x}{object of class \code{Seurat}}
11 | }
12 | \value{
13 | a dataframe containing clusters, sample and group
14 | }
15 | \description{
16 | This is an accessor function that extracts cluster, sample and group
17 | information for each cell.
18 | }
19 | \author{
20 | Belinda Phipson
21 | }
22 | 


--------------------------------------------------------------------------------
/man/estimateBetaParam.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/estimateBetaParam.R
 3 | \name{estimateBetaParam}
 4 | \alias{estimateBetaParam}
 5 | \title{Estimate the parameters of a Beta distribution}
 6 | \usage{
 7 | estimateBetaParam(x)
 8 | }
 9 | \arguments{
10 | \item{x}{a vector of proportions.}
11 | }
12 | \value{
13 | a list object with the estimate of alpha in \code{a} and beta in
14 | \code{b}.
15 | }
16 | \description{
17 | This function estimates the two parameters of the Beta distribution, alpha
18 | and beta, given a vector of proportions. It uses the method of moments to
19 | do this.
20 | }
21 | \examples{
22 | # Generate proportions from a beta distribution
23 | props <- rbeta(1000, shape1=2, shape2=10)
24 | estimateBetaParam(props)
25 | 
26 | }
27 | \author{
28 | Belinda Phipson
29 | }
30 | 


--------------------------------------------------------------------------------
/man/estimateBetaParamsFromCounts.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/estimateBetaParamsFromCounts.R
 3 | \name{estimateBetaParamsFromCounts}
 4 | \alias{estimateBetaParamsFromCounts}
 5 | \title{Estimate parameters of a Beta distribution from counts}
 6 | \usage{
 7 | estimateBetaParamsFromCounts(x)
 8 | }
 9 | \arguments{
10 | \item{x}{a matrix of counts}
11 | }
12 | \value{
13 | outputs a list object with the following components
14 | \item{n }{Normalised library size}
15 | \item{alpha }{a vector of alpha parameters for the Beta distribution for 
16 | each cell type}
17 | \item{beta }{vector of beta parameters for the Beta distribution for 
18 | each cell type}
19 | \item{pi }{Estimate of the true proportion for each cell type}
20 | \item{dispersion }{Dispersion estimates for each cell type}
21 | \item{var }{Variance estimates for each cell type}
22 | }
23 | \description{
24 | This function estimates the two parameters of the Beta distribution, alpha
25 | and beta for each cell type. The input is a matrix of cell type counts, 
26 | where the rows are the cell types/clusters and the columns are the samples.
27 | }
28 | \details{
29 | This function is called from the plotting function \code{plotCellTypeMeanVar}
30 | in order to estimate the variance for the Beta-Binomial distribution for 
31 | each cell type.
32 | }
33 | \examples{
34 | data <- speckle_example_data()
35 | x <- table(data$clusters, data$samples)
36 | estimateBetaParamsFromCounts(x)
37 | 
38 | 
39 | }
40 | \author{
41 | Belinda Phipson
42 | }
43 | 


--------------------------------------------------------------------------------
/man/getTransformedProps.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/getTransformedProps.R
 3 | \name{getTransformedProps}
 4 | \alias{getTransformedProps}
 5 | \title{Calculates and transforms cell type proportions}
 6 | \usage{
 7 | getTransformedProps(clusters = clusters, sample = sample, transform = NULL)
 8 | }
 9 | \arguments{
10 | \item{clusters}{a factor specifying the cluster or cell type for every cell.}
11 | 
12 | \item{sample}{a factor specifying the biological replicate for every cell.}
13 | 
14 | \item{transform}{a character scalar specifying which transformation of the 
15 | proportions to perform. Possible values include "asin" or "logit". Defaults
16 | to "asin".}
17 | }
18 | \value{
19 | outputs a list object with the following components
20 | \item{Counts }{A matrix of cell type counts with
21 | the rows corresponding to the clusters/cell types and the columns
22 | corresponding to the biological replicates/samples.}
23 | \item{TransformedProps }{A matrix of transformed cell type proportions with
24 | the rows corresponding to the clusters/cell types and the columns
25 | corresponding to the biological replicates/samples.} 
26 | \item{Proportions }{A  matrix of cell type proportions with the rows 
27 | corresponding to the clusters/cell types and the columns corresponding to 
28 | the biological replicates/samples.}
29 | }
30 | \description{
31 | Calculates cell type proportions based on clusters/cell types and sample
32 | information and performs a variance stabilising transformation on the
33 | proportions.
34 | }
35 | \details{
36 | This function is called by the \code{propeller} function and calculates cell
37 | type proportions and performs an arcsin-square root transformation.
38 | }
39 | \examples{
40 | 
41 |   library(speckle)
42 |   library(ggplot2)
43 |   library(limma)
44 | 
45 |   # Make up some data
46 | 
47 |   # True cell type proportions for 4 samples
48 |   p_s1 <- c(0.5,0.3,0.2)
49 |   p_s2 <- c(0.6,0.3,0.1)
50 |   p_s3 <- c(0.3,0.4,0.3)
51 |   p_s4 <- c(0.4,0.3,0.3)
52 | 
53 |   # Total numbers of cells per sample
54 |   numcells <- c(1000,1500,900,1200)
55 | 
56 |   # Generate cell-level vector for sample info
57 |   biorep <- rep(c("s1","s2","s3","s4"),numcells)
58 |   length(biorep)
59 | 
60 |   # Numbers of cells for each of 3 clusters per sample
61 |   n_s1 <- p_s1*numcells[1]
62 |   n_s2 <- p_s2*numcells[2]
63 |   n_s3 <- p_s3*numcells[3]
64 |   n_s4 <- p_s4*numcells[4]
65 | 
66 |   cl_s1 <- rep(c("c0","c1","c2"),n_s1)
67 |   cl_s2 <- rep(c("c0","c1","c2"),n_s2)
68 |   cl_s3 <- rep(c("c0","c1","c2"),n_s3)
69 |   cl_s4 <- rep(c("c0","c1","c2"),n_s4)
70 | 
71 |   # Generate cell-level vector for cluster info
72 |   clust <- c(cl_s1,cl_s2,cl_s3,cl_s4)
73 |   length(clust)
74 | 
75 |   getTransformedProps(clusters = clust, sample = biorep)
76 | 
77 | }
78 | \seealso{
79 | \code{\link{propeller}}
80 | }
81 | \author{
82 | Belinda Phipson
83 | }
84 | 


--------------------------------------------------------------------------------
/man/ggplotColors.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/ggplotColors.R
 3 | \name{ggplotColors}
 4 | \alias{ggplotColors}
 5 | \title{Output a vector of colours based on the ggplot colour scheme}
 6 | \usage{
 7 | ggplotColors(g)
 8 | }
 9 | \arguments{
10 | \item{g}{the number of colours to be generated.}
11 | }
12 | \value{
13 | a vector with the names of the colours.
14 | }
15 | \description{
16 | This function takes as input the number of colours the user would like, and
17 | outputs a vector of colours in the ggplot colour scheme.
18 | }
19 | \examples{
20 | # Generate a palette of 6 colours
21 | cols <- ggplotColors(6)
22 | cols
23 | 
24 | # Generate some count data
25 | y <- matrix(rnbinom(600, mu=100, size=1), ncol=6)
26 | 
27 | par(mfrow=c(1,1))
28 | boxplot(y, col=cols)
29 | 
30 | }
31 | \author{
32 | Belinda Phipson
33 | }
34 | 


--------------------------------------------------------------------------------
/man/normCounts.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/normCounts.R
 3 | \name{normCounts}
 4 | \alias{normCounts}
 5 | \title{Normalise a counts matrix to the median library size}
 6 | \usage{
 7 | normCounts(x, log = FALSE, prior.count = 0.5, lib.size = NULL)
 8 | }
 9 | \arguments{
10 | \item{x}{a \code{DGEList} object or matrix of counts.}
11 | 
12 | \item{log}{logical, indicates whether the output should be on the log2 scale
13 | or counts scale. Default is FALSE.}
14 | 
15 | \item{prior.count}{The prior count to add if the data is log2 normalised.
16 | Default is a small count of 0.5.}
17 | 
18 | \item{lib.size}{a vector of library sizes to be used during the normalisation
19 | step. Default is NULL and will be computed from the counts matrix.}
20 | }
21 | \value{
22 | a matrix of normalised counts
23 | }
24 | \description{
25 | This function takes a \code{DGEList} object or matrix of counts and
26 | normalises the counts to the median library size. This puts the normalised
27 | counts on a similar scale to the original counts.
28 | }
29 | \details{
30 | If the input is a DGEList object, the normalisation factors in
31 | \code{norm.factors} are taken into account in the normalisation. The prior
32 | counts are added proportionally to the library size.
33 | }
34 | \examples{
35 | # Simulate some data from a negative binomial distribution with mean equal
36 | # to 100 and dispersion set to 1. Simulate 1000 genes and 6 samples.
37 | y <- matrix(rnbinom(6000, mu = 100, size = 1), ncol = 6)
38 | 
39 | # Normalise the counts
40 | norm.y <- normCounts(y)
41 | 
42 | # Return log2 normalised counts
43 | lnorm.y <- normCounts(y, log=TRUE)
44 | 
45 | # Return log2 normalised counts with prior.count = 2
46 | lnorm.y2 <- normCounts(y, log=TRUE, prior.count=2)
47 | 
48 | par(mfrow=c(1,2))
49 | boxplot(norm.y, main="Normalised counts")
50 | boxplot(lnorm.y, main="Log2-normalised counts")
51 | 
52 | }
53 | \author{
54 | Belinda Phipson
55 | }
56 | 


--------------------------------------------------------------------------------
/man/normSca.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/normSca.R
 3 | \name{normSca}
 4 | \alias{normSca}
 5 | \title{Normalise and scale counts matrix}
 6 | \usage{
 7 | normSca(x, lib.size = lib.size, log = TRUE, prior.count = 0.5)
 8 | }
 9 | \arguments{
10 | \item{x}{a matrix of counts}
11 | 
12 | \item{lib.size}{the library size}
13 | 
14 | \item{log}{logical, indicates whether the output should be on the log2 scale
15 | or counts scale. Default is TRUE.}
16 | 
17 | \item{prior.count}{The prior count to add if the data is log2 normalised.
18 | Default is a small count of 0.5.}
19 | }
20 | \value{
21 | a dataframe of log-normalised and scaled counts
22 | }
23 | \description{
24 | This function is called by the \code{preprocess} function and performs
25 | log-normalisation and scaling of a counts matrix.
26 | }
27 | \examples{
28 | 
29 | y <- matrix(rnbinom(1000, size=2, mu= 20),ncol=10)
30 | colnames(y)<- paste("Cell",1:10, sep="")
31 | row.names(y)<-paste("Gene",1:100,sep="")
32 | norm.data <- normSca(y,lib.size=colSums(y))
33 | 
34 | #Visualise the counts vs scaled data
35 | boxplot(y)
36 | boxplot(norm.data)
37 | 
38 | }
39 | 


--------------------------------------------------------------------------------
/man/plotCellTypeMeanVar.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/plotCellTypeMeanVar.R
 3 | \name{plotCellTypeMeanVar}
 4 | \alias{plotCellTypeMeanVar}
 5 | \title{Plot cell type counts means versus variances}
 6 | \usage{
 7 | plotCellTypeMeanVar(x)
 8 | }
 9 | \arguments{
10 | \item{x}{a matrix or table of counts}
11 | }
12 | \value{
13 | a base R plot
14 | }
15 | \description{
16 | This function returns a plot of the log10(mean) versus log10(variance) of 
17 | the cell type counts. The function takes a matrix of cell type counts as 
18 | input. The rows are the clusters/cell types and the columns are the samples.
19 | }
20 | \details{
21 | The expected variance under a binomial distribution is shown in the solid 
22 | line, and the points represent the observed variance for each cell type/row 
23 | in the counts table. The expected variances under different model assumptions
24 | are shown in the different coloured lines.
25 | 
26 | The mean and variance for each cell type are calculated across all samples.
27 | }
28 | \examples{
29 | library(edgeR)
30 | # Generate some data
31 | # Total number of samples
32 | nsamp <- 10
33 | # True cell type proportions
34 | p <- c(0.05, 0.15, 0.35, 0.45)
35 | 
36 | # Parameters for beta distribution
37 | a <- 40
38 | b <- a*(1-p)/p
39 | 
40 | # Sample total cell counts per sample from negative binomial distribution
41 | numcells <- rnbinom(nsamp,size=20,mu=5000)
42 | true.p <- matrix(c(rbeta(nsamp,a,b[1]),rbeta(nsamp,a,b[2]),
43 |           rbeta(nsamp,a,b[3]),rbeta(nsamp,a,b[4])),byrow=TRUE, ncol=nsamp)
44 | 
45 | counts <- matrix(NA,ncol=nsamp, nrow=nrow(true.p))
46 | rownames(counts) <- paste("c",0:(nrow(true.p)-1), sep="")
47 | for(j in 1:length(p)){
48 |     counts[j,] <- rbinom(nsamp, size=numcells, prob=true.p[j,])
49 | }
50 | 
51 | plotCellTypeMeanVar(counts)
52 | 
53 | }
54 | \author{
55 | Belinda Phipson
56 | }
57 | 


--------------------------------------------------------------------------------
/man/plotCellTypeProps.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/plotCellTypeProps.R
 3 | \name{plotCellTypeProps}
 4 | \alias{plotCellTypeProps}
 5 | \title{Plot cell type proportions for each sample}
 6 | \usage{
 7 | plotCellTypeProps(x = NULL, clusters = NULL, sample = NULL)
 8 | }
 9 | \arguments{
10 | \item{x}{object of class \code{SingleCellExperiment} or \code{Seurat}}
11 | 
12 | \item{clusters}{a factor specifying the cluster or cell type for every cell.
13 | For \code{SingleCellExperiment} objects this should correspond to a column
14 | called \code{clusters} in the \code{colData} slot. For \code{Seurat}
15 | objects this will be extracted by a call to \code{Idents(x)}.}
16 | 
17 | \item{sample}{a factor specifying the biological replicate for each cell.
18 | For \code{SingleCellExperiment} objects this should correspond to a column
19 | called \code{sample} in the \code{colData} slot and for \code{Seurat}
20 | objects this should correspond to \code{x$sample}.}
21 | }
22 | \value{
23 | a ggplot2 object
24 | }
25 | \description{
26 | This is a plotting function that shows the cell type composition for each
27 | sample as a stacked barplot. The \code{plotCellTypeProps} returns a
28 | \code{ggplot2} object enabling the user to make style changes as required.
29 | }
30 | \examples{
31 | 
32 | library(speckle)
33 | library(ggplot2)
34 | library(limma)
35 | 
36 | # Generate some fake data from a multinomial distribution
37 | # Group A, 4 samples, 1000 cells in each sample
38 | countsA <- rmultinom(4, size=1000, prob=c(0.1,0.3,0.6))
39 | colnames(countsA) <- paste("s",1:4,sep="")
40 | 
41 | # Group B, 3 samples, 800 cells in each sample
42 | 
43 | countsB <- rmultinom(3, size=800, prob=c(0.2,0.05,0.75))
44 | colnames(countsB) <- paste("s",5:7,sep="")
45 | rownames(countsA) <- rownames(countsB) <- paste("c",0:2,sep="")
46 | 
47 | allcounts <- cbind(countsA, countsB)
48 | sample <- c(rep(colnames(allcounts),allcounts[1,]),
49 |           rep(colnames(allcounts),allcounts[2,]),
50 |           rep(colnames(allcounts),allcounts[3,]))
51 | clust <- rep(rownames(allcounts),rowSums(allcounts))
52 | 
53 | plotCellTypeProps(clusters=clust, sample=sample)
54 | 
55 | }
56 | \author{
57 | Belinda Phipson
58 | }
59 | 


--------------------------------------------------------------------------------
/man/plotCellTypePropsMeanVar.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/plotCellTypePropsMeanVar.R
 3 | \name{plotCellTypePropsMeanVar}
 4 | \alias{plotCellTypePropsMeanVar}
 5 | \title{Plot cell type proportions versus variances}
 6 | \usage{
 7 | plotCellTypePropsMeanVar(x)
 8 | }
 9 | \arguments{
10 | \item{x}{a matrix or table of counts}
11 | }
12 | \value{
13 | a base R plot
14 | }
15 | \description{
16 | This function returns a plot of the log10(proportion) versus log10(variance)
17 | given a matrix of cell type counts. The rows are the clusters/cell types and
18 | the columns are the samples.
19 | }
20 | \details{
21 | The expected variance under a binomial distribution is shown in the solid 
22 | line, and the points represent the observed variance for each cell type/row 
23 | in the counts table. The blue line shows the empirical Bayes variance 
24 | that is used in \code{propeller}.
25 | 
26 | The mean and variance for each cell type are calculated across all samples.
27 | }
28 | \examples{
29 | library(limma)
30 | # Generate some data
31 | # Total number of samples
32 | nsamp <- 10
33 | # True cell type proportions
34 | p <- c(0.05, 0.15, 0.35, 0.45)
35 | 
36 | # Parameters for beta distribution
37 | a <- 40
38 | b <- a*(1-p)/p
39 | 
40 | # Sample total cell counts per sample from negative binomial distribution
41 | numcells <- rnbinom(nsamp,size=20,mu=5000)
42 | true.p <- matrix(c(rbeta(nsamp,a,b[1]),rbeta(nsamp,a,b[2]),
43 |           rbeta(nsamp,a,b[3]),rbeta(nsamp,a,b[4])),byrow=TRUE, ncol=nsamp)
44 | 
45 | counts <- matrix(NA,ncol=nsamp, nrow=nrow(true.p))
46 | rownames(counts) <- paste("c",0:(nrow(true.p)-1), sep="")
47 | for(j in 1:length(p)){
48 |     counts[j,] <- rbinom(nsamp, size=numcells, prob=true.p[j,])
49 | }
50 | 
51 | plotCellTypePropsMeanVar(counts)
52 | 
53 | }
54 | \author{
55 | Belinda Phipson
56 | }
57 | 


--------------------------------------------------------------------------------
/man/preprocess.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/preprocess.R
 3 | \name{preprocess}
 4 | \alias{preprocess}
 5 | \title{Pre-processing function for sex classification}
 6 | \usage{
 7 | preprocess(x, genome = genome, qc = qc)
 8 | }
 9 | \arguments{
10 | \item{x}{the counts matrix, rows are genes and columns are cells. Row names 
11 | must be gene symbols.}
12 | 
13 | \item{genome}{the genome the data arises from. Current options are 
14 | human: genome = "Hs" or mouse: genome = "Mm".}
15 | 
16 | \item{qc}{logical, indicates whether to perform additional quality control on 
17 | the cells. qc = TRUE will predict cells that pass quality control only and 
18 | the filtered cells will not be classified. qc = FALSE will predict every cell 
19 | except the cells with zero counts on *XIST/Xist* and the sum of the Y genes. 
20 | Default is TRUE.}
21 | }
22 | \value{
23 | outputs a list object with the following components
24 | \item{tcm.final }{A transposed count matrix where rows are cells and columns 
25 | are the features used for classification.}
26 | \item{data.df }{The normalised and scaled \code{tcm.final} matrix.} 
27 | \item{discarded.cells }{Character vector of cell IDs for the cells that are
28 | discarded when \code{qc=TRUE}.}
29 | \item{zero.cells }{Character vector of cell IDs for the cells that cannot
30 | be classified as male/female due to zero counts on *XIST/Xist* and all the 
31 | Y chromosome genes.}
32 | }
33 | \description{
34 | The purpose of this function is to process a single cell counts matrix into
35 | the appropriate format for the \code{classifySex} function.
36 | }
37 | \details{
38 | This function will filter out cells that are unable to be classified due to 
39 | zero counts on *XIST/Xist* and all of the Y chromosome genes. If 
40 | \code{qc=TRUE} additional cells are removed as identified by the 
41 | \code{perCellQCMetrics} and \code{quickPerCellQC} functions from the 
42 | \code{scuttle} package. The resulting counts matrix is then log-normalised 
43 | and scaled.
44 | }
45 | \examples{
46 | 
47 | library(speckle)
48 | library(SingleCellExperiment)
49 | library(CellBench)
50 | library(org.Hs.eg.db)
51 | 
52 | # Get data from CellBench library
53 | sc_data <- load_sc_data()
54 | sc_10x <- sc_data$sc_10x
55 | 
56 | # Get counts matrix in correct format with gene symbol as rownames 
57 | # rather than ENSEMBL ID.
58 | counts <- counts(sc_10x)
59 | ann <- select(org.Hs.eg.db, keys=rownames(sc_10x),
60 |              columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL")
61 | m <- match(rownames(counts), ann$ENSEMBL)
62 | rownames(counts) <- ann$SYMBOL[m]
63 | 
64 | # Preprocess data
65 | pro.data <- preprocess(counts, genome="Hs", qc = TRUE)
66 | 
67 | # Look at counts on XIST and superY.all
68 | plot(pro.data$tcm.final$XIST, pro.data$tcm.final$superY)
69 | 
70 | # Cells that are identified as low quality
71 | pro.data$discarded.cells
72 | 
73 | # Cells with zero counts on XIST and all Y genes
74 | pro.data$zero.cells
75 | 
76 | }
77 | 


--------------------------------------------------------------------------------
/man/propeller.Rd:
--------------------------------------------------------------------------------
  1 | % Generated by roxygen2: do not edit by hand
  2 | % Please edit documentation in R/propeller.R
  3 | \name{propeller}
  4 | \alias{propeller}
  5 | \title{Finding statistically significant differences in cell type proportions}
  6 | \usage{
  7 | propeller(
  8 |   x = NULL,
  9 |   clusters = NULL,
 10 |   sample = NULL,
 11 |   group = NULL,
 12 |   trend = FALSE,
 13 |   robust = TRUE,
 14 |   transform = "logit"
 15 | )
 16 | }
 17 | \arguments{
 18 | \item{x}{object of class \code{SingleCellExperiment} or \code{Seurat}}
 19 | 
 20 | \item{clusters}{a factor specifying the cluster or cell type for every cell.
 21 | For \code{SingleCellExperiment} objects this should correspond to a column
 22 | called \code{clusters} in the \code{colData} assay. For \code{Seurat}
 23 | objects this will be extracted by a call to \code{Idents(x)}.}
 24 | 
 25 | \item{sample}{a factor specifying the biological replicate for each cell.
 26 | For \code{SingleCellExperiment} objects this should correspond to a column
 27 | called \code{sample} in the \code{colData} assay and for \code{Seurat}
 28 | objects this should correspond to \code{x$sample}.}
 29 | 
 30 | \item{group}{a factor specifying the groups of interest for performing the
 31 | differential proportions analysis. For \code{SingleCellExperiment} objects
 32 | this should correspond to a column called \code{group} in the \code{colData}
 33 | assay.  For \code{Seurat} objects this should correspond to \code{x$group}.}
 34 | 
 35 | \item{trend}{logical, if true fits a mean variance trend on the transformed
 36 | proportions}
 37 | 
 38 | \item{robust}{logical, if true performs robust empirical Bayes shrinkage of
 39 | the variances}
 40 | 
 41 | \item{transform}{a character scalar specifying which transformation of the 
 42 | proportions to perform. Possible values include "asin" or "logit". Defaults
 43 | to "logit".}
 44 | }
 45 | \value{
 46 | produces a dataframe of results
 47 | }
 48 | \description{
 49 | Calculates cell type proportions, performs a variance stabilising
 50 | transformation on the proportions and determines whether the cell type
 51 | proportions are statistically significantly different between groups using
 52 | linear modelling.
 53 | }
 54 | \details{
 55 | This function will take a \code{SingleCellExperiment} or \code{Seurat}
 56 | object and extract the \code{group}, \code{sample} and \code{clusters} cell
 57 | information. The user can either state these factor vectors explicitly in
 58 | the call to the \code{propeller} function, or internal functions will
 59 | extract them from the relevant objects. The user must ensure that
 60 | \code{group} and \code{sample} are columns in the metadata assays of the
 61 | relevant objects (any combination of upper/lower case is acceptable). For
 62 | \code{Seurat} objects the clusters are extracted using the \code{Idents}
 63 | function. For \code{SingleCellExperiment} objects, \code{clusters} needs to
 64 | be a column in the \code{colData} assay.
 65 | 
 66 | The \code{propeller} function calculates cell type proportions for each
 67 | biological replicate, performs a variance stabilising transformation on the
 68 | matrix of proportions and fits a linear model for each cell type or cluster
 69 | using the \code{limma} framework. There are two options for the 
 70 | transformation: arcsin square root or logit. Propeller tests whether there 
 71 | is a difference in the cell type proportions between multiple groups. 
 72 | If there are only 2 groups, a t-test is used to calculate p-values, and if 
 73 | there are more than 2 groups, an F-test (ANOVA) is used. Cell type 
 74 | proportions of 1 or 0 are accommodated. Benjamini and Hochberg false 
 75 | discovery rates are calculated to account for multiple testing of 
 76 | cell types/clusters.
 77 | }
 78 | \examples{
 79 | 
 80 |   library(speckle)
 81 |   library(ggplot2)
 82 |   library(limma)
 83 | 
 84 |   # Make up some data
 85 |   # True cell type proportions for 4 samples
 86 |   p_s1 <- c(0.5,0.3,0.2)
 87 |   p_s2 <- c(0.6,0.3,0.1)
 88 |   p_s3 <- c(0.3,0.4,0.3)
 89 |   p_s4 <- c(0.4,0.3,0.3)
 90 | 
 91 |   # Total numbers of cells per sample
 92 |   numcells <- c(1000,1500,900,1200)
 93 | 
 94 |   # Generate cell-level vector for sample info
 95 |   biorep <- rep(c("s1","s2","s3","s4"),numcells)
 96 |   length(biorep)
 97 | 
 98 |   # Numbers of cells for each of the 3 clusters per sample
 99 |   n_s1 <- p_s1*numcells[1]
100 |   n_s2 <- p_s2*numcells[2]
101 |   n_s3 <- p_s3*numcells[3]
102 |   n_s4 <- p_s4*numcells[4]
103 | 
104 |   # Assign cluster labels for 4 samples
105 |   cl_s1 <- rep(c("c0","c1","c2"),n_s1)
106 |   cl_s2 <- rep(c("c0","c1","c2"),n_s2)
107 |   cl_s3 <- rep(c("c0","c1","c2"),n_s3)
108 |   cl_s4 <- rep(c("c0","c1","c2"),n_s4)
109 | 
110 |   # Generate cell-level vector for cluster info
111 |   clust <- c(cl_s1,cl_s2,cl_s3,cl_s4)
112 |   length(clust)
113 | 
114 |   # Assume s1 and s2 belong to group 1 and s3 and s4 belong to group 2
115 |   grp <- rep(c("grp1","grp2"),c(sum(numcells[1:2]),sum(numcells[3:4])))
116 | 
117 |   propeller(clusters = clust, sample = biorep, group = grp,
118 |   robust = FALSE, trend = FALSE, transform="asin")
119 | 
120 | }
121 | \references{
122 | Smyth, G.K. (2004). Linear models and empirical Bayes methods
123 | for assessing differential expression in microarray experiments.
124 | \emph{Statistical Applications in Genetics and Molecular Biology}, Volume
125 | \bold{3}, Article 3.
126 | 
127 | Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery
128 | rate: a practical and powerful approach to multiple testing. \emph{Journal
129 | of the Royal Statistical Society, Series B}, \bold{57}, 289-300.
130 | }
131 | \seealso{
132 | \code{\link{propeller.ttest}} \code{\link{propeller.anova}} 
133 | \code{\link{lmFit}}, \code{\link{eBayes}},
134 | \code{\link{getTransformedProps}}
135 | }
136 | \author{
137 | Belinda Phipson
138 | }
139 | 


--------------------------------------------------------------------------------
/man/propeller.anova.Rd:
--------------------------------------------------------------------------------
  1 | % Generated by roxygen2: do not edit by hand
  2 | % Please edit documentation in R/propeller.anova.R
  3 | \name{propeller.anova}
  4 | \alias{propeller.anova}
  5 | \title{Performs F-tests for transformed cell type proportions}
  6 | \usage{
  7 | propeller.anova(
  8 |   prop.list = prop.list,
  9 |   design = design,
 10 |   coef = coef,
 11 |   robust = robust,
 12 |   trend = trend,
 13 |   sort = sort
 14 | )
 15 | }
 16 | \arguments{
 17 | \item{prop.list}{a list object containing two matrices:
 18 | \code{TransformedProps} and \code{Proportions}}
 19 | 
 20 | \item{design}{a design matrix with rows corresponding to samples and columns
 21 | to coefficients to be estimated}
 22 | 
 23 | \item{coef}{a vector specifying which columns of the design matrix
 24 | correspond to the groups to test}
 25 | 
 26 | \item{robust}{logical, should robust variance estimation be used. Defaults to
 27 | TRUE.}
 28 | 
 29 | \item{trend}{logical, should a trend between means and variances be accounted
 30 | for. Defaults to FALSE.}
 31 | 
 32 | \item{sort}{logical, should the output be sorted by P-value.}
 33 | }
 34 | \value{
 35 | produces a dataframe of results
 36 | }
 37 | \description{
 38 | This function is called by \code{propeller} and performs F-tests between
 39 | multiple experimental groups or conditions (> 2) on transformed cell type 
 40 | proportions.
 41 | }
 42 | \details{
 43 | In order to run this function, the user needs to run the
 44 | \code{getTransformedProps} function first. The output from
 45 | \code{getTransformedProps} is used as input. The \code{propeller.anova}
 46 | function expects that the design matrix is not in the intercept format.
 47 | This \code{coef} vector will identify the columns in the design matrix that
 48 | correspond to the groups being tested.
 49 | Note that additional confounding covariates can be accounted for as extra
 50 | columns in the design matrix, but need to come after the group-specific
 51 | columns.
 52 | 
 53 | The \code{propeller.anova} function uses the \code{limma} functions
 54 | \code{lmFit} and \code{eBayes} to extract F statistics and p-values.
 55 | This has the additional advantage that empirical Bayes shrinkage of the
 56 | variances is performed.
 57 | }
 58 | \examples{
 59 |   library(speckle)
 60 |   library(ggplot2)
 61 |   library(limma)
 62 | 
 63 |   # Make up some data
 64 | 
 65 |   # True cell type proportions for 4 samples
 66 |   p_s1 <- c(0.5,0.3,0.2)
 67 |   p_s2 <- c(0.6,0.3,0.1)
 68 |   p_s3 <- c(0.3,0.4,0.3)
 69 |   p_s4 <- c(0.4,0.3,0.3)
 70 |   p_s5 <- c(0.8,0.1,0.1)
 71 |   p_s6 <- c(0.75,0.2,0.05)
 72 | 
 73 |   # Total numbers of cells per sample
 74 |   numcells <- c(1000,1500,900,1200,1000,800)
 75 | 
 76 |   # Generate cell-level vector for sample info
 77 |   biorep <- rep(c("s1","s2","s3","s4","s5","s6"),numcells)
 78 |   length(biorep)
 79 | 
 80 |   # Numbers of cells for each of 3 clusters per sample
 81 |   n_s1 <- p_s1*numcells[1]
 82 |   n_s2 <- p_s2*numcells[2]
 83 |   n_s3 <- p_s3*numcells[3]
 84 |   n_s4 <- p_s4*numcells[4]
 85 |   n_s5 <- p_s5*numcells[5]
 86 |   n_s6 <- p_s6*numcells[6]
 87 | 
 88 |   cl_s1 <- rep(c("c0","c1","c2"),n_s1)
 89 |   cl_s2 <- rep(c("c0","c1","c2"),n_s2)
 90 |   cl_s3 <- rep(c("c0","c1","c2"),n_s3)
 91 |   cl_s4 <- rep(c("c0","c1","c2"),n_s4)
 92 |   cl_s5 <- rep(c("c0","c1","c2"),n_s5)
 93 |   cl_s6 <- rep(c("c0","c1","c2"),n_s6)
 94 | 
 95 |   # Generate cell-level vector for cluster info
 96 |   clust <- c(cl_s1,cl_s2,cl_s3,cl_s4,cl_s5,cl_s6)
 97 |   length(clust)
 98 | 
 99 |   prop.list <- getTransformedProps(clusters = clust, sample = biorep)
100 | 
101 |   # Assume s1 and s2 belong to group A, s3 and s4 belong to group B, s5 and
102 |   # s6 belong to group C
103 |   grp <- rep(c("A","B","C"), each=2)
104 | 
105 |   # Make sure design matrix does not have an intercept term
106 |   design <- model.matrix(~0+grp)
107 |   design
108 | 
109 |   propeller.anova(prop.list, design=design, coef=c(1,2,3), robust=TRUE,
110 |   trend=FALSE, sort=TRUE)
111 | 
112 | }
113 | \seealso{
114 | \code{\link{propeller}}, \code{\link{getTransformedProps}},
115 | \code{\link{lmFit}}, \code{\link{eBayes}}
116 | }
117 | \author{
118 | Belinda Phipson
119 | }
120 | 


--------------------------------------------------------------------------------
/man/propeller.ttest.Rd:
--------------------------------------------------------------------------------
  1 | % Generated by roxygen2: do not edit by hand
  2 | % Please edit documentation in R/propeller.ttest.R
  3 | \name{propeller.ttest}
  4 | \alias{propeller.ttest}
  5 | \title{Performs t-tests of transformed cell type proportions}
  6 | \usage{
  7 | propeller.ttest(
  8 |   prop.list = prop.list,
  9 |   design = design,
 10 |   contrasts = contrasts,
 11 |   robust = robust,
 12 |   trend = trend,
 13 |   sort = sort
 14 | )
 15 | }
 16 | \arguments{
 17 | \item{prop.list}{a list object containing two matrices:
 18 | \code{TransformedProps} and \code{Proportions}}
 19 | 
 20 | \item{design}{a design matrix with rows corresponding to samples and columns
 21 | to coefficients to be estimated}
 22 | 
 23 | \item{contrasts}{a vector specifying which columns of the design matrix
 24 | correspond to the two groups to test}
 25 | 
 26 | \item{robust}{logical, should robust variance estimation be used. Defaults to
 27 | TRUE.}
 28 | 
 29 | \item{trend}{logical, should a trend between means and variances be accounted
 30 | for. Defaults to FALSE.}
 31 | 
 32 | \item{sort}{logical, should the output be sorted by P-value.}
 33 | }
 34 | \value{
 35 | produces a dataframe of results
 36 | }
 37 | \description{
 38 | This function is called by \code{propeller} and performs t-tests between two
 39 | experimental groups or conditions on the transformed cell type proportions.
 40 | }
 41 | \details{
 42 | In order to run this function, the user needs to run the
 43 | \code{getTransformedProps} function first. The output from
 44 | \code{getTransformedProps} is used as input. The \code{propeller.ttest}
 45 | function expects that the design matrix is not in the intercept format
 46 | and a contrast vector needs to be supplied. This contrast vector will
 47 | identify the two groups to be tested. Note that additional confounding
 48 | covariates can be accounted for as extra columns in the design matrix after
 49 | the group-specific columns.
 50 | 
 51 | The \code{propeller.ttest} function uses the \code{limma} functions
 52 | \code{lmFit}, \code{contrasts.fit} and \code{eBayes}, which has the additional
 53 | advantage that empirical Bayes shrinkage of the variances is performed.
 54 | }
 55 | \examples{
 56 |   library(speckle)
 57 |   library(ggplot2)
 58 |   library(limma)
 59 | 
 60 |   # Make up some data
 61 | 
 62 |   # True cell type proportions for 4 samples
 63 |   p_s1 <- c(0.5,0.3,0.2)
 64 |   p_s2 <- c(0.6,0.3,0.1)
 65 |   p_s3 <- c(0.3,0.4,0.3)
 66 |   p_s4 <- c(0.4,0.3,0.3)
 67 | 
 68 |   # Total numbers of cells per sample
 69 |   numcells <- c(1000,1500,900,1200)
 70 | 
 71 |   # Generate cell-level vector for sample info
 72 |   biorep <- rep(c("s1","s2","s3","s4"),numcells)
 73 |   length(biorep)
 74 | 
 75 |   # Numbers of cells for each of 3 clusters per sample
 76 |   n_s1 <- p_s1*numcells[1]
 77 |   n_s2 <- p_s2*numcells[2]
 78 |   n_s3 <- p_s3*numcells[3]
 79 |   n_s4 <- p_s4*numcells[4]
 80 | 
 81 |   cl_s1 <- rep(c("c0","c1","c2"),n_s1)
 82 |   cl_s2 <- rep(c("c0","c1","c2"),n_s2)
 83 |   cl_s3 <- rep(c("c0","c1","c2"),n_s3)
 84 |   cl_s4 <- rep(c("c0","c1","c2"),n_s4)
 85 | 
 86 |   # Generate cell-level vector for cluster info
 87 |   clust <- c(cl_s1,cl_s2,cl_s3,cl_s4)
 88 |   length(clust)
 89 | 
 90 |   prop.list <- getTransformedProps(clusters = clust, sample = biorep)
 91 | 
 92 |   # Assume s1 and s2 belong to group A and s3 and s4 belong to group B
 93 |   grp <- rep(c("A","B"), each=2)
 94 | 
 95 |   design <- model.matrix(~0+grp)
 96 |   design
 97 | 
 98 |   # Compare Grp A to B
 99 |   contrasts <- c(1,-1)
100 | 
101 |   propeller.ttest(prop.list, design=design, contrasts=contrasts, robust=TRUE,
102 |   trend=FALSE, sort=TRUE)
103 | 
104 |   # Pretend additional sex variable exists and we want to control for it
105 |   # in the linear model
106 |   sex <- rep(c(0,1),2)
107 |   des.sex <- model.matrix(~0+grp+sex)
108 |   des.sex
109 | 
110 |   # Compare Grp A to B
111 |   cont.sex <- c(1,-1,0)
112 | 
113 |   propeller.ttest(prop.list, design=des.sex, contrasts=cont.sex, robust=TRUE,
114 |   trend=FALSE, sort=TRUE)
115 | 
116 | 
117 | }
118 | \seealso{
119 | \code{\link{propeller}}, \code{\link{getTransformedProps}},
120 | \code{\link{lmFit}}, \code{\link{contrasts.fit}}, \code{\link{eBayes}}
121 | }
122 | \author{
123 | Belinda Phipson
124 | }
125 | 


--------------------------------------------------------------------------------
/man/speckle-package.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/speckle-package.R
 3 | \docType{package}
 4 | \name{speckle-package}
 5 | \alias{speckle}
 6 | \alias{speckle-package}
 7 | \title{speckle: Statistical methods for analysing single cell RNA-seq data}
 8 | \description{
 9 | speckle contains functions for the analysis of single cell RNA-seq data.
10 | }
11 | \keyword{internal}
12 | 


--------------------------------------------------------------------------------
/man/speckle_example_data.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/speckle_example_data.R
 3 | \name{speckle_example_data}
 4 | \alias{speckle_example_data}
 5 | \title{Generate example data}
 6 | \usage{
 7 | speckle_example_data()
 8 | }
 9 | \value{
10 | a dataframe containing cell-level information for sample, group and
11 | clusters
12 | }
13 | \description{
14 | Generate example data
15 | }
16 | \examples{
17 | 
18 | speckle_example_data()
19 | 
20 | }
21 | 


--------------------------------------------------------------------------------
/vignettes/speckle.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | author: "Belinda Phipson"
  3 | title: "speckle: statistical methods for analysing single cell RNA-seq data"
  4 | date: "`r BiocStyle::doc_date()`"
  5 | package: "`r BiocStyle::pkg_ver('speckle')`"
  6 | vignette: >
  7 |   %\VignetteEncoding{UTF-8}
  8 |   %\VignetteIndexEntry{speckle: statistical methods for analysing single cell RNA-seq data}
  9 |   %\VignetteEngine{knitr::rmarkdown}
 10 | output: >
 11 |   BiocStyle::html_document
 12 | html_document:
 13 |   fig_caption: yes
 14 |   fig_retina: FALSE
 15 |   keep_md: FALSE
 16 | editor_options: 
 17 |   chunk_output_type: console
 18 | ---
 19 | 
 20 | ```{r setup, include=FALSE}
 21 | knitr::opts_chunk$set(
 22 |     echo = TRUE,
 23 |     message = FALSE,
 24 |     warning = FALSE
 25 | )
 26 | ```
 27 | 
 28 | # Introduction
 29 | 
 30 | The `r BiocStyle::Biocpkg("speckle")` package contains functions to analyse 
 31 | differences in cell type proportions in single cell RNA-seq data. As our 
 32 | research into specialised analyses of single cell data continues we anticipate 
 33 | that the package will be updated with new functions.
 34 | 
 35 | The analysis of single cell RNA-seq data consists of a large number of steps, 
 36 | which can be iterative and also depend on the research question. There are many 
 37 | R packages that can do some or most of these steps. The analysis steps are 
 38 | described here briefly. 
 39 | 
 40 | Once the sequencing data has been summarised into counts over genes, quality 
 41 | control is performed to remove poor quality cells. Poor quality cells are often 
 42 | characterised as having very low total counts (library size) and very few genes 
 43 | detected. Lowly expressed and uninformative genes are filtered out, followed by 
 44 | appropriate normalisation. Dimensionality reduction and clustering of the cells
 45 | is then performed. Cells that have similar transcriptional profiles cluster 
 46 | together, and these clusters (hopefully) correspond to something biologically 
 47 | relevant, such as different cell types. Differential expression between each 
 48 | cluster compared to all other clusters can highlight genes that are more highly
 49 | expressed in each cluster. These marker genes help to determine the cell type 
 50 | each cluster corresponds to. Cell type identification is a process that often 
 51 | uses marker genes as well as a list of curated genes that are known to be 
 52 | expressed in each cell type. It is always helpful to visualise the data in a lot
 53 | of different ways to aid in interpretation of the clusters using tSNE/UMAP 
 54 | plots, clustering trees and heatmaps of known marker genes.
 55 | 
 56 | # Installation
 57 | 
 58 | ```{r eval=FALSE}
 59 | library(devtools)
 60 | install_github("/Oshlack/speckle")
 61 | ```
 62 | 
 63 | # Finding significant differences in cell type proportions using propeller
 64 | 
 65 | In order to determine whether there are statistically significant compositional 
 66 | differences between groups, there must be some form of biological replication in
 67 | the experiment. This is so that we can estimate the variability of the cell type
 68 | proportion estimates for each group. A classical statistical test for 
 69 | differences between two proportions is typically very sensitive to small changes
 70 | and will almost always yield a significant p-value. Hence `propeller` is only 
 71 | suitable to use in single cell experiments where there are multiple groups and 
 72 | multiple biological replicates in at least one of the groups. The absolute 
 73 | minimum sample size is 2 in one group and 1 in the other group/s. Variance 
 74 | estimates are obtained from the group with more than 1 biological replicate 
 75 | which assumes that the cell type proportion variance estimates are similar 
 76 | between experimental conditions.
 77 | 
 78 | The `propeller` test is performed after initial analysis of the single cell data
 79 | has been done, i.e. after clustering and cell type assignment. The `propeller` 
 80 | function can take a `SingleCellExperiment` or `Seurat` object and extract the 
 81 | necessary information from the metadata. The basic model for `propeller` is that
 82 | the cell type proportions for each sample are estimated based on the clustering 
 83 | information provided by the user or extracted from the relevant slots in the 
 84 | data objects. The proportions are then transformed using either an arcsin square
 85 | root transformation, or logit transformation. For 
 86 | each cell type $i$, we fit a linear model with group as the explanatory variable
 87 | using functions from the R Bioconductor package `r BiocStyle::Biocpkg("limma")`.
 88 | Using `r BiocStyle::Biocpkg("limma")` to obtain p-values has the added benefit 
 89 | of performing empirical Bayes shrinkage of the variances. For every cell type 
 90 | we obtain a p-value that indicates whether that cell type proportion is 
 91 | statistically significantly different between two (or more) groups.
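
As a rough illustration of the transformation step described above, the sketch below computes per-sample proportions from a small made-up counts table and applies both transformations in base R. The counts are invented, and the 0.5 offset added before the logit is an assumption made here so that proportions of 0 or 1 stay finite; it is not necessarily what `getTransformedProps` does internally.

```r
# Toy counts: rows are cell types, columns are samples (made-up numbers)
toy_counts <- matrix(c(500, 300, 200,
                       600, 300, 100,
                       300, 400, 300,
                       400, 300, 300),
                     nrow = 3,
                     dimnames = list(c("c0", "c1", "c2"),
                                     c("s1", "s2", "s3", "s4")))

# Proportion of each cell type within each sample (columns sum to 1)
toy_props <- prop.table(toy_counts, margin = 2)

# Arcsin square root transformation of the proportions
toy_asin <- asin(sqrt(toy_props))

# Logit transformation; the 0.5 offset is an illustrative assumption
offset_counts <- toy_counts + 0.5
offset_props <- prop.table(offset_counts, margin = 2)
toy_logit <- log(offset_props / (1 - offset_props))
```

Each row of `toy_asin` or `toy_logit` is then the response for one cell type in a linear model with group as the explanatory variable.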
 92 | 
 93 | # Load the libraries
 94 | 
 95 | ```{r}
 96 | library(speckle)
 97 | library(SingleCellExperiment)
 98 | library(CellBench)
 99 | library(limma)
100 | library(ggplot2)
101 | library(scater)
102 | library(patchwork)
103 | library(edgeR)
104 | library(AnnotationDbi)
105 | library(org.Hs.eg.db)
106 | ```
107 | 
108 | 
109 | # Loading data into R
110 | 
111 | We are using single cell data from the `r BiocStyle::Biocpkg("CellBench")` 
112 | package to illustrate how `propeller` works. This is an artificial dataset that 
113 | is made up of an equal mixture of 3 different cell lines. There are three 
114 | datasets corresponding to three different technologies: 10x genomics, CelSeq and
115 | DropSeq.
116 | 
117 | ```{r}
118 | sc_data <- load_sc_data()
119 | ```
120 | 
121 | The way that `propeller` is designed to be used is in the context of a designed 
122 | experiment where there are multiple biological replicates and multiple groups. 
123 | Comparing cell type proportions without biological replication should be done 
124 | with caution as there will be a large degree of variability in the cell type 
125 | proportions between samples due to technical factors (cell capture bias, 
126 | sampling, clustering errors), as well as biological variability. The 
127 | `r BiocStyle::Biocpkg("CellBench")` dataset does not have biological 
128 | replication, so we will create several artificial biological replicates by 
129 | bootstrapping the data. Bootstrapping has the advantage that it induces 
130 | variability between bootstrap samples by sampling with replacement. Here we will
131 | treat the three technologies as the groups, and create artifical biological 
132 | replicates within each group. Note that bootstrapping only induces sampling 
133 | variability between our biological replicates, which will almost certainly be 
134 | much smaller than biological variability we would expect to see in a real 
135 | dataset.
136 | 
137 | The three single cell experiment objects in `sc_data` all have differing numbers
138 | of genes. The first step is to find all the common genes between all three 
139 | experiments in order to create one large dataset.
140 | 
141 | ```{r}
142 | commongenes1 <- rownames(sc_data$sc_dropseq)[rownames(sc_data$sc_dropseq) %in% 
143 |                                                rownames(sc_data$sc_celseq)]
144 | commongenes2 <-  commongenes1[commongenes1 %in% rownames(sc_data$sc_10x)]
145 | 
146 | sce_10x <- sc_data$sc_10x[commongenes2,]
147 | sce_celseq <- sc_data$sc_celseq[commongenes2,] 
148 | sce_dropseq <- sc_data$sc_dropseq[commongenes2,] 
149 | 
150 | dim(sce_10x)
151 | dim(sce_celseq)
152 | dim(sce_dropseq)
153 | 
154 | table(rownames(sce_10x) == rownames(sce_celseq))
155 | table(rownames(sce_10x) == rownames(sce_dropseq))
156 | ```
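
The same intersection can be written more compactly with `Reduce` and `intersect`. The sketch below uses made-up gene vectors rather than the CellBench objects, so the names `genes_10x`, `genes_celseq` and `genes_dropseq` are purely illustrative.

```r
# Made-up gene names standing in for the row names of each experiment
genes_10x <- c("XIST", "ACTB", "GAPDH", "CD3E")
genes_celseq <- c("ACTB", "GAPDH", "MS4A1", "XIST")
genes_dropseq <- c("GAPDH", "XIST", "ACTB", "LYZ")

# Intersect pairwise across the list; the order follows the first vector
common <- Reduce(intersect, list(genes_dropseq, genes_celseq, genes_10x))
common
```

With the real data, the list would hold the `rownames()` of the three `SingleCellExperiment` objects.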
157 | 
158 | # Bootstrap additional samples
159 | 
160 | This dataset does not have any biological replicates, so we will bootstrap 
161 | additional samples and pretend that they are biological replicates. 
162 | Bootstrapping won't replicate true biological variation between samples, but we 
163 | will ignore that for the purpose of demonstrating how `propeller` works. Note 
164 | that we don't need to simulate gene expression measurements; `propeller` only 
165 | uses cluster information, hence we simply bootstrap the column indices of the 
166 | single cell count matrices.
167 | 
168 | ```{r}
169 | i.10x <- seq_len(ncol(sce_10x))
170 | i.celseq <- seq_len(ncol(sce_celseq))
171 | i.dropseq <- seq_len(ncol(sce_dropseq))
172 | 
173 | set.seed(10)
174 | boot.10x <- sample(i.10x, replace=TRUE)
175 | boot.celseq <- sample(i.celseq, replace=TRUE)
176 | boot.dropseq <- sample(i.dropseq, replace=TRUE)
177 | 
178 | sce_10x_rep2 <- sce_10x[,boot.10x]
179 | sce_celseq_rep2 <- sce_celseq[,boot.celseq]
180 | sce_dropseq_rep2 <- sce_dropseq[,boot.dropseq]
181 | ```
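
To see what resampling the column indices does to cell type composition, here is a minimal sketch on made-up cluster labels. The labels `lineA`, `lineB` and `lineC` and the cell numbers are assumptions for illustration only.

```r
set.seed(1)

# Made-up cluster labels for 300 cells, 100 per cell line
boot_labels <- factor(rep(c("lineA", "lineB", "lineC"), each = 100))

# Resample cell indices with replacement, as done for the real data above
boot_index <- sample(seq_along(boot_labels), replace = TRUE)

# Cell type proportions in the original and bootstrapped "samples"
prop_original <- prop.table(table(boot_labels))
prop_boot <- prop.table(table(boot_labels[boot_index]))
rbind(original = prop_original, bootstrap = prop_boot)
```

The bootstrapped proportions differ slightly from the original ones, which is the sampling variability that the artificial replicates carry.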
182 | 
183 | # Combine all SingleCellExperiment objects
184 | 
185 | The `SingleCellExperiment` objects don't combine very easily, so I will create a
186 | new object manually, and retain only the information needed to run `propeller`.
187 | 
188 | ```{r}
189 | sample <- rep(c("S1","S2","S3","S4","S5","S6"), 
190 |               c(ncol(sce_10x),ncol(sce_10x_rep2),ncol(sce_celseq),
191 |                 ncol(sce_celseq_rep2), 
192 |                 ncol(sce_dropseq),ncol(sce_dropseq_rep2)))
193 | cluster <- c(sce_10x$cell_line,sce_10x_rep2$cell_line,sce_celseq$cell_line,
194 |              sce_celseq_rep2$cell_line,sce_dropseq$cell_line,
195 |              sce_dropseq_rep2$cell_line)
196 | group <- rep(c("10x","celseq","dropseq"),
197 |              c(2*ncol(sce_10x),2*ncol(sce_celseq),2*ncol(sce_dropseq)))
198 | 
199 | allcounts <- cbind(counts(sce_10x),counts(sce_10x_rep2), 
200 |                    counts(sce_celseq), counts(sce_celseq_rep2),
201 |                    counts(sce_dropseq), counts(sce_dropseq_rep2))
202 | 
203 | sce_all <- SingleCellExperiment(assays = list(counts = allcounts))
204 | sce_all$sample <- sample
205 | sce_all$group <- group
206 | sce_all$cluster <- cluster
207 | ```
208 | 
209 | # Visualise the data
210 | 
211 | Here I am going to use the Bioconductor package `r BiocStyle::Biocpkg("scater")`
212 | to visualise the data. The `r BiocStyle::Biocpkg("scater")` vignette goes 
213 | quite deeply into quality 
214 | control of the cells and the kinds of QC plots we like to look at. Here we will 
215 | simply log-normalise the gene expression counts, perform dimensionality 
216 | reduction (PCA and UMAP) and generate PCA and UMAP plots to visualise the 
217 | relationships between the cells.
218 | 
219 | ```{r}
220 | sce_all <- scuttle::logNormCounts(sce_all)
221 | sce_all <- scater::runPCA(sce_all)
222 | sce_all <- scater::runUMAP(sce_all)
223 | ```
224 | 
225 | Plot PC1 vs PC2 colouring by cell line and technology:
226 | 
227 | ```{r, fig.width=12, fig.height=6}
228 | pca1 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "cluster") +
229 |   ggtitle("Cell line")
230 | pca2 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "group") +
231 |   ggtitle("Technology")
232 | pca1 + pca2
233 | ```
234 | 
235 | Plot UMAP highlighting cell line and technology:
236 | 
237 | ```{r, fig.width=12, fig.height=6}
238 | umap1 <- scater::plotReducedDim(sce_all, dimred = "UMAP", 
239 |                                 colour_by = "cluster") + 
240 |   ggtitle("Cell line")
241 | umap2 <- scater::plotReducedDim(sce_all, dimred = "UMAP", colour_by = "group") +
242 |   ggtitle("Technology")
243 | umap1 + umap2
244 | ```
245 | 
246 | For this dataset UMAP is a little bit of overkill; the PCA plots show the 
247 | relationships between the cells quite well. PC1 separates cells based on 
248 | technology, and PC2 separates cells based on the cell line (clusters). From the 
249 | PCA plots we can see that 10x is quite different to CelSeq and DropSeq, and the 
250 | H2228 cell line is quite different to the remaining 2 cell lines.
251 | 
252 | # Test for differences in cell line proportions in the three technologies
253 | 
254 | In order to demonstrate `propeller` I will assume that the cell line information
255 | corresponds to clusters and all the analysis steps have been performed. Here we
256 | are interested in testing whether there are compositional differences between 
257 | the three technologies: 10x, CelSeq and DropSeq. Since there are more than 2 
258 | groups, `propeller` will perform an ANOVA to determine whether there is a 
259 | significant shift in the cell type proportions between these three groups.
260 | 
261 | The `propeller` function can take a `SingleCellExperiment` object or `Seurat` 
262 | object as input and extract the three necessary pieces of information from the 
263 | cell information stored in `colData`. The three essential pieces of information 
264 | are
265 | 
266 | * cluster
267 | * sample
268 | * group
269 | 
270 | If these arguments are not explicitly passed to the `propeller` function, then 
271 | these are extracted from the `SingleCellExperiment` or `Seurat` object. Upper 
272 | or lower case is acceptable, but 
273 | the variables need to be named exactly as stated in the list above. For a 
274 | `Seurat` object, the cluster information is extracted from `Idents(x)`.
275 | 
276 | By default, `propeller` performs the logit transformation:
277 | ```{r}
278 | # Perform logit transformation
279 | propeller(sce_all)
280 | ```
281 | 
282 | An alternative variance stabilising transformation is the arcsin square root
283 | transformation. 
284 | 
285 | ```{r}
286 | # Perform arcsin square root transformation
287 | propeller(sce_all, transform="asin")
288 | ```
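
To make the two transformations concrete, here is a minimal base R sketch on a made-up table of counts. It is not how `speckle` computes them internally: the object names (`toy_counts`, `toy_props`, `toy_logit`, `toy_asin`) are hypothetical, and the package's own handling of proportions equal to 0 or 1 is not reproduced here.

```r
# Toy counts (made up): rows are clusters, columns are samples
toy_counts <- matrix(c(30, 50, 20,
                       10, 60, 30),
                     nrow = 3,
                     dimnames = list(c("A", "B", "C"), c("S1", "S2")))

# Per-sample proportions: each column sums to 1
toy_props <- sweep(toy_counts, 2, colSums(toy_counts), "/")

# Logit transform. Proportions of exactly 0 or 1 would give -Inf or Inf here,
# so a real implementation needs some form of offset (not shown).
toy_logit <- log(toy_props / (1 - toy_props))

# Arcsin square root transform, defined for all proportions in [0, 1]
toy_asin <- asin(sqrt(toy_props))
```

Both transformations stretch the scale near 0 and 1, where the variance of a raw proportion is smallest, which is what makes them variance stabilising.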
289 | 
290 | The results from using the two different transforms are a little bit different, 
291 | with the H1975 cell line being statistically significant using the arcsin 
292 | square root transform, and not significant after using the logit transform.
293 | 
294 | Another option for running `propeller` is for the user to supply the cluster, 
295 | sample and group information explicitly to the `propeller` function.
296 | 
297 | ```{r}
298 | propeller(clusters=sce_all$cluster, sample=sce_all$sample, group=sce_all$group)
299 | ```
300 | 
301 | The cell lines were mixed together in roughly equal proportions (~0.33) and 
302 | hence we don't expect to see significant differences in their proportions 
303 | between the three technologies. However, because bootstrapping the samples 
304 | doesn't incorporate enough variability to mimic true biological variability, 
305 | we can see that the H1975 cluster looks significantly different between the 
306 | three technologies. The proportion of this cell line is closer to 0.4 for 
307 | CelSeq and DropSeq, and 0.34 for the 10x data.
308 | 
309 | # Visualise the results
310 | 
311 | In the `r BiocStyle::Biocpkg("speckle")` package there is a plotting function 
312 | `plotCellTypeProps` that takes a `Seurat` or `SingleCellExperiment` object, 
313 | extracts sample and cluster information and outputs a barplot of cell type 
314 | proportions between the samples. The user also has the option of supplying the
315 | cluster and sample cell information instead of an R object. The output is a 
316 | `ggplot2` object that the user can then manipulate however they please.
317 | 
318 | ```{r,fig.height=4,fig.width=7}
319 | plotCellTypeProps(sce_all)
320 | ```
321 | 
322 | Alternatively, we can obtain the cell type proportions and transformed 
323 | proportions directly by running the `getTransformedProps` function which takes 
324 | the cluster and sample information as input. The output from 
325 | `getTransformedProps` is a list with the cell type counts, transformed 
326 | proportions and proportions as elements.
327 | 
328 | ```{r,fig.height=5,fig.width=7}
329 | props <- getTransformedProps(sce_all$cluster, sce_all$sample, transform="logit")
330 | barplot(props$Proportions, col = c("orange","purple","dark green"),legend=TRUE, 
331 |         ylab="Proportions")
332 | ```
333 | 
334 | Call me old-school, but I still like looking at stripcharts to visualise results
335 | and see whether the significant p-values make sense.
336 | 
337 | ```{r,fig.height=4,fig.width=10}
338 | par(mfrow=c(1,3))
339 | for(i in 1:3){
340 | stripchart(props$Proportions[i,]~rep(c("10x","celseq","dropseq"),each=2),
341 |            vertical=TRUE, pch=16, method="jitter",
342 |            col = c("orange","purple","dark green"),cex=2, ylab="Proportions")
343 | title(rownames(props$Proportions)[i])
344 | }
345 | ```
346 | 
347 | If you are interested in seeing which models best fit the data in terms of the
348 | cell type variances, there are two plotting functions that can do this: 
349 | `plotCellTypeMeanVar` and `plotCellTypePropsMeanVar`. For this particular 
350 | dataset it isn't very informative because there are only three cell "types" 
351 | and no biological variability.
352 | 
353 | ```{r}
354 | par(mfrow=c(1,1))
355 | plotCellTypeMeanVar(props$Counts)
356 | plotCellTypePropsMeanVar(props$Counts)
357 | ```
358 | 
359 | 
360 | # Fitting linear models using the transformed proportions directly
361 | 
362 | If you are like me, you won't feel very comfortable with a black-box approach 
363 | where one function simply spits out a table of results. If you would like to 
364 | have more control over your linear model and include extra covariates then you 
365 | can fit a linear model in a more direct way using the transformed proportions 
366 | that can be obtained by running the `getTransformedProps` function.
367 | 
368 | We have already obtained the proportions and transformed proportions when we ran
369 | the `getTransformedProps` function above. This function outputs a list object 
370 | with three elements: `Counts`, `TransformedProps` and `Proportions`. These are 
371 | all matrices with clusters/cell types in the rows and samples in the columns.
372 | 
373 | ```{r}
374 | names(props)
375 | 
376 | props$TransformedProps
377 | ```
378 | 
379 | First we need to set up our sample information in much the same way we would if 
380 | we were analysing bulk RNA-seq data. We can pretend that we have pairing 
381 | information which corresponds to our original vs bootstrapped samples to make 
382 | our model a little more complicated for demonstration purposes. 
383 | 
384 | ```{r}
385 | group <- rep(c("10x","celseq","dropseq"),each=2)
386 | pair <- rep(c(1,2),3)
387 | data.frame(group,pair)
388 | ```
389 | 
390 | We can set up a design matrix taking into account group and pairing information.
391 | Please note that the way that `propeller` has been designed is such that the 
392 | group information is *always* first in the design matrix specification, and 
393 | there is NO intercept. If you are new to design matrices and linear modelling, I
394 | would highly recommend reading the `r BiocStyle::Biocpkg("limma")` manual, which
395 | is incredibly extensive and covers a variety of different experimental set ups.
396 | 
397 | ```{r}
398 | design <- model.matrix(~ 0 + group + pair)
399 | design
400 | ```
401 | 
402 | In our example, we have three groups, 10x, CelSeq and DropSeq. In order to 
403 | perform an ANOVA to test for cell type composition differences between these
404 | 3 technologies, we can use the `propeller.anova` function. The `coef` argument
405 | tells the function which columns of the design matrix correspond to the groups 
406 | we are interested in testing. Here we are interested in the first three columns.
407 | 
408 | ```{r}
409 | propeller.anova(prop.list=props, design=design, coef = c(1,2,3), 
410 |                 robust=TRUE, trend=FALSE, sort=TRUE)
411 | ```
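
If it is unclear which columns to pass to `coef`, one option is to look the group columns up by name. The sketch below rebuilds a similar design with hypothetical variable names (`toy_group`, `toy_pair`, `toy_design`) and, as an assumption that differs from the code above, codes the pairing as a factor; with only two pairs this spans the same model as the numeric coding.

```r
# Hypothetical sample annotation mirroring the six samples in this example
toy_group <- rep(c("10x", "celseq", "dropseq"), each = 2)
toy_pair <- factor(rep(c(1, 2), 3))

# No intercept, group first, then the pairing term
toy_design <- model.matrix(~ 0 + toy_group + toy_pair)
colnames(toy_design)

# Indices of the group columns, i.e. the candidate values for `coef`
group_cols <- which(startsWith(colnames(toy_design), "toy_group"))
group_cols
```

With more than two pairs or blocks, coding the blocking variable as a factor matters, because a numeric variable would be fitted as a single linear trend.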
412 | 
413 | Note that the p-values are smaller here than before because we have specified
414 | a pairing vector that states which samples were bootstrapped and which are the 
415 | original samples.
416 | 
417 | If we were interested in testing only 10x versus DropSeq we could alternatively
418 | use the `propeller.ttest` function and specify a contrast that tests for this
419 | comparison with our design matrix.
420 | 
421 | ```{r}
422 | design
423 | mycontr <- makeContrasts(group10x-groupdropseq, levels=design)
424 | propeller.ttest(props, design, contrasts = mycontr, robust=TRUE, trend=FALSE, 
425 |                 sort=TRUE)
426 | ```
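
To show what such a contrast estimates, the sketch below writes the same comparison as a plain numeric vector and applies it to ordinary least squares coefficients in base R. Everything here is hypothetical (`toy_tech`, `toy_rep`, `toy_y` and the values in `toy_y` are made up), and it covers only the point estimate, not the moderated statistics that `limma` computes.

```r
# Hypothetical annotation and a made-up response for one cell type
toy_tech <- rep(c("10x", "celseq", "dropseq"), each = 2)
toy_rep <- rep(c(1, 2), 3)
toy_y <- c(0.30, 0.32, 0.40, 0.41, 0.39, 0.42)

toy_des <- model.matrix(~ 0 + toy_tech + toy_rep)

# Contrast "10x minus dropseq": +1 and -1 on the two group columns, 0 elsewhere
toy_contr <- c(1, 0, -1, 0)
names(toy_contr) <- colnames(toy_des)

# Ordinary least squares coefficients, then the contrast estimate
toy_beta <- lm.fit(toy_des, toy_y)$coefficients
toy_est <- sum(toy_contr * toy_beta)
toy_est
```

Because this toy design is balanced, the estimate equals the difference between the 10x and dropseq group means of `toy_y`.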
427 | 
428 | Finally, note that the `robust` and `trend` parameters are parameters for the 
429 | `eBayes` function in `r BiocStyle::Biocpkg("limma")`. When `robust = TRUE`, 
430 | robust empirical Bayes shrinkage of the variances is performed which mitigates 
431 | the effects of outlying observations. We set `trend = FALSE` as we don't expect 
432 | a mean-variance trend after performing our variance stabilising transformation. 
433 | There may also be an error when `trend` is set to TRUE because there are often 
434 | not enough data points to estimate the trend.
435 | 
436 | # More complex (or just different) experimental designs
437 | 
438 | ## Fitting a continuous variable rather than groups
439 | 
440 | Let us assume that we expect that the different technologies have a meaningful 
441 | ordering to them, and we would like to find the cell types that are increasing 
442 | or decreasing along this trend. In more complex scenarios beyond group 
443 | comparisons I would recommend taking the transformed proportions from the 
444 | `getTransformedProps` function and using the linear model fitting functions from 
445 | the `r BiocStyle::Biocpkg("limma")` package directly.
446 | 
447 | Let us assume that the ordering of the technologies is 10x -> celseq -> dropseq. 
448 | Then we can recode them 1, 2, 3 and treat the technologies as a
449 | continuous variable. Obviously this scenario doesn't make much sense 
450 | biologically, but we will continue for demonstration purposes.
451 | 
452 | ```{r}
453 | group
454 | dose <- rep(c(1,2,3), each=2) 
455 | 
456 | des.dose <- model.matrix(~dose)
457 | des.dose
458 | 
459 | fit <- lmFit(props$TransformedProps,des.dose)
460 | fit <- eBayes(fit, robust=TRUE)
461 | topTable(fit)
462 | ```
463 | 
464 | Here the log fold changes are reported on the transformed data, so they are 
465 | not as easy to interpret directly. A positive logFC indicates that the cell
466 | type proportions are increasing (for example for H1975), and a negative
467 | logFC indicates that the proportions are decreasing across the ordered 
468 | technologies 10x -> celseq -> dropseq.
469 | 
470 | You can get the estimates from the model on the proportions directly by fitting
471 | the same model to the proportions. Here the `logFC` is the slope of the trend 
472 | line on the proportions, and the `AveExpr` is the average of the proportions 
473 | across all technologies.
474 | 
475 | ```{r}
476 | fit.prop <- lmFit(props$Proportions,des.dose)
477 | fit.prop <- eBayes(fit.prop, robust=TRUE)
478 | topTable(fit.prop)
479 | ```
480 | 
481 | You could plot the continuous variable `dose` against the proportions and add 
482 | trend lines, for example.
483 | 
484 | ```{r,fig.height=4,fig.width=10}
485 | fit.prop$coefficients
486 | 
487 | par(mfrow=c(1,3))
488 | for(i in 1:3){
489 |   plot(dose, props$Proportions[i,], main = rownames(props$Proportions)[i], 
490 |        pch=16, cex=2, ylab="Proportions", cex.lab=1.5, cex.axis=1.5,
491 |        cex.main=2)
492 |   abline(a=fit.prop$coefficients[i,1], b=fit.prop$coefficients[i,2], col=4, 
493 |          lwd=2)
494 | }
495 | ```
496 | 
497 | What I recommend in this instance is using the p-values from the analysis on the
498 | transformed data, and the reported statistics (i.e. the coefficients from the 
499 | model) obtained from the analysis on the proportions for visualisation purposes. 
500 | 
501 | ## Including random effects
502 | 
503 | If you have a random effect that you would like to account for in your analysis, 
504 | for example repeated measures on the same individual, then you can use the
505 | `duplicateCorrelation` function from the `r BiocStyle::Biocpkg("limma")` package.
506 | 
507 | For illustration purposes, let us assume that `pair` indicates samples taken 
508 | from the same individual (or they could represent technical replicates), and we 
509 | would like to account for this in our analysis 
510 | using a random effect. Again, we fit our models on the transformed proportions
511 | in order to obtain the p-values.
512 | 
513 | We will formulate the design matrix with an intercept for this example, and test
514 | the differences in technologies relative to 10x. The `block` parameter will be
515 | the `pair` variable. Note that the design matrix now does not include `pair` as 
516 | a fixed effect.
517 | 
518 | ```{r}
519 | des.tech <- model.matrix(~group)
520 | 
521 | dupcor <- duplicateCorrelation(props$TransformedProps, design=des.tech,
522 |                                block=pair)
523 | dupcor
524 | ```
525 | 
526 | The consensus correlation is quite high (`r dupcor$consensus.correlation`), 
527 | which we expect because we bootstrapped these additional samples.
528 | 
529 | ```{r}
530 | # Fitting the linear model accounting for pair as a random effect
531 | fit1 <- lmFit(props$TransformedProps, design=des.tech, block=pair, 
532 |               correlation=dupcor$consensus.correlation)
533 | fit1 <- eBayes(fit1)
534 | summary(decideTests(fit1))
535 | 
536 | # Differences between celseq vs 10x
537 | topTable(fit1,coef=2)
538 | 
539 | # Differences between dropseq vs 10x
540 | topTable(fit1, coef=3)
541 | ```
542 | 
543 | For celseq vs 10x, H1975 and H2228 are significantly different, with a greater
544 | proportion of H1975
545 | cells detected in celseq, and fewer H2228 cells. For dropseq vs 10x, there is a 
546 | higher proportion of H1975 cells.
547 | 
548 | If you want to do an ANOVA between the three groups:
549 | ```{r}
550 | topTable(fit1, coef=2:3)
551 | ```
552 | 
553 | Generally, you can perform any analysis on the transformed proportions that you
554 | would normally do when using limma (i.e. on roughly normally distributed data). 
555 | For more complex random effects models with 2 or more random effects, you can 
556 | use the `r BiocStyle::CRANpkg("lme4")` package.
557 | 
558 | 
559 | # Tips for clustering
560 | 
561 | The experimental groups are likely to contribute large sources of variation in 
562 | the data. In the `r BiocStyle::Biocpkg("CellBench")` data the technology effect 
563 | is larger than the cell line effect. In order to cluster the data to produce 
564 | meaningful cell types that will then feed into meaningful tests for proportions, 
565 | the cell types should be represented in as many samples as possible. Users 
566 | should consider using integration techniques on their single cell data 
567 | prior to clustering, integrating on biological sample or perhaps experimental 
568 | group. See methods such as Harmony, Liger and Seurat's integration technique 
569 | for more information.
570 | 
571 | Cell type label assignment should not be too refined such that every sample has
572 | many unique cell types. The `propeller` function can handle proportions of 0 and
573 | 1 without breaking, but it is not very meaningful if every cell type difference
574 | is statistically significant. Consider testing cell type categories that are 
575 | broader for more meaningful results, perhaps by combining clusters that are 
576 | highly similar. The refined clusters can always be explored in terms of gene 
577 | expression differences later on. 
578 | 
579 | It may be appropriate to perform cell type assignment using classification 
580 | methods rather than clustering. This allows 
581 | the user to classify cells into known cell types, but you may run the risk of 
582 | missing novel cell types.
583 | A combination of approaches may be necessary depending on the dataset. 
584 | Good luck. The more heterogeneous the dataset, the trickier this becomes.
585 | 
586 | # See also
587 | 
588 | Another approach is to model the counts directly using statistical models that 
589 | can take into account biological variability, such as negative binomial. 
590 | You can read the 
591 | [OSCA book chapter](http://bioconductor.org/books/release/OSCA/multi-sample-comparisons.html)
592 | on how to use the `r BiocStyle::Biocpkg("edgeR")` package to do this.
593 | 
594 | # Classifying male and female cells from scRNA-seq data
595 | 
596 | A common quality control check in bulk RNA-seq is to check the sex of the 
597 | samples. The simplest method to do this is to check expression of *XIST*, which 
598 | is the gene responsible for X-inactivation. It is highly expressed in females, 
599 | and not expressed in males. In experiments where the sex of the samples has not 
600 | been recorded, the variation due to sex can often be captured by the top 
601 | principal component in an MDS or PCA plot. It is important to know if the 
602 | samples are a mix of males and females and to take this information into account 
603 | in downstream analysis.
604 | 
605 | It turns out that for single cell data, it is not a simple matter to classify
606 | cells as male or female. The main reason for this is that the cells are
607 | much more lowly sequenced compared to bulk RNA-seq samples resulting in low or 
608 | zero counts for many genes, including *XIST* and other X- and Y-chromosome 
609 | genes. Thus simply trying to classify cells as male or female based on observed 
610 | counts for *XIST* leaves many cells unable to be classified.
611 | 
612 | There are a few reasons for wanting to classify male and female cells. First, it
613 | could form part of the analysis assessing quality of the cells, and if sex is 
614 | not information that is easily available, it is an additional variable we can 
615 | predict from the gene expression. This may then inform further analysis of the 
616 | data, by allowing us to take sex into account in the analysis.
617 | 
618 | We have built a classifier to predict the sex of each cell based on logistic 
619 | regression for human and mouse cells. The input is the matrix of counts where
620 | the genes are the rows and the cells are the columns. The row names of the counts
621 | matrix need to be gene symbols. The `allcounts` data object contains the counts
622 | for all the cells used in the `propeller` function, but the row names are ENSEMBL
623 | IDs. The first step is thus to convert the ENSEMBL IDs to gene symbols.
624 | 
625 | ```{r}
626 | allcounts[1:5,1:5]
627 | nc <- normCounts(allcounts, log=TRUE)
628 | avgexp <- rowMeans(nc)
629 | o <- order(avgexp, decreasing = TRUE)
630 | allcounts2 <- allcounts[o,]
631 | allcounts2[1:5,1:5]
632 | ann <- AnnotationDbi::select(org.Hs.eg.db, keys=rownames(allcounts2),
633 |               columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL")
634 | m <- match(rownames(allcounts2), ann$ENSEMBL)
635 | symbol <- ann$SYMBOL[m]
636 | table(duplicated(symbol))
637 | allcounts2 <- allcounts2[!duplicated(symbol) & !is.na(symbol),]
638 | 
639 | rownames(allcounts2) <- symbol[!duplicated(symbol) & !is.na(symbol)]
640 | 
641 | table(duplicated(colnames(allcounts2)))
642 | colnames(allcounts2) <- paste(colnames(allcounts2), seq_len(ncol(allcounts2)), sep=".")
643 | table(duplicated(rownames(allcounts2)))
644 | ```
645 | 
646 | Now that the counts matrix is in the correct format we can call the 
647 | `classifySex` function.
648 | 
649 | ```{r}
650 | sex <- classifySex(allcounts2, genome="Hs", qc=FALSE)
651 | table(sex$prediction)
652 | ```
653 | 
654 | The cell lines were all derived from females, so the sex of the cells is 
655 | correctly classified as female.
656 | 
657 | 
658 | 
659 | # Session Info
660 | 
661 | ```{r}
662 | sessionInfo()
663 | ```
664 | 
665 | 
666 | 
667 | 
668 | 


--------------------------------------------------------------------------------