├── DESCRIPTION ├── LICENSE.md ├── NAMESPACE ├── NEWS.md ├── R ├── classifySex.R ├── estimateBetaParam.R ├── estimateBetaParamsFromCounts.R ├── getTransformedProps.R ├── ggplotColors.R ├── normCounts.R ├── normSca.R ├── plotCellTypeMeanVar.R ├── plotCellTypeProps.R ├── plotCellTypePropsMeanVar.R ├── preprocess.R ├── propeller.R ├── propeller.anova.R ├── propeller.ttest.R ├── speckle-package.R ├── speckle_example_data.R └── sysdata.rda ├── README.md ├── man ├── classifySex.Rd ├── dot-extractSCE.Rd ├── dot-extractSeurat.Rd ├── estimateBetaParam.Rd ├── estimateBetaParamsFromCounts.Rd ├── getTransformedProps.Rd ├── ggplotColors.Rd ├── normCounts.Rd ├── normSca.Rd ├── plotCellTypeMeanVar.Rd ├── plotCellTypeProps.Rd ├── plotCellTypePropsMeanVar.Rd ├── preprocess.Rd ├── propeller.Rd ├── propeller.anova.Rd ├── propeller.ttest.Rd ├── speckle-package.Rd └── speckle_example_data.Rd └── vignettes └── speckle.Rmd /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: speckle 2 | Type: Package 3 | Title: Statistical methods for analysing single cell RNA-seq data 4 | Version: 0.0.3 5 | Date: 2020-12-02 6 | Author: Belinda Phipson 7 | Maintainer: Belinda Phipson 8 | Depends: R (>= 3.6.0) 9 | Imports: limma, edgeR, SingleCellExperiment, Seurat, statmod, ggplot2, methods, caret, scuttle, stringr, AnnotationDbi, org.Hs.eg.db, org.Mm.eg.db 10 | VignetteBuilder: knitr 11 | Suggests: BiocStyle, knitr, rmarkdown, CellBench, scater, patchwork 12 | Description: speckle contains functions for the analysis of single cell RNA-seq data. 
13 | License: GPL-3 14 | biocViews: SingleCell, RNASeq, Regression, GeneExpression 15 | RoxygenNote: 7.1.1 16 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | GNU General Public License 2 | ========================== 3 | 4 | _Version 3, 29 June 2007_ 5 | _Copyright © 2007 Free Software Foundation, Inc._ 6 | 7 | Everyone is permitted to copy and distribute verbatim copies of this license 8 | document, but changing it is not allowed. 9 | 10 | ## Preamble 11 | 12 | The GNU General Public License is a free, copyleft license for software and other 13 | kinds of works. 14 | 15 | The licenses for most software and other practical works are designed to take away 16 | your freedom to share and change the works. By contrast, the GNU General Public 17 | License is intended to guarantee your freedom to share and change all versions of a 18 | program--to make sure it remains free software for all its users. We, the Free 19 | Software Foundation, use the GNU General Public License for most of our software; it 20 | applies also to any other work released this way by its authors. You can apply it to 21 | your programs, too. 22 | 23 | When we speak of free software, we are referring to freedom, not price. Our General 24 | Public Licenses are designed to make sure that you have the freedom to distribute 25 | copies of free software (and charge for them if you wish), that you receive source 26 | code or can get it if you want it, that you can change the software or use pieces of 27 | it in new free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you these rights or 30 | asking you to surrender the rights. Therefore, you have certain responsibilities if 31 | you distribute copies of the software, or if you modify it: responsibilities to 32 | respect the freedom of others. 
33 | 34 | For example, if you distribute copies of such a program, whether gratis or for a fee, 35 | you must pass on to the recipients the same freedoms that you received. You must make 36 | sure that they, too, receive or can get the source code. And you must show them these 37 | terms so they know their rights. 38 | 39 | Developers that use the GNU GPL protect your rights with two steps: **(1)** assert 40 | copyright on the software, and **(2)** offer you this License giving you legal permission 41 | to copy, distribute and/or modify it. 42 | 43 | For the developers' and authors' protection, the GPL clearly explains that there is 44 | no warranty for this free software. For both users' and authors' sake, the GPL 45 | requires that modified versions be marked as changed, so that their problems will not 46 | be attributed erroneously to authors of previous versions. 47 | 48 | Some devices are designed to deny users access to install or run modified versions of 49 | the software inside them, although the manufacturer can do so. This is fundamentally 50 | incompatible with the aim of protecting users' freedom to change the software. The 51 | systematic pattern of such abuse occurs in the area of products for individuals to 52 | use, which is precisely where it is most unacceptable. Therefore, we have designed 53 | this version of the GPL to prohibit the practice for those products. If such problems 54 | arise substantially in other domains, we stand ready to extend this provision to 55 | those domains in future versions of the GPL, as needed to protect the freedom of 56 | users. 57 | 58 | Finally, every program is threatened constantly by software patents. States should 59 | not allow patents to restrict development and use of software on general-purpose 60 | computers, but in those that do, we wish to avoid the special danger that patents 61 | applied to a free program could make it effectively proprietary. 
To prevent this, the 62 | GPL assures that patents cannot be used to render the program non-free. 63 | 64 | The precise terms and conditions for copying, distribution and modification follow. 65 | 66 | ## TERMS AND CONDITIONS 67 | 68 | ### 0. Definitions 69 | 70 | “This License” refers to version 3 of the GNU General Public License. 71 | 72 | “Copyright” also means copyright-like laws that apply to other kinds of 73 | works, such as semiconductor masks. 74 | 75 | “The Program” refers to any copyrightable work licensed under this 76 | License. Each licensee is addressed as “you”. “Licensees” and 77 | “recipients” may be individuals or organizations. 78 | 79 | To “modify” a work means to copy from or adapt all or part of the work in 80 | a fashion requiring copyright permission, other than the making of an exact copy. The 81 | resulting work is called a “modified version” of the earlier work or a 82 | work “based on” the earlier work. 83 | 84 | A “covered work” means either the unmodified Program or a work based on 85 | the Program. 86 | 87 | To “propagate” a work means to do anything with it that, without 88 | permission, would make you directly or secondarily liable for infringement under 89 | applicable copyright law, except executing it on a computer or modifying a private 90 | copy. Propagation includes copying, distribution (with or without modification), 91 | making available to the public, and in some countries other activities as well. 92 | 93 | To “convey” a work means any kind of propagation that enables other 94 | parties to make or receive copies. Mere interaction with a user through a computer 95 | network, with no transfer of a copy, is not conveying. 
96 | 97 | An interactive user interface displays “Appropriate Legal Notices” to the 98 | extent that it includes a convenient and prominently visible feature that **(1)** 99 | displays an appropriate copyright notice, and **(2)** tells the user that there is no 100 | warranty for the work (except to the extent that warranties are provided), that 101 | licensees may convey the work under this License, and how to view a copy of this 102 | License. If the interface presents a list of user commands or options, such as a 103 | menu, a prominent item in the list meets this criterion. 104 | 105 | ### 1. Source Code 106 | 107 | The “source code” for a work means the preferred form of the work for 108 | making modifications to it. “Object code” means any non-source form of a 109 | work. 110 | 111 | A “Standard Interface” means an interface that either is an official 112 | standard defined by a recognized standards body, or, in the case of interfaces 113 | specified for a particular programming language, one that is widely used among 114 | developers working in that language. 115 | 116 | The “System Libraries” of an executable work include anything, other than 117 | the work as a whole, that **(a)** is included in the normal form of packaging a Major 118 | Component, but which is not part of that Major Component, and **(b)** serves only to 119 | enable use of the work with that Major Component, or to implement a Standard 120 | Interface for which an implementation is available to the public in source code form. 121 | A “Major Component”, in this context, means a major essential component 122 | (kernel, window system, and so on) of the specific operating system (if any) on which 123 | the executable work runs, or a compiler used to produce the work, or an object code 124 | interpreter used to run it. 
125 | 126 | The “Corresponding Source” for a work in object code form means all the 127 | source code needed to generate, install, and (for an executable work) run the object 128 | code and to modify the work, including scripts to control those activities. However, 129 | it does not include the work's System Libraries, or general-purpose tools or 130 | generally available free programs which are used unmodified in performing those 131 | activities but which are not part of the work. For example, Corresponding Source 132 | includes interface definition files associated with source files for the work, and 133 | the source code for shared libraries and dynamically linked subprograms that the work 134 | is specifically designed to require, such as by intimate data communication or 135 | control flow between those subprograms and other parts of the work. 136 | 137 | The Corresponding Source need not include anything that users can regenerate 138 | automatically from other parts of the Corresponding Source. 139 | 140 | The Corresponding Source for a work in source code form is that same work. 141 | 142 | ### 2. Basic Permissions 143 | 144 | All rights granted under this License are granted for the term of copyright on the 145 | Program, and are irrevocable provided the stated conditions are met. This License 146 | explicitly affirms your unlimited permission to run the unmodified Program. The 147 | output from running a covered work is covered by this License only if the output, 148 | given its content, constitutes a covered work. This License acknowledges your rights 149 | of fair use or other equivalent, as provided by copyright law. 150 | 151 | You may make, run and propagate covered works that you do not convey, without 152 | conditions so long as your license otherwise remains in force. 
You may convey covered 153 | works to others for the sole purpose of having them make modifications exclusively 154 | for you, or provide you with facilities for running those works, provided that you 155 | comply with the terms of this License in conveying all material for which you do not 156 | control copyright. Those thus making or running the covered works for you must do so 157 | exclusively on your behalf, under your direction and control, on terms that prohibit 158 | them from making any copies of your copyrighted material outside their relationship 159 | with you. 160 | 161 | Conveying under any other circumstances is permitted solely under the conditions 162 | stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 163 | 164 | ### 3. Protecting Users' Legal Rights From Anti-Circumvention Law 165 | 166 | No covered work shall be deemed part of an effective technological measure under any 167 | applicable law fulfilling obligations under article 11 of the WIPO copyright treaty 168 | adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention 169 | of such measures. 170 | 171 | When you convey a covered work, you waive any legal power to forbid circumvention of 172 | technological measures to the extent such circumvention is effected by exercising 173 | rights under this License with respect to the covered work, and you disclaim any 174 | intention to limit operation or modification of the work as a means of enforcing, 175 | against the work's users, your or third parties' legal rights to forbid circumvention 176 | of technological measures. 177 | 178 | ### 4. 
Conveying Verbatim Copies 179 | 180 | You may convey verbatim copies of the Program's source code as you receive it, in any 181 | medium, provided that you conspicuously and appropriately publish on each copy an 182 | appropriate copyright notice; keep intact all notices stating that this License and 183 | any non-permissive terms added in accord with section 7 apply to the code; keep 184 | intact all notices of the absence of any warranty; and give all recipients a copy of 185 | this License along with the Program. 186 | 187 | You may charge any price or no price for each copy that you convey, and you may offer 188 | support or warranty protection for a fee. 189 | 190 | ### 5. Conveying Modified Source Versions 191 | 192 | You may convey a work based on the Program, or the modifications to produce it from 193 | the Program, in the form of source code under the terms of section 4, provided that 194 | you also meet all of these conditions: 195 | 196 | * **a)** The work must carry prominent notices stating that you modified it, and giving a 197 | relevant date. 198 | * **b)** The work must carry prominent notices stating that it is released under this 199 | License and any conditions added under section 7. This requirement modifies the 200 | requirement in section 4 to “keep intact all notices”. 201 | * **c)** You must license the entire work, as a whole, under this License to anyone who 202 | comes into possession of a copy. This License will therefore apply, along with any 203 | applicable section 7 additional terms, to the whole of the work, and all its parts, 204 | regardless of how they are packaged. This License gives no permission to license the 205 | work in any other way, but it does not invalidate such permission if you have 206 | separately received it. 
207 | * **d)** If the work has interactive user interfaces, each must display Appropriate Legal 208 | Notices; however, if the Program has interactive interfaces that do not display 209 | Appropriate Legal Notices, your work need not make them do so. 210 | 211 | A compilation of a covered work with other separate and independent works, which are 212 | not by their nature extensions of the covered work, and which are not combined with 213 | it such as to form a larger program, in or on a volume of a storage or distribution 214 | medium, is called an “aggregate” if the compilation and its resulting 215 | copyright are not used to limit the access or legal rights of the compilation's users 216 | beyond what the individual works permit. Inclusion of a covered work in an aggregate 217 | does not cause this License to apply to the other parts of the aggregate. 218 | 219 | ### 6. Conveying Non-Source Forms 220 | 221 | You may convey a covered work in object code form under the terms of sections 4 and 222 | 5, provided that you also convey the machine-readable Corresponding Source under the 223 | terms of this License, in one of these ways: 224 | 225 | * **a)** Convey the object code in, or embodied in, a physical product (including a 226 | physical distribution medium), accompanied by the Corresponding Source fixed on a 227 | durable physical medium customarily used for software interchange. 
228 | * **b)** Convey the object code in, or embodied in, a physical product (including a 229 | physical distribution medium), accompanied by a written offer, valid for at least 230 | three years and valid for as long as you offer spare parts or customer support for 231 | that product model, to give anyone who possesses the object code either **(1)** a copy of 232 | the Corresponding Source for all the software in the product that is covered by this 233 | License, on a durable physical medium customarily used for software interchange, for 234 | a price no more than your reasonable cost of physically performing this conveying of 235 | source, or **(2)** access to copy the Corresponding Source from a network server at no 236 | charge. 237 | * **c)** Convey individual copies of the object code with a copy of the written offer to 238 | provide the Corresponding Source. This alternative is allowed only occasionally and 239 | noncommercially, and only if you received the object code with such an offer, in 240 | accord with subsection 6b. 241 | * **d)** Convey the object code by offering access from a designated place (gratis or for 242 | a charge), and offer equivalent access to the Corresponding Source in the same way 243 | through the same place at no further charge. You need not require recipients to copy 244 | the Corresponding Source along with the object code. If the place to copy the object 245 | code is a network server, the Corresponding Source may be on a different server 246 | (operated by you or a third party) that supports equivalent copying facilities, 247 | provided you maintain clear directions next to the object code saying where to find 248 | the Corresponding Source. Regardless of what server hosts the Corresponding Source, 249 | you remain obligated to ensure that it is available for as long as needed to satisfy 250 | these requirements. 
251 | * **e)** Convey the object code using peer-to-peer transmission, provided you inform 252 | other peers where the object code and Corresponding Source of the work are being 253 | offered to the general public at no charge under subsection 6d. 254 | 255 | A separable portion of the object code, whose source code is excluded from the 256 | Corresponding Source as a System Library, need not be included in conveying the 257 | object code work. 258 | 259 | A “User Product” is either **(1)** a “consumer product”, which 260 | means any tangible personal property which is normally used for personal, family, or 261 | household purposes, or **(2)** anything designed or sold for incorporation into a 262 | dwelling. In determining whether a product is a consumer product, doubtful cases 263 | shall be resolved in favor of coverage. For a particular product received by a 264 | particular user, “normally used” refers to a typical or common use of 265 | that class of product, regardless of the status of the particular user or of the way 266 | in which the particular user actually uses, or expects or is expected to use, the 267 | product. A product is a consumer product regardless of whether the product has 268 | substantial commercial, industrial or non-consumer uses, unless such uses represent 269 | the only significant mode of use of the product. 270 | 271 | “Installation Information” for a User Product means any methods, 272 | procedures, authorization keys, or other information required to install and execute 273 | modified versions of a covered work in that User Product from a modified version of 274 | its Corresponding Source. The information must suffice to ensure that the continued 275 | functioning of the modified object code is in no case prevented or interfered with 276 | solely because modification has been made. 
277 | 278 | If you convey an object code work under this section in, or with, or specifically for 279 | use in, a User Product, and the conveying occurs as part of a transaction in which 280 | the right of possession and use of the User Product is transferred to the recipient 281 | in perpetuity or for a fixed term (regardless of how the transaction is 282 | characterized), the Corresponding Source conveyed under this section must be 283 | accompanied by the Installation Information. But this requirement does not apply if 284 | neither you nor any third party retains the ability to install modified object code 285 | on the User Product (for example, the work has been installed in ROM). 286 | 287 | The requirement to provide Installation Information does not include a requirement to 288 | continue to provide support service, warranty, or updates for a work that has been 289 | modified or installed by the recipient, or for the User Product in which it has been 290 | modified or installed. Access to a network may be denied when the modification itself 291 | materially and adversely affects the operation of the network or violates the rules 292 | and protocols for communication across the network. 293 | 294 | Corresponding Source conveyed, and Installation Information provided, in accord with 295 | this section must be in a format that is publicly documented (and with an 296 | implementation available to the public in source code form), and must require no 297 | special password or key for unpacking, reading or copying. 298 | 299 | ### 7. Additional Terms 300 | 301 | “Additional permissions” are terms that supplement the terms of this 302 | License by making exceptions from one or more of its conditions. Additional 303 | permissions that are applicable to the entire Program shall be treated as though they 304 | were included in this License, to the extent that they are valid under applicable 305 | law. 
If additional permissions apply only to part of the Program, that part may be 306 | used separately under those permissions, but the entire Program remains governed by 307 | this License without regard to the additional permissions. 308 | 309 | When you convey a copy of a covered work, you may at your option remove any 310 | additional permissions from that copy, or from any part of it. (Additional 311 | permissions may be written to require their own removal in certain cases when you 312 | modify the work.) You may place additional permissions on material, added by you to a 313 | covered work, for which you have or can give appropriate copyright permission. 314 | 315 | Notwithstanding any other provision of this License, for material you add to a 316 | covered work, you may (if authorized by the copyright holders of that material) 317 | supplement the terms of this License with terms: 318 | 319 | * **a)** Disclaiming warranty or limiting liability differently from the terms of 320 | sections 15 and 16 of this License; or 321 | * **b)** Requiring preservation of specified reasonable legal notices or author 322 | attributions in that material or in the Appropriate Legal Notices displayed by works 323 | containing it; or 324 | * **c)** Prohibiting misrepresentation of the origin of that material, or requiring that 325 | modified versions of such material be marked in reasonable ways as different from the 326 | original version; or 327 | * **d)** Limiting the use for publicity purposes of names of licensors or authors of the 328 | material; or 329 | * **e)** Declining to grant rights under trademark law for use of some trade names, 330 | trademarks, or service marks; or 331 | * **f)** Requiring indemnification of licensors and authors of that material by anyone 332 | who conveys the material (or modified versions of it) with contractual assumptions of 333 | liability to the recipient, for any liability that these contractual assumptions 334 | directly impose on those 
licensors and authors. 335 | 336 | All other non-permissive additional terms are considered “further 337 | restrictions” within the meaning of section 10. If the Program as you received 338 | it, or any part of it, contains a notice stating that it is governed by this License 339 | along with a term that is a further restriction, you may remove that term. If a 340 | license document contains a further restriction but permits relicensing or conveying 341 | under this License, you may add to a covered work material governed by the terms of 342 | that license document, provided that the further restriction does not survive such 343 | relicensing or conveying. 344 | 345 | If you add terms to a covered work in accord with this section, you must place, in 346 | the relevant source files, a statement of the additional terms that apply to those 347 | files, or a notice indicating where to find the applicable terms. 348 | 349 | Additional terms, permissive or non-permissive, may be stated in the form of a 350 | separately written license, or stated as exceptions; the above requirements apply 351 | either way. 352 | 353 | ### 8. Termination 354 | 355 | You may not propagate or modify a covered work except as expressly provided under 356 | this License. Any attempt otherwise to propagate or modify it is void, and will 357 | automatically terminate your rights under this License (including any patent licenses 358 | granted under the third paragraph of section 11). 359 | 360 | However, if you cease all violation of this License, then your license from a 361 | particular copyright holder is reinstated **(a)** provisionally, unless and until the 362 | copyright holder explicitly and finally terminates your license, and **(b)** permanently, 363 | if the copyright holder fails to notify you of the violation by some reasonable means 364 | prior to 60 days after the cessation. 
365 | 366 | Moreover, your license from a particular copyright holder is reinstated permanently 367 | if the copyright holder notifies you of the violation by some reasonable means, this 368 | is the first time you have received notice of violation of this License (for any 369 | work) from that copyright holder, and you cure the violation prior to 30 days after 370 | your receipt of the notice. 371 | 372 | Termination of your rights under this section does not terminate the licenses of 373 | parties who have received copies or rights from you under this License. If your 374 | rights have been terminated and not permanently reinstated, you do not qualify to 375 | receive new licenses for the same material under section 10. 376 | 377 | ### 9. Acceptance Not Required for Having Copies 378 | 379 | You are not required to accept this License in order to receive or run a copy of the 380 | Program. Ancillary propagation of a covered work occurring solely as a consequence of 381 | using peer-to-peer transmission to receive a copy likewise does not require 382 | acceptance. However, nothing other than this License grants you permission to 383 | propagate or modify any covered work. These actions infringe copyright if you do not 384 | accept this License. Therefore, by modifying or propagating a covered work, you 385 | indicate your acceptance of this License to do so. 386 | 387 | ### 10. Automatic Licensing of Downstream Recipients 388 | 389 | Each time you convey a covered work, the recipient automatically receives a license 390 | from the original licensors, to run, modify and propagate that work, subject to this 391 | License. You are not responsible for enforcing compliance by third parties with this 392 | License. 393 | 394 | An “entity transaction” is a transaction transferring control of an 395 | organization, or substantially all assets of one, or subdividing an organization, or 396 | merging organizations. 
If propagation of a covered work results from an entity 397 | transaction, each party to that transaction who receives a copy of the work also 398 | receives whatever licenses to the work the party's predecessor in interest had or 399 | could give under the previous paragraph, plus a right to possession of the 400 | Corresponding Source of the work from the predecessor in interest, if the predecessor 401 | has it or can get it with reasonable efforts. 402 | 403 | You may not impose any further restrictions on the exercise of the rights granted or 404 | affirmed under this License. For example, you may not impose a license fee, royalty, 405 | or other charge for exercise of rights granted under this License, and you may not 406 | initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging 407 | that any patent claim is infringed by making, using, selling, offering for sale, or 408 | importing the Program or any portion of it. 409 | 410 | ### 11. Patents 411 | 412 | A “contributor” is a copyright holder who authorizes use under this 413 | License of the Program or a work on which the Program is based. The work thus 414 | licensed is called the contributor's “contributor version”. 415 | 416 | A contributor's “essential patent claims” are all patent claims owned or 417 | controlled by the contributor, whether already acquired or hereafter acquired, that 418 | would be infringed by some manner, permitted by this License, of making, using, or 419 | selling its contributor version, but do not include claims that would be infringed 420 | only as a consequence of further modification of the contributor version. For 421 | purposes of this definition, “control” includes the right to grant patent 422 | sublicenses in a manner consistent with the requirements of this License. 
423 | 424 | Each contributor grants you a non-exclusive, worldwide, royalty-free patent license 425 | under the contributor's essential patent claims, to make, use, sell, offer for sale, 426 | import and otherwise run, modify and propagate the contents of its contributor 427 | version. 428 | 429 | In the following three paragraphs, a “patent license” is any express 430 | agreement or commitment, however denominated, not to enforce a patent (such as an 431 | express permission to practice a patent or covenant not to sue for patent 432 | infringement). To “grant” such a patent license to a party means to make 433 | such an agreement or commitment not to enforce a patent against the party. 434 | 435 | If you convey a covered work, knowingly relying on a patent license, and the 436 | Corresponding Source of the work is not available for anyone to copy, free of charge 437 | and under the terms of this License, through a publicly available network server or 438 | other readily accessible means, then you must either **(1)** cause the Corresponding 439 | Source to be so available, or **(2)** arrange to deprive yourself of the benefit of the 440 | patent license for this particular work, or **(3)** arrange, in a manner consistent with 441 | the requirements of this License, to extend the patent license to downstream 442 | recipients. “Knowingly relying” means you have actual knowledge that, but 443 | for the patent license, your conveying the covered work in a country, or your 444 | recipient's use of the covered work in a country, would infringe one or more 445 | identifiable patents in that country that you have reason to believe are valid. 
446 | 447 | If, pursuant to or in connection with a single transaction or arrangement, you 448 | convey, or propagate by procuring conveyance of, a covered work, and grant a patent 449 | license to some of the parties receiving the covered work authorizing them to use, 450 | propagate, modify or convey a specific copy of the covered work, then the patent 451 | license you grant is automatically extended to all recipients of the covered work and 452 | works based on it. 453 | 454 | A patent license is “discriminatory” if it does not include within the 455 | scope of its coverage, prohibits the exercise of, or is conditioned on the 456 | non-exercise of one or more of the rights that are specifically granted under this 457 | License. You may not convey a covered work if you are a party to an arrangement with 458 | a third party that is in the business of distributing software, under which you make 459 | payment to the third party based on the extent of your activity of conveying the 460 | work, and under which the third party grants, to any of the parties who would receive 461 | the covered work from you, a discriminatory patent license **(a)** in connection with 462 | copies of the covered work conveyed by you (or copies made from those copies), or **(b)** 463 | primarily for and in connection with specific products or compilations that contain 464 | the covered work, unless you entered into that arrangement, or that patent license 465 | was granted, prior to 28 March 2007. 466 | 467 | Nothing in this License shall be construed as excluding or limiting any implied 468 | license or other defenses to infringement that may otherwise be available to you 469 | under applicable patent law. 470 | 471 | ### 12. No Surrender of Others' Freedom 472 | 473 | If conditions are imposed on you (whether by court order, agreement or otherwise) 474 | that contradict the conditions of this License, they do not excuse you from the 475 | conditions of this License. 
If you cannot convey a covered work so as to satisfy 476 | simultaneously your obligations under this License and any other pertinent 477 | obligations, then as a consequence you may not convey it at all. For example, if you 478 | agree to terms that obligate you to collect a royalty for further conveying from 479 | those to whom you convey the Program, the only way you could satisfy both those terms 480 | and this License would be to refrain entirely from conveying the Program. 481 | 482 | ### 13. Use with the GNU Affero General Public License 483 | 484 | Notwithstanding any other provision of this License, you have permission to link or 485 | combine any covered work with a work licensed under version 3 of the GNU Affero 486 | General Public License into a single combined work, and to convey the resulting work. 487 | The terms of this License will continue to apply to the part which is the covered 488 | work, but the special requirements of the GNU Affero General Public License, section 489 | 13, concerning interaction through a network will apply to the combination as such. 490 | 491 | ### 14. Revised Versions of this License 492 | 493 | The Free Software Foundation may publish revised and/or new versions of the GNU 494 | General Public License from time to time. Such new versions will be similar in spirit 495 | to the present version, but may differ in detail to address new problems or concerns. 496 | 497 | Each version is given a distinguishing version number. If the Program specifies that 498 | a certain numbered version of the GNU General Public License “or any later 499 | version” applies to it, you have the option of following the terms and 500 | conditions either of that numbered version or of any later version published by the 501 | Free Software Foundation. If the Program does not specify a version number of the GNU 502 | General Public License, you may choose any version ever published by the Free 503 | Software Foundation. 
504 | 505 | If the Program specifies that a proxy can decide which future versions of the GNU 506 | General Public License can be used, that proxy's public statement of acceptance of a 507 | version permanently authorizes you to choose that version for the Program. 508 | 509 | Later license versions may give you additional or different permissions. However, no 510 | additional obligations are imposed on any author or copyright holder as a result of 511 | your choosing to follow a later version. 512 | 513 | ### 15. Disclaimer of Warranty 514 | 515 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 516 | EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 517 | PROVIDE THE PROGRAM “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER 518 | EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 519 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE 520 | QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE 521 | DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 522 | 523 | ### 16. Limitation of Liability 524 | 525 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY 526 | COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS 527 | PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, 528 | INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE 529 | PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE 530 | OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE 531 | WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 532 | POSSIBILITY OF SUCH DAMAGES. 533 | 534 | ### 17. 
Interpretation of Sections 15 and 16 535 | 536 | If the disclaimer of warranty and limitation of liability provided above cannot be 537 | given local legal effect according to their terms, reviewing courts shall apply local 538 | law that most closely approximates an absolute waiver of all civil liability in 539 | connection with the Program, unless a warranty or assumption of liability accompanies 540 | a copy of the Program in return for a fee. 541 | 542 | _END OF TERMS AND CONDITIONS_ 543 | 544 | ## How to Apply These Terms to Your New Programs 545 | 546 | If you develop a new program, and you want it to be of the greatest possible use to 547 | the public, the best way to achieve this is to make it free software which everyone 548 | can redistribute and change under these terms. 549 | 550 | To do so, attach the following notices to the program. It is safest to attach them 551 | to the start of each source file to most effectively state the exclusion of warranty; 552 | and each file should have at least the “copyright” line and a pointer to 553 | where the full notice is found. 554 | 555 | 556 | Copyright (C) 2020 Belinda Phipson 557 | 558 | This program is free software: you can redistribute it and/or modify 559 | it under the terms of the GNU General Public License as published by 560 | the Free Software Foundation, either version 3 of the License, or 561 | (at your option) any later version. 562 | 563 | This program is distributed in the hope that it will be useful, 564 | but WITHOUT ANY WARRANTY; without even the implied warranty of 565 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 566 | GNU General Public License for more details. 567 | 568 | You should have received a copy of the GNU General Public License 569 | along with this program. If not, see . 570 | 571 | Also add information on how to contact you by electronic and paper mail. 
572 | 573 | If the program does terminal interaction, make it output a short notice like this 574 | when it starts in an interactive mode: 575 | 576 | speckle Copyright (C) 2020 Belinda Phipson 577 | This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'. 578 | This is free software, and you are welcome to redistribute it 579 | under certain conditions; type 'show c' for details. 580 | 581 | The hypothetical commands `show w` and `show c` should show the appropriate parts of 582 | the General Public License. Of course, your program's commands might be different; 583 | for a GUI interface, you would use an “about box”. 584 | 585 | You should also get your employer (if you work as a programmer) or school, if any, to 586 | sign a “copyright disclaimer” for the program, if necessary. For more 587 | information on this, and how to apply and follow the GNU GPL, see 588 | <>. 589 | 590 | The GNU General Public License does not permit incorporating your program into 591 | proprietary programs. If your program is a subroutine library, you may consider it 592 | more useful to permit linking proprietary applications with the library. If this is 593 | what you want to do, use the GNU Lesser General Public License instead of this 594 | License. But first, please read 595 | <>. 
596 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(classifySex) 4 | export(estimateBetaParam) 5 | export(estimateBetaParamsFromCounts) 6 | export(getTransformedProps) 7 | export(ggplotColors) 8 | export(normCounts) 9 | export(normSca) 10 | export(plotCellTypeMeanVar) 11 | export(plotCellTypeProps) 12 | export(plotCellTypePropsMeanVar) 13 | export(preprocess) 14 | export(propeller) 15 | export(propeller.anova) 16 | export(propeller.ttest) 17 | export(speckle_example_data) 18 | importFrom(AnnotationDbi,select) 19 | importFrom(Seurat,Idents) 20 | importFrom(SingleCellExperiment,colData) 21 | importFrom(edgeR,DGEList) 22 | importFrom(edgeR,estimateDisp) 23 | importFrom(ggplot2,aes) 24 | importFrom(ggplot2,element_text) 25 | importFrom(ggplot2,geom_bar) 26 | importFrom(ggplot2,ggplot) 27 | importFrom(ggplot2,theme) 28 | importFrom(grDevices,hcl) 29 | importFrom(graphics,legend) 30 | importFrom(graphics,lines) 31 | importFrom(graphics,par) 32 | importFrom(graphics,title) 33 | importFrom(limma,contrasts.fit) 34 | importFrom(limma,eBayes) 35 | importFrom(limma,lmFit) 36 | importFrom(methods,is) 37 | importFrom(org.Hs.eg.db,org.Hs.eg.db) 38 | importFrom(org.Mm.eg.db,org.Mm.eg.db) 39 | importFrom(scuttle,perCellQCMetrics) 40 | importFrom(scuttle,quickPerCellQC) 41 | importFrom(stats,lowess) 42 | importFrom(stats,median) 43 | importFrom(stats,p.adjust) 44 | importFrom(stats,predict) 45 | importFrom(stats,var) 46 | importFrom(stringr,str_to_title) 47 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | # speckle 0.0.3 2 | * Added functions to classify cells as male or female 3 | * Change propeller transform default to logit 4 | 5 | # speckle 0.0.2 6 | * Added 
functions to plot the mean variance relationship of the cell type 7 | counts and proportions 8 | * Added functions to estimate parameters of a Beta distribution 9 | * Added logit transformation option to propeller 10 | 11 | # speckle 0.0.1 12 | 13 | * First version of the speckle package contains propeller functions to test for 14 | differences in cell type composition between groups of samples in single cell 15 | RNA-Seq data 16 | -------------------------------------------------------------------------------- /R/classifySex.R: -------------------------------------------------------------------------------- 1 | #' Predict sex of cells in scRNA-seq data 2 | #' 3 | #' This function will predict the sex for each cell in scRNA-seq data. The 4 | #' classifier is based on logistic regression models that have been trained 5 | #' on mouse and human single cell RNA-seq data. 6 | #' 7 | #' For bulk RNA-seq, checking the sex of the samples for mouse and human 8 | #' experiments is trivial as we can simply check the expression of *Xist/XIST*. 9 | #' It is not as simple for single cell RNA-seq data as the number of counts 10 | #' measured per gene and per cell is often quite low. Simply relying on cut-offs 11 | #' on the expression of genes like *Xist* means that many cells are unable to 12 | #' be classified. Hence we have developed a classifier based on a combination of 13 | #' X- and Y-linked genes in order to accurately predict the sex of each cell. 14 | #' 15 | #' Cells with zero counts on Xist and the sum of the Y chromosome genes will 16 | #' not be classified as there is simply not enough information to accurately 17 | #' classify as Male/Female, and NAs will be returned. In addition, the user has 18 | #' the option to perform quality control on the data first, by specifying 19 | #' \code{qc=TRUE}, which will not classify cells that are deemed low-quality. 
20 | #' 21 | #' @aliases classifySex 22 | #' @param x counts matrix, rows correspond to genes and columns correspond to 23 | #' cells. Row names must be gene symbols. 24 | #' @param genome the genome the data arises from. Current options are 25 | #' human: genome = "Hs" or mouse: genome = "Mm". 26 | #' @param qc logical, indicates whether to perform quality control or not. 27 | #' qc = TRUE will predict cells that pass quality control only and the filtered 28 | #' cells will not be classified. qc = FALSE will predict every cell except the 29 | #' cells with zero counts on *XIST/Xist* and the sum of the Y genes. Default is TRUE. 30 | #' 31 | #' @return a dataframe with predicted labels for each cell 32 | #' 33 | #' @importFrom stats predict 34 | #' @export classifySex 35 | #' 36 | #' @author Xinyi Jin 37 | #' 38 | #' @examples 39 | #' 40 | #' library(speckle) 41 | #' library(SingleCellExperiment) 42 | #' library(CellBench) 43 | #' library(org.Hs.eg.db) 44 | #' 45 | #' sc_data <- load_sc_data() 46 | #' sc_10x <- sc_data$sc_10x 47 | #' 48 | #' counts <- counts(sc_10x) 49 | #' ann <- select(org.Hs.eg.db, keys=rownames(sc_10x), 50 | #' columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL") 51 | #' m <- match(rownames(counts), ann$ENSEMBL) 52 | #' rownames(counts) <- ann$SYMBOL[m] 53 | #' 54 | #' sex <- classifySex(counts, genome="Hs") 55 | #' 56 | #' table(sex$prediction) 57 | #' boxplot(counts["XIST",]~sex$prediction) 58 | #' 59 | classifySex<-function(x, genome=NULL, qc = TRUE) 60 | # Classify cells as male or female 61 | # Xinyi Jin and Belinda Phipson 62 | # 11 February 2021 63 | # Modified 11 February 2021 64 | { 65 | # Perform some checks on the data 66 | if(is.null(x)) stop("Counts matrix missing") 67 | x <- as.matrix(x) 68 | if(is.null(genome)){ 69 | message("Genome not specified. Human genome used. Options are 'Hs' for 70 | human and 'Mm' for mouse. 
We currently don't support other genomes.") 71 | } 72 | # Default is Hs 73 | genome <- match.arg(genome,c("Hs","Mm")) 74 | 75 | # pre-process 76 | processed.data<-preprocess(x, genome = genome, qc = qc) 77 | 78 | # the processed transposed count matrix 79 | tcm <-processed.data$tcm.final 80 | 81 | # the normalised, scaled transposed count matrix 82 | data.df <- processed.data$data.df 83 | 84 | # cells filtered out by QC 85 | discarded.cells <- processed.data$discarded.cells 86 | 87 | # cells with zero counts on XIST and superY.all 88 | zero.cells <- processed.data$zero.cells 89 | 90 | # store the final predictions 91 | final.pred<-data.frame(prediction=rep("NA", ncol(x))) 92 | row.names(final.pred)<- colnames(x) 93 | 94 | # load trained models 95 | if(genome == "Mm"){ 96 | model <- Mm.model 97 | } 98 | else{ 99 | model <- Hs.model 100 | } 101 | 102 | preds <- predict(model, newdata = data.df) 103 | final.pred[row.names(data.df), "prediction"]<- as.character(preds) 104 | 105 | final.pred 106 | } -------------------------------------------------------------------------------- /R/estimateBetaParam.R: -------------------------------------------------------------------------------- 1 | #' Estimate the parameters of a Beta distribution 2 | #' 3 | #' This function estimates the two parameters of the Beta distribution, alpha 4 | #' and beta, given a vector of proportions. It uses the method of moments to 5 | #' do this. 6 | #' 7 | #' @param x a vector of proportions. 8 | #' 9 | #' @return a list object with the estimate of alpha in \code{a} and beta in 10 | #' \code{b}.
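#'
#' @details As coded in the function body, with \code{mu} the sample mean and
#' \code{V} the sample variance of \code{x}, the method of moments estimates
#' are \code{a = ((1-mu)/V - 1/mu)*mu^2} and
#' \code{b = ((1-mu)/V - 1/mu)*mu*(1-mu)}, so that \code{a/(a+b)} equals
#' \code{mu}.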
11 | #' 12 | #' @export estimateBetaParam 13 | #' @importFrom stats var 14 | #' 15 | #' @author Belinda Phipson 16 | #' 17 | #' @examples 18 | #' # Generate proportions from a beta distribution 19 | #' props <- rbeta(1000, shape1=2, shape2=10) 20 | #' estimateBetaParam(props) 21 | #' 22 | estimateBetaParam <- function(x){ 23 | # solve for the hyperparameters of the beta distribution given a vector 24 | # of proportions 25 | mu <- mean(x) 26 | V <- var(x) 27 | a =((1-mu)/V - 1/mu)*mu^2 28 | b = ((1-mu)/V - 1/mu)*mu*(1-mu) 29 | list(a=a,b=b) 30 | } 31 | -------------------------------------------------------------------------------- /R/estimateBetaParamsFromCounts.R: -------------------------------------------------------------------------------- 1 | #' Estimate parameters of a Beta distribution from counts 2 | #' 3 | #' This function estimates the two parameters of the Beta distribution, alpha 4 | #' and beta for each cell type. The input is a matrix of cell type counts, 5 | #' where the rows are the cell types/clusters and the columns are the samples. 6 | #' 7 | #' This function is called from the plotting function \code{plotCellTypeMeanVar} 8 | #' in order to estimate the variance for the Beta-Binomial distribution for 9 | #' each cell type. 
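#'
#' As computed in the function body, the returned components are related by
#' \code{pi = alpha/(alpha+beta)}, \code{dispersion = 1/(alpha+beta)} and
#' \code{var = n*pi*(1-pi)*(n*dispersion+1)/(1+dispersion)}, where \code{n}
#' is the mean column total of the normalised counts.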
10 | #' 11 | #' @param x a matrix of counts 12 | #' 13 | #' @return outputs a list object with the following components 14 | #' \item{n }{Normalised library size} 15 | #' \item{alpha }{a vector of alpha parameters for the Beta distribution for 16 | #' each cell type} 17 | #' \item{beta }{a vector of beta parameters for the Beta distribution for 18 | #' each cell type} 19 | #' \item{pi }{Estimate of the true proportion for each cell type} 20 | #' \item{dispersion }{Dispersion estimates for each cell type} 21 | #' \item{var }{Variance estimates for each cell type} 22 | #' 23 | #' @export 24 | #' 25 | #' @author Belinda Phipson 26 | #' 27 | #' @examples 28 | #' data <- speckle_example_data() 29 | #' x <- table(data$clusters, data$samples) 30 | #' estimateBetaParamsFromCounts(x) 31 | #' 32 | #' 33 | estimateBetaParamsFromCounts <- function(x){ 34 | # Make sure input is a matrix 35 | counts <- as.matrix(x) 36 | # Normalise the counts so that the total number of counts per sample is equal 37 | nc <- normCounts(counts) 38 | # Get cell type means 39 | m1 <- rowMeans(nc) 40 | # Get second raw moment (mean of squared counts) for each cell type 41 | m2 <- rowSums(nc^2)/ncol(nc) 42 | n <- mean(colSums(nc)) 43 | alpha <- (n*m1-m2)/(n*(m2/m1-m1-1)+m1) 44 | beta <- ((n-m1)*(n-m2/m1))/(n*(m2/m1-m1-1)+m1) 45 | disp <- 1/(alpha+beta) 46 | pi <- alpha/(alpha+beta) 47 | var <- n*pi*(1-pi)*(n*disp+1)/(1+disp) 48 | output <- list(n=n, alpha=alpha, beta=beta, pi=pi, dispersion=disp, var=var) 49 | output 50 | } 51 | -------------------------------------------------------------------------------- /R/getTransformedProps.R: -------------------------------------------------------------------------------- 1 | #' Calculates and transforms cell type proportions 2 | #' 3 | #' Calculates cell type proportions based on clusters/cell types and sample 4 | #' information and performs a variance stabilising transformation on the 5 | #' proportions.
6 | #' 7 | #' This function is called by the \code{propeller} function and calculates cell 8 | #' type proportions and applies either an arcsin-square root or a logit transformation. 9 | #' 10 | #' @param clusters a factor specifying the cluster or cell type for every cell. 11 | #' @param sample a factor specifying the biological replicate for every cell. 12 | #' @param transform a character scalar specifying which transformation of the 13 | #' proportions to perform. Possible values are "asin" or "logit". Defaults 14 | #' to "logit". 15 | #' 16 | #' @return outputs a list object with the following components 17 | #' \item{Counts }{A matrix of cell type counts with 18 | #' the rows corresponding to the clusters/cell types and the columns 19 | #' corresponding to the biological replicates/samples.} 20 | #' \item{TransformedProps }{A matrix of transformed cell type proportions with 21 | #' the rows corresponding to the clusters/cell types and the columns 22 | #' corresponding to the biological replicates/samples.} 23 | #' \item{Proportions }{A matrix of cell type proportions with the rows 24 | #' corresponding to the clusters/cell types and the columns corresponding to 25 | #' the biological replicates/samples.} 26 | #' 27 | #' @export 28 | #' 29 | #' @author Belinda Phipson 30 | #' 31 | #' @seealso \code{\link{propeller}} 32 | #' 33 | #' @examples 34 | #' 35 | #' library(speckle) 36 | #' library(ggplot2) 37 | #' library(limma) 38 | #' 39 | #' # Make up some data 40 | #' 41 | #' # True cell type proportions for 4 samples 42 | #' p_s1 <- c(0.5,0.3,0.2) 43 | #' p_s2 <- c(0.6,0.3,0.1) 44 | #' p_s3 <- c(0.3,0.4,0.3) 45 | #' p_s4 <- c(0.4,0.3,0.3) 46 | #' 47 | #' # Total numbers of cells per sample 48 | #' numcells <- c(1000,1500,900,1200) 49 | #' 50 | #' # Generate cell-level vector for sample info 51 | #' biorep <- rep(c("s1","s2","s3","s4"),numcells) 52 | #' length(biorep) 53 | #' 54 | #' # Numbers of cells for each of 3 clusters per sample 55 | #' n_s1 <- p_s1*numcells[1] 56 | #'
n_s2 <- p_s2*numcells[2] 57 | #' n_s3 <- p_s3*numcells[3] 58 | #' n_s4 <- p_s4*numcells[4] 59 | #' 60 | #' cl_s1 <- rep(c("c0","c1","c2"),n_s1) 61 | #' cl_s2 <- rep(c("c0","c1","c2"),n_s2) 62 | #' cl_s3 <- rep(c("c0","c1","c2"),n_s3) 63 | #' cl_s4 <- rep(c("c0","c1","c2"),n_s4) 64 | #' 65 | #' # Generate cell-level vector for cluster info 66 | #' clust <- c(cl_s1,cl_s2,cl_s3,cl_s4) 67 | #' length(clust) 68 | #' 69 | #' getTransformedProps(clusters = clust, sample = biorep) 70 | #' 71 | getTransformedProps <- function(clusters=clusters, sample=sample, 72 | transform=NULL) 73 | { 74 | if(is.null(transform)) transform <- "logit" 75 | 76 | tab <- table(sample, clusters) 77 | props <- tab/rowSums(tab) 78 | if(transform=="asin"){ 79 | message("Performing arcsin square root transformation of proportions") 80 | prop.trans <- asin(sqrt(props)) 81 | } 82 | else if(transform=="logit"){ 83 | message("Performing logit transformation of proportions") 84 | props.pseudo <- (tab+0.5)/rowSums(tab+0.5) 85 | prop.trans <- log(props.pseudo/(1-props.pseudo)) 86 | } 87 | list(Counts=t(tab), TransformedProps=t(prop.trans), Proportions=t(props)) 88 | } 89 | -------------------------------------------------------------------------------- /R/ggplotColors.R: -------------------------------------------------------------------------------- 1 | #' Output a vector of colours based on the ggplot colour scheme 2 | #' 3 | #' This function takes as input the number of colours the user would like, and 4 | #' outputs a vector of colours in the ggplot colour scheme. 5 | #' 6 | #' @param g the number of colours to be generated. 7 | #' 8 | #' @return a vector with the names of the colours. 
9 | #' @export ggplotColors 10 | #' 11 | #' @importFrom grDevices hcl 12 | #' 13 | #' @author Belinda Phipson 14 | #' 15 | #' @examples 16 | #' # Generate a palette of 6 colours 17 | #' cols <- ggplotColors(6) 18 | #' cols 19 | #' 20 | #' # Generate some count data 21 | #' y <- matrix(rnbinom(600, mu=100, size=1), ncol=6) 22 | #' 23 | #' par(mfrow=c(1,1)) 24 | #' boxplot(y, col=cols) 25 | #' 26 | ggplotColors <- function(g){ 27 | 28 | d <- 360/g 29 | 30 | h <- cumsum(c(15, rep(d,g - 1))) 31 | 32 | hcl(h = h, c = 100, l = 65) 33 | 34 | } 35 | -------------------------------------------------------------------------------- /R/normCounts.R: -------------------------------------------------------------------------------- 1 | #' Normalise a counts matrix to the median library size 2 | #' 3 | #' This function takes a \code{DGEList} object or matrix of counts and 4 | #' normalises the counts to the median library size. This puts the normalised 5 | #' counts on a similar scale to the original counts. 6 | #' 7 | #' If the input is a DGEList object, the normalisation factors in 8 | #' \code{norm.factors} are taken into account in the normalisation. The prior 9 | #' counts are added proportionally to the library size 10 | #' 11 | #' @param x a \code{DGEList} object or matrix of counts. 12 | #' @param log logical, indicates whether the output should be on the log2 scale 13 | #' or counts scale. Default is FALSE. 14 | #' @param prior.count The prior count to add if the data is log2 normalised. 15 | #' Default is a small count of 0.5. 16 | #' @param lib.size a vector of library sizes to be used during the normalisation 17 | #' step. Default is NULL and will be computed from the counts matrix. 
18 | #' 19 | #' @return a matrix of normalised counts 20 | #' 21 | #' @export normCounts 22 | #' @importFrom stats median 23 | #' @author Belinda Phipson 24 | #' 25 | #' @examples 26 | #' # Simulate some data from a negative binomial distribution with mean equal 27 | #' # to 100 and dispersion set to 1. Simulate 1000 genes and 6 samples. 28 | #' y <- matrix(rnbinom(6000, mu = 100, size = 1), ncol = 6) 29 | #' 30 | #' # Normalise the counts 31 | #' norm.y <- normCounts(y) 32 | #' 33 | #' # Return log2 normalised counts 34 | #' lnorm.y <- normCounts(y, log=TRUE) 35 | #' 36 | #' # Return log2 normalised counts with prior.count = 2 37 | #' lnorm.y2 <- normCounts(y, log=TRUE, prior.count=2) 38 | #' 39 | #' par(mfrow=c(1,2)) 40 | #' boxplot(norm.y, main="Normalised counts") 41 | #' boxplot(lnorm.y, main="Log2-normalised counts") 42 | #' 43 | normCounts <-function(x, log=FALSE, prior.count=0.5, lib.size=NULL) 44 | # Function to normalise to median library size instead of counts per million 45 | # Input is DGEList object or matrix 46 | # Belinda Phipson 47 | # 30 November 2015 48 | { 49 | if(any(class(x)=="DGEList")){ 50 | lib.size <- x$samples$lib.size*x$samples$norm.factors 51 | counts <- x$counts 52 | } 53 | else{ 54 | counts <- as.matrix(x) 55 | if(is.null(lib.size)){ 56 | lib.size <- colSums(counts) 57 | } 58 | else{ 59 | if(length(lib.size)==ncol(x)) 60 | lib.size <- as.vector(lib.size) 61 | else{ 62 | message("Vector of library sizes does not match dimensions of input 63 | data. 
Calculating library sizes from the counts matrix.") 64 | lib.size <- colSums(counts) 65 | } 66 | } 67 | 68 | } 69 | 70 | M <- median(lib.size) 71 | if(log){ 72 | prior.count.scaled <- lib.size/mean(lib.size)*prior.count 73 | lib.size <- lib.size + 2*prior.count.scaled 74 | log2(t((t(counts)+prior.count.scaled)/lib.size*M)) 75 | } 76 | else t(t(counts)/lib.size*M) 77 | } 78 | -------------------------------------------------------------------------------- /R/normSca.R: -------------------------------------------------------------------------------- 1 | #' Normalise and scale counts matrix 2 | #' 3 | #' This function is called by the \code{preprocess} function and performs 4 | #' log-normalisation and scaling of a counts matrix. 5 | #' 6 | #' @param x a matrix of counts 7 | #' @param lib.size the library size 8 | #' @param log logical, indicates whether the output should be on the log2 scale 9 | #' or counts scale. Default is TRUE. 10 | #' @param prior.count The prior count to add if the data is log2 normalised. 11 | #' Default is a small count of 0.5. 
12 | #' 13 | #' @return a dataframe of log-normalised and scaled counts 14 | #' 15 | #' @export normSca 16 | #' 17 | #' @examples 18 | #' 19 | #' y <- matrix(rnbinom(1000, size=2, mu= 20),ncol=10) 20 | #' colnames(y)<- paste("Cell",1:10, sep="") 21 | #' row.names(y)<-paste("Gene",1:100,sep="") 22 | #' norm.data <- normSca(y,lib.size=colSums(y)) 23 | #' 24 | #' #Visualise the counts vs scaled data 25 | #' boxplot(y) 26 | #' boxplot(norm.data) 27 | #' 28 | normSca<-function(x, lib.size=lib.size, log = TRUE, prior.count = 0.5) 29 | # Normalise and scale counts matrix 30 | # Xinyi Jin and Belinda Phipson 31 | # 17 February 2021 32 | # Modified 17 February 2021 33 | { 34 | x <- as.matrix(x) 35 | # log normalise 36 | normalisedVal<- normCounts(x, log = log, prior.count = prior.count, 37 | lib.size=lib.size) 38 | # scale 39 | #scaledVal<-t(scale(t(normalisedVal))) 40 | #scaledVal 41 | 42 | normalisedVal 43 | } 44 | 45 | 46 | -------------------------------------------------------------------------------- /R/plotCellTypeMeanVar.R: -------------------------------------------------------------------------------- 1 | #' Plot cell type counts means versus variances 2 | #' 3 | #' This function returns a plot of the log10(mean) versus log10(variance) of 4 | #' the cell type counts. The function takes a matrix of cell type counts as 5 | #' input. The rows are the clusters/cell types and the columns are the samples. 6 | #' 7 | #' The expected variance under a binomial distribution is shown in the solid 8 | #' line, and the points represent the observed variance for each cell type/row 9 | #' in the counts table. The expected variance under different model assumptions 10 | #' are shown in the different coloured lines. 11 | #' 12 | #' The mean and variance for each cell type is calculated across all samples. 
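#'
#' As computed in the function body, with \code{n} and \code{pi} taken from
#' \code{estimateBetaParamsFromCounts}, the plotted variances are
#' \code{n*pi*(1-pi)} for the binomial, \code{n*pi} for the Poisson,
#' \code{mean + mean^2*dispersion} for the negative binomial (tagwise
#' dispersions from \code{edgeR::estimateDisp}) and the \code{var} component of
#' \code{estimateBetaParamsFromCounts} for the Beta-binomial.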
13 | #' 14 | #' @param x a matrix or table of counts 15 | #' 16 | #' @return a base R plot 17 | #' 18 | #' @importFrom edgeR estimateDisp 19 | #' @importFrom edgeR DGEList 20 | #' @importFrom graphics legend lines par title 21 | #' @importFrom stats lowess 22 | #' 23 | #' @export 24 | #' 25 | #' @author Belinda Phipson 26 | #' 27 | #' @examples 28 | #' library(edgeR) 29 | #' # Generate some data 30 | #' # Total number of samples 31 | #' nsamp <- 10 32 | #' # True cell type proportions 33 | #' p <- c(0.05, 0.15, 0.35, 0.45) 34 | #' 35 | #' # Parameters for beta distribution 36 | #' a <- 40 37 | #' b <- a*(1-p)/p 38 | #' 39 | #' # Sample total cell counts per sample from negative binomial distribution 40 | #' numcells <- rnbinom(nsamp,size=20,mu=5000) 41 | #' true.p <- matrix(c(rbeta(nsamp,a,b[1]),rbeta(nsamp,a,b[2]), 42 | #' rbeta(nsamp,a,b[3]),rbeta(nsamp,a,b[4])),byrow=TRUE, ncol=nsamp) 43 | #' 44 | #' counts <- matrix(NA,ncol=nsamp, nrow=nrow(true.p)) 45 | #' rownames(counts) <- paste("c",0:(nrow(true.p)-1), sep="") 46 | #' for(j in 1:length(p)){ 47 | #' counts[j,] <- rbinom(nsamp, size=numcells, prob=true.p[j,]) 48 | #' } 49 | #' 50 | #' plotCellTypeMeanVar(counts) 51 | #' 52 | plotCellTypeMeanVar <- function(x){ 53 | # Make sure input is a matrix 54 | x <- as.matrix(x) 55 | # Normalise the cell type counts 56 | nc <- normCounts(x) 57 | 58 | # Beta binomial variance 59 | params <- estimateBetaParamsFromCounts(x) 60 | 61 | #Observed variance 62 | means <- rowMeans(nc) 63 | means.mat <- matrix(rep(means,ncol(nc)),ncol=ncol(nc)) 64 | vars <- rowSums((nc-means.mat)^2)/(ncol(nc)-1) 65 | 66 | # Binomial variance 67 | ebv <- params$n*params$pi*(1-params$pi) 68 | 69 | # Negative binomial variance 70 | y <- estimateDisp(DGEList(x)) 71 | tagvars <- means + means^2 * y$tagwise.dispersion 72 | 73 | ylimits.min <- min(log10(vars),log10(ebv), log10(params$var), 74 | log10(tagvars), log10(params$n*params$pi)) 75 | ylimits.max <- max(log10(vars),log10(ebv), log10(params$var), 
76 | log10(tagvars), log10(params$n*params$pi)) 77 | 78 | par(mar=c(5,5,2,2)) 79 | #par(mfrow=c(1,1)) 80 | plot(log10(means),log10(vars), pch=16, cex=1.5, cex.lab=1.5, cex.axis=1.5, 81 | ylim=c(ylimits.min,ylimits.max), 82 | xlab = "log10(mean)", ylab = "log10(variance)") 83 | lines(lowess(log10(means), log10(ebv)), col=1, lwd=2) 84 | lines(lowess(log10(means), log10(params$var)), lwd=2, col=4) 85 | lines(lowess(log10(means),log10(tagvars)),lwd=2,col="purple", lty=2) 86 | lines(lowess(log10(means),log10(params$n*params$pi)),lwd=2,col="red", lty=2) 87 | legend("bottomright", legend=c("Beta-binomial","Negative binomial", "Binomial", 88 | "Poisson"), 89 | col=c(4,"purple",1,2), lty=c(1,2,1,2), lwd=2) 90 | title("Mean-variance relationship: cell type counts", cex.main=1.5) 91 | 92 | } 93 | -------------------------------------------------------------------------------- /R/plotCellTypeProps.R: -------------------------------------------------------------------------------- 1 | #' Plot cell type proportions for each sample 2 | #' 3 | #' This is a plotting function that shows the cell type composition for each 4 | #' sample as a stacked barplot. The \code{plotCellTypeProps} returns a 5 | #' \code{ggplot2} object enabling the user to make style changes as required. 6 | #' 7 | #' @param x object of class \code{SingleCellExperiment} or \code{Seurat} 8 | #' @param clusters a factor specifying the cluster or cell type for every cell. 9 | #' For \code{SingleCellExperiment} objects this should correspond to a column 10 | #' called \code{clusters} in the \code{colData} assay. For \code{Seurat} 11 | #' objects this will be extracted by a call to \code{Idents(x)}. 12 | #' @param sample a factor specifying the biological replicate for each cell. 13 | #' For \code{SingleCellExperiment} objects this should correspond to a column 14 | #' called \code{sample} in the \code{colData} assay and for \code{Seurat} 15 | #' objects this should correspond to \code{x$sample}. 
16 | #' 17 | #' @return a ggplot2 object 18 | #' 19 | #' @importFrom ggplot2 aes 20 | #' @importFrom ggplot2 element_text 21 | #' @importFrom ggplot2 geom_bar 22 | #' @importFrom ggplot2 ggplot 23 | #' @importFrom ggplot2 theme 24 | #' @export 25 | #' 26 | #' @author Belinda Phipson 27 | #' 28 | #' @examples 29 | #' 30 | #' library(speckle) 31 | #' library(ggplot2) 32 | #' library(limma) 33 | #' 34 | #' # Generate some fake data from a multinomial distribution 35 | #' # Group A, 4 samples, 1000 cells in each sample 36 | #' countsA <- rmultinom(4, size=1000, prob=c(0.1,0.3,0.6)) 37 | #' colnames(countsA) <- paste("s",1:4,sep="") 38 | #' 39 | #' # Group B, 3 samples, 800 cells in each sample 40 | #' 41 | #' countsB <- rmultinom(3, size=800, prob=c(0.2,0.05,0.75)) 42 | #' colnames(countsB) <- paste("s",5:7,sep="") 43 | #' rownames(countsA) <- rownames(countsB) <- paste("c",0:2,sep="") 44 | #' 45 | #' allcounts <- cbind(countsA, countsB) 46 | #' sample <- c(rep(colnames(allcounts),allcounts[1,]), 47 | #' rep(colnames(allcounts),allcounts[2,]), 48 | #' rep(colnames(allcounts),allcounts[3,])) 49 | #' clust <- rep(rownames(allcounts),rowSums(allcounts)) 50 | #' 51 | #' plotCellTypeProps(clusters=clust, sample=sample) 52 | #' 53 | plotCellTypeProps <- function(x=NULL, clusters=NULL, sample=NULL) 54 | { 55 | if(is.null(x) & is.null(sample) & is.null(clusters)) 56 | stop("Please provide either a SingleCellExperiment object or Seurat 57 | object with required annotation metadata, or explicitly provide 58 | clusters and sample information") 59 | 60 | if((is.null(clusters) | is.null(sample)) & !is.null(x)){ 61 | # Extract cluster, sample and group info from SCE object 62 | if(is(x,"SingleCellExperiment")) 63 | y <- .extractSCE(x) 64 | 65 | # Extract cluster, sample and group info from Seurat object 66 | if(is(x,"Seurat")) 67 | y <- .extractSeurat(x) 68 | 69 | clusters <- y$clusters 70 | sample <- y$sample 71 | } 72 | 73 | prop.list <- getTransformedProps(clusters, sample) 74 | 
75 | Proportions <- as.vector(t(prop.list$Proportions)) 76 | Samples <- rep(colnames(prop.list$Proportions), nrow(prop.list$Proportions)) 77 | Clusters <- rep(rownames(prop.list$Proportions), 78 | each=ncol(prop.list$Proportions)) 79 | 80 | plotdf <- data.frame(Samples=Samples, Clusters=Clusters, 81 | Proportions=Proportions) 82 | 83 | ggplot(plotdf,aes(x=Samples,y=Proportions,fill=Clusters)) + 84 | geom_bar(stat="identity") + 85 | theme(axis.text.x = element_text(size=12), 86 | axis.text.y = element_text(size=12), 87 | axis.title = element_text(size=14), 88 | legend.text = element_text(size=12), 89 | legend.title = element_text(size=14)) 90 | 91 | } 92 | -------------------------------------------------------------------------------- /R/plotCellTypePropsMeanVar.R: -------------------------------------------------------------------------------- 1 | #' Plot cell type proportions versus variances 2 | #' 3 | #' This function returns a plot of the log10(proportion) versus log10(variance) 4 | #' given a matrix of cell type counts. The rows are the clusters/cell types and 5 | #' the columns are the samples. 6 | #' 7 | #' The expected variance under a binomial distribution is shown in the solid 8 | #' line, and the points represent the observed variance for each cell type/row 9 | #' in the counts table. The blue line shows the empirical Bayes variance 10 | #' that is used in \code{propeller}. 11 | #' 12 | #' The mean and variance for each cell type is calculated across all samples. 
13 | #' 14 | #' @param x a matrix or table of counts 15 | #' 16 | #' @return a base R plot 17 | #' 18 | #' @importFrom limma lmFit 19 | #' @importFrom limma eBayes 20 | #' @importFrom graphics legend lines par title 21 | #' @importFrom stats lowess 22 | #' 23 | #' @export 24 | #' 25 | #' @author Belinda Phipson 26 | #' 27 | #' @examples 28 | #' library(limma) 29 | #' # Generate some data 30 | #' # Total number of samples 31 | #' nsamp <- 10 32 | #' # True cell type proportions 33 | #' p <- c(0.05, 0.15, 0.35, 0.45) 34 | #' 35 | #' # Parameters for beta distribution 36 | #' a <- 40 37 | #' b <- a*(1-p)/p 38 | #' 39 | #' # Sample total cell counts per sample from negative binomial distribution 40 | #' numcells <- rnbinom(nsamp,size=20,mu=5000) 41 | #' true.p <- matrix(c(rbeta(nsamp,a,b[1]),rbeta(nsamp,a,b[2]), 42 | #' rbeta(nsamp,a,b[3]),rbeta(nsamp,a,b[4])),byrow=TRUE, ncol=nsamp) 43 | #' 44 | #' counts <- matrix(NA,ncol=nsamp, nrow=nrow(true.p)) 45 | #' rownames(counts) <- paste("c",0:(nrow(true.p)-1), sep="") 46 | #' for(j in 1:length(p)){ 47 | #' counts[j,] <- rbinom(nsamp, size=numcells, prob=true.p[j,]) 48 | #' } 49 | #' 50 | #' plotCellTypePropsMeanVar(counts) 51 | #' 52 | plotCellTypePropsMeanVar <- function(x){ 53 | x <- as.matrix(x) 54 | params <- estimateBetaParamsFromCounts(x) 55 | tot.cells <- colSums(x) 56 | props <- t(t(x)/tot.cells) 57 | 58 | varp <- params$pi*(1-params$pi)/params$n 59 | var.props <- apply(props,1,var) 60 | fitp <- lmFit(props) 61 | fitp <- eBayes(fitp, robust=TRUE) 62 | 63 | ylimits.min <- min(log10(var.props),log10(varp), log10(fitp$s2.post)) 64 | ylimits.max <- max(log10(var.props),log10(varp), log10(fitp$s2.post)) 65 | 66 | # par(mfrow=c(1,1)) 67 | plot(log10(rowMeans(props)), log10(var.props), pch=16, cex=2, 68 | xlab="log10(proportion)", ylab="log10(variance)", 69 | ylim=c(ylimits.min,ylimits.max), cex.lab=1.5, cex.axis=1.5) 70 | lines(lowess(log10(rowMeans(props)),log10(varp))) 71 | 72 | lines(lowess(log10(rowMeans(props)), 
log10(fitp$s2.post)), lwd=2, col=4) 73 | legend("bottomright", legend=c("Empirical Bayes variance","Binomial variance"), 74 | col=c(4,1), lty=1, lwd=2) 75 | 76 | title("Mean-variance relationship: cell type proportions", cex.main=1.5) 77 | } 78 | -------------------------------------------------------------------------------- /R/preprocess.R: -------------------------------------------------------------------------------- 1 | #' Pre-processing function for sex classification 2 | #' 3 | #' The purpose of this function is to process a single cell counts matrix into 4 | #' the appropriate format for the \code{classifySex} function. 5 | #' 6 | #' This function will filter out cells that are unable to be classified due to 7 | #' zero counts on *XIST/Xist* and all of the Y chromosome genes. If 8 | #' \code{qc=TRUE} additional cells are removed as identified by the 9 | #' \code{perCellQCMetrics} and \code{quickPerCellQC} functions from the 10 | #' \code{scuttle} package. The resulting counts matrix is then log-normalised 11 | #' and scaled. 12 | #' 13 | #' @param x the counts matrix, rows are genes and columns are cells. Row names 14 | #' must be gene symbols. 15 | #' @param genome the genome the data arises from. Current options are 16 | #' human: genome = "Hs" or mouse: genome = "Mm". 17 | #' @param qc logical, indicates whether to perform additional quality control on 18 | #' the cells. qc = TRUE will predict cells that pass quality control only and 19 | #' the filtered cells will not be classified. qc = FALSE will predict every cell 20 | #' except the cells with zero counts on *XIST/Xist* and the sum of the Y genes. 21 | #' Default is TRUE. 
22 | #' 23 | #' @return outputs a list object with the following components 24 | #' \item{tcm.final }{A transposed count matrix where rows are cells and columns 25 | #' are the features used for classification.} 26 | #' \item{data.df }{The normalised and scaled \code{tcm.final} matrix.} 27 | #' \item{discarded.cells }{Character vector of cell IDs for the cells that are 28 | #' discarded when \code{qc=TRUE}.} 29 | #' \item{zero.cells }{Character vector of cell IDs for the cells that can not 30 | #' be classified as male/female due to zero counts on *Xist* and all the 31 | #' Y chromosome genes.} 32 | #' 33 | #' @importFrom AnnotationDbi select 34 | #' @importFrom stringr str_to_title 35 | #' @importFrom scuttle perCellQCMetrics 36 | #' @importFrom scuttle quickPerCellQC 37 | #' @importFrom org.Hs.eg.db org.Hs.eg.db 38 | #' @importFrom org.Mm.eg.db org.Mm.eg.db 39 | #' @export preprocess 40 | #' 41 | #' @examples 42 | #' 43 | #' library(speckle) 44 | #' library(SingleCellExperiment) 45 | #' library(CellBench) 46 | #' library(org.Hs.eg.db) 47 | #' 48 | #' # Get data from CellBench library 49 | #' sc_data <- load_sc_data() 50 | #' sc_10x <- sc_data$sc_10x 51 | #' 52 | #' # Get counts matrix in correct format with gene symbol as rownames 53 | #' # rather than ENSEMBL ID. 
54 | #' counts <- counts(sc_10x) 55 | #' ann <- select(org.Hs.eg.db, keys=rownames(sc_10x), 56 | #' columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL") 57 | #' m <- match(rownames(counts), ann$ENSEMBL) 58 | #' rownames(counts) <- ann$SYMBOL[m] 59 | #' 60 | #' # Preprocess data 61 | #' pro.data <- preprocess(counts, genome="Hs", qc = TRUE) 62 | #' 63 | #' # Look at counts on XIST and superY.all 64 | #' plot(pro.data$tcm.final$XIST, pro.data$tcm.final$superY) 65 | #' 66 | #' # Cells that are identified as low quality 67 | #' pro.data$discarded.cells 68 | #' 69 | #' # Cells with zero counts on XIST and all Y genes 70 | #' pro.data$zero.cells 71 | #' 72 | preprocess<- function(x, genome=genome, qc=qc){ 73 | 74 | x <- as.matrix(x) 75 | row.names(x)<- toupper(row.names(x)) 76 | # genes located in the X chromosome that have been reported to escape 77 | # X-inactivation 78 | # http://bioinf.wehi.edu.au/software/GenderGenes/index.html 79 | Xgenes<- c("ARHGAP4","STS","ARSD", "ARSL", "AVPR2", "BRS3", "S100G", "CHM", 80 | "CLCN4", "DDX3X","EIF1AX","EIF2S3", "GPM6B", "GRPR", "HCFC1", 81 | "L1CAM", "MAOA", "MYCLP1", "NAP1L3", "GPR143", "CDK16", "PLXNB3", 82 | "PRKX", "RBBP7", "RENBP", "RPS4X", "TRAPPC2", "SH3BGRL", "TBL1X", 83 | "UBA1", "KDM6A", "XG", "XIST", "ZFX", "PUDP", "PNPLA4", "USP9X", 84 | "KDM5C", "SMC1A", "NAA10", "OFD1", "IKBKG", "PIR", "INE2", "INE1", 85 | "AP1S2", "GYG2", "MED14", "RAB9A", "ITM2A", "MORF4L2", "CA5B", 86 | "SRPX2", "GEMIN8", "CTPS2", "CLTRN", "NLGN4X", "DUSP21", "ALG13", 87 | "SYAP1", "SYTL4", "FUNDC1", "GAB3", "RIBC1", "FAM9C","CA5BP1") 88 | 89 | # genes belonging to the male-specific region of chromosome Y (unique genes) 90 | # http://bioinf.wehi.edu.au/software/GenderGenes/index.html 91 | Ygenes<-c("AMELY", "DAZ1", "PRKY", "RBMY1A1", "RBMY1HP", "RPS4Y1", "SRY", 92 | "TSPY1", "UTY", "ZFY","KDM5D", "USP9Y", "DDX3Y", "PRY", "XKRY", 93 | "BPY2", "VCY", "CDY1", "EIF1AY", "TMSB4Y","CDY2A", "NLGN4Y", 94 | "PCDH11Y", "HSFY1", "TGIF2LY", "TBL1Y", "RPS4Y2", 
"HSFY2", 95 | "CDY2B", "TXLNGY","CDY1B", "DAZ3", "DAZ2", "DAZ4") 96 | 97 | # build artificial genes 98 | Xgene.set <-Xgenes[Xgenes %in% row.names(x)] 99 | Ygene.set <-Ygenes[Ygenes %in% row.names(x)] 100 | cm.new<-as.data.frame(matrix(rep(0, 3*ncol(x)), ncol = ncol(x),nrow = 3)) 101 | row.names(cm.new) <- c("XIST","superX","superY") 102 | colnames(cm.new) <- colnames(x) 103 | cm.new["XIST", ]<- x["XIST", ] 104 | cm.new["superX", ] <-colSums(x[Xgene.set,]) 105 | cm.new["superY", ] <-colSums(x[Ygene.set,]) 106 | 107 | # if (genome == "Mm"){ 108 | # ann <- suppressWarnings(AnnotationDbi::select(org.Mm.eg.db,keys=str_to_title(row.names(x)), 109 | # columns=c("SYMBOL","GENENAME","CHR"), 110 | # keytype="SYMBOL")) 111 | # }else{ 112 | # ann <- suppressWarnings(AnnotationDbi::select(org.Hs.eg.db,keys=row.names(x), 113 | # columns=c("SYMBOL","GENENAME","CHR"), 114 | # keytype="SYMBOL")) 115 | # } 116 | # # create superY.all 117 | # Ychr.genes<- toupper(unique(ann[which(ann$CHR=="Y"), "SYMBOL"])) 118 | # missing <- setdiff(Ychr.genes, row.names(x)) 119 | # if (length(missing) >0){ 120 | # Ychr.genes <- Ychr.genes[-match(missing, Ychr.genes)] 121 | # } 122 | # cm.new["superY.all", ] <- colSums(x[Ychr.genes,]) 123 | 124 | ############################################################################ 125 | # Pre-processing 126 | # perform simple QC 127 | # keep a copy of library size 128 | discarded.cells <- NA 129 | if (qc == TRUE){ 130 | #data.sce <-SingleCellExperiment(assays = list(counts = x)) 131 | qcstats <- scuttle::perCellQCMetrics(x,subsets=list(Mito=1:100)) 132 | qcfilter <- scuttle::quickPerCellQC(qcstats, 133 | percent_subsets=c("subsets_Mito_percent")) 134 | # save the discarded cells 135 | discarded.cells <- colnames(x[,qcfilter$discard]) 136 | 137 | # cm.new only contains cells that pass the quality control 138 | cm.new <-cm.new[,!qcfilter$discard] 139 | } 140 | 141 | tcm.final <- t(cm.new) 142 | tcm.final <- as.data.frame(tcm.final) 143 | 144 | #Do Not Classify 
145 | zero.cells <- NA 146 | dnc <- tcm.final$superY==0 & tcm.final$superX==0 147 | if(any(dnc)==TRUE){ 148 | zero.cells <- row.names(tcm.final)[dnc] 149 | cat(length(zero.cells), "cell/s are unable to be classified due to an 150 | abundance of zeroes on X and Y chromosome genes\n") 151 | } 152 | tcm.final <- tcm.final[!dnc, ] 153 | 154 | cm.new <- cm.new[,!dnc] 155 | cm.lib.size<- colSums(x[,colnames(cm.new)], na.rm=TRUE) 156 | 157 | # log-normalisation performed for each cell 158 | # scaling performed for each gene 159 | normsca.cm <- normSca(cm.new, lib.size=cm.lib.size) 160 | data.df <- t(normsca.cm) 161 | data.df <- as.data.frame(data.df) 162 | 163 | list(tcm.final=tcm.final, data.df=data.df, discarded.cells=discarded.cells, 164 | zero.cells=zero.cells) 165 | } 166 | 167 | -------------------------------------------------------------------------------- /R/propeller.R: -------------------------------------------------------------------------------- 1 | #' Finding statistically significant differences in cell type proportions 2 | #' 3 | #' Calculates cell type proportions, performs a variance stabilising 4 | #' transformation on the proportions and determines whether the cell type 5 | #' proportions differ significantly between groups using 6 | #' linear modelling. 7 | #' 8 | #' This function will take a \code{SingleCellExperiment} or \code{Seurat} 9 | #' object and extract the \code{group}, \code{sample} and \code{clusters} cell 10 | #' information. The user can either state these factor vectors explicitly in 11 | #' the call to the \code{propeller} function, or internal functions will 12 | #' extract them from the relevant objects. The user must ensure that 13 | #' \code{group} and \code{sample} are columns in the metadata assays of the 14 | #' relevant objects (any combination of upper/lower case is acceptable). For 15 | #' \code{Seurat} objects the clusters are extracted using the \code{Idents} 16 | #' function. 
For \code{SingleCellExperiment} objects, \code{clusters} needs to 17 | #' be a column in the \code{colData} assay. 18 | #' 19 | #' The \code{propeller} function calculates cell type proportions for each 20 | #' biological replicate, performs a variance stabilising transformation on the 21 | #' matrix of proportions and fits a linear model for each cell type or cluster 22 | #' using the \code{limma} framework. There are two options for the 23 | #' transformation: arcsin square root or logit. Propeller tests whether there 24 | #' is a difference in the cell type proportions between multiple groups. 25 | #' If there are only 2 groups, a t-test is used to calculate p-values, and if 26 | #' there are more than 2 groups, an F-test (ANOVA) is used. Cell type 27 | #' proportions of 1 or 0 are accommodated. Benjamini and Hochberg false 28 | #' discovery rates are calculated to account for multiple testing of 29 | #' cell types/clusters. 30 | #' 31 | #' @aliases propeller 32 | #' @param x object of class \code{SingleCellExperiment} or \code{Seurat} 33 | #' @param clusters a factor specifying the cluster or cell type for every cell. 34 | #' For \code{SingleCellExperiment} objects this should correspond to a column 35 | #' called \code{clusters} in the \code{colData} assay. For \code{Seurat} 36 | #' objects this will be extracted by a call to \code{Idents(x)}. 37 | #' @param sample a factor specifying the biological replicate for each cell. 38 | #' For \code{SingleCellExperiment} objects this should correspond to a column 39 | #' called \code{sample} in the \code{colData} assay and for \code{Seurat} 40 | #' objects this should correspond to \code{x$sample}. 41 | #' @param group a factor specifying the groups of interest for performing the 42 | #' differential proportions analysis. For \code{SingleCellExperiment} objects 43 | #' this should correspond to a column called \code{group} in the \code{colData} 44 | #' assay. 
For \code{Seurat} objects this should correspond to \code{x$group}. 45 | #' @param trend logical, if true fits a mean variance trend on the transformed 46 | #' proportions 47 | #' @param robust logical, if true performs robust empirical Bayes shrinkage of 48 | #' the variances 49 | #' @param transform a character scalar specifying which transformation of the 50 | #' proportions to perform. Possible values include "asin" or "logit". Defaults 51 | #' to "logit". 52 | #' 53 | #' @return produces a dataframe of results 54 | #' 55 | #' @importFrom stats p.adjust 56 | #' @export propeller 57 | #' 58 | #' @author Belinda Phipson 59 | #' 60 | #' @seealso \code{\link{propeller.ttest}} \code{\link{propeller.anova}} 61 | #' \code{\link{lmFit}}, \code{\link{eBayes}}, 62 | #' \code{\link{getTransformedProps}} 63 | #' 64 | #' @references Smyth, G.K. (2004). Linear models and empirical Bayes methods 65 | #' for assessing differential expression in microarray experiments. 66 | #' \emph{Statistical Applications in Genetics and Molecular Biology}, Volume 67 | #' \bold{3}, Article 3. 68 | #' 69 | #' Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery 70 | #' rate: a practical and powerful approach to multiple testing. \emph{Journal 71 | #' of the Royal Statistical Society Series}, B, \bold{57}, 289-300. 
72 | #' 73 | #' @examples 74 | #' 75 | #' library(speckle) 76 | #' library(ggplot2) 77 | #' library(limma) 78 | #' 79 | #' # Make up some data 80 | #' # True cell type proportions for 4 samples 81 | #' p_s1 <- c(0.5,0.3,0.2) 82 | #' p_s2 <- c(0.6,0.3,0.1) 83 | #' p_s3 <- c(0.3,0.4,0.3) 84 | #' p_s4 <- c(0.4,0.3,0.3) 85 | #' 86 | #' # Total numbers of cells per sample 87 | #' numcells <- c(1000,1500,900,1200) 88 | #' 89 | #' # Generate cell-level vector for sample info 90 | #' biorep <- rep(c("s1","s2","s3","s4"),numcells) 91 | #' length(biorep) 92 | #' 93 | #' # Numbers of cells for each of the 3 clusters per sample 94 | #' n_s1 <- p_s1*numcells[1] 95 | #' n_s2 <- p_s2*numcells[2] 96 | #' n_s3 <- p_s3*numcells[3] 97 | #' n_s4 <- p_s4*numcells[4] 98 | #' 99 | #' # Assign cluster labels for 4 samples 100 | #' cl_s1 <- rep(c("c0","c1","c2"),n_s1) 101 | #' cl_s2 <- rep(c("c0","c1","c2"),n_s2) 102 | #' cl_s3 <- rep(c("c0","c1","c2"),n_s3) 103 | #' cl_s4 <- rep(c("c0","c1","c2"),n_s4) 104 | #' 105 | #' # Generate cell-level vector for cluster info 106 | #' clust <- c(cl_s1,cl_s2,cl_s3,cl_s4) 107 | #' length(clust) 108 | #' 109 | #' # Assume s1 and s2 belong to group 1 and s3 and s4 belong to group 2 110 | #' grp <- rep(c("grp1","grp2"),c(sum(numcells[1:2]),sum(numcells[3:4]))) 111 | #' 112 | #' propeller(clusters = clust, sample = biorep, group = grp, 113 | #' robust = FALSE, trend = FALSE, transform="asin") 114 | #' 115 | propeller <- function(x=NULL, clusters=NULL, sample=NULL, group=NULL, 116 | trend=FALSE, robust=TRUE, transform="logit") 117 | # Testing for differences in cell type proportions 118 | # Belinda Phipson 119 | # 29 July 2019 120 | # Modified 22 April 2020 121 | { 122 | 123 | if(is.null(x) & is.null(sample) & is.null(group) & is.null(clusters)) 124 | stop("Please provide either a SingleCellExperiment object or Seurat 125 | object with required annotation metadata, or explicitly provide 126 | clusters, sample and group information") 127 | 128 | 
if((is.null(clusters) | is.null(sample) | is.null(group)) & !is.null(x)){ 129 | # Extract cluster, sample and group info from SCE object 130 | if(is(x,"SingleCellExperiment")) 131 | y <- .extractSCE(x) 132 | 133 | # Extract cluster, sample and group info from Seurat object 134 | if(is(x,"Seurat")) 135 | y <- .extractSeurat(x) 136 | 137 | clusters <- y$clusters 138 | sample <- y$sample 139 | group <- y$group 140 | } 141 | 142 | if(is.null(transform)) transform <- "logit" 143 | 144 | # Get transformed proportions 145 | prop.list <- getTransformedProps(clusters, sample, transform) 146 | 147 | # Calculate baseline proportions for each cluster 148 | baseline.props <- table(clusters)/sum(table(clusters)) 149 | 150 | # Collapse group information 151 | group.coll <- table(sample, group) 152 | 153 | design <- matrix(as.integer(group.coll != 0), ncol=ncol(group.coll)) 154 | colnames(design) <- colnames(group.coll) 155 | 156 | if(ncol(design)==2){ 157 | message("group variable has 2 levels, t-tests will be performed") 158 | contrasts <- c(1,-1) 159 | out <- propeller.ttest(prop.list, design, contrasts=contrasts, 160 | robust=robust, trend=trend, sort=FALSE) 161 | out <- data.frame(BaselineProp=baseline.props,out) 162 | o <- order(out$P.Value) 163 | out[o,] 164 | } 165 | else if(ncol(design)>=2){ 166 | message("group variable has > 2 levels, ANOVA will be performed") 167 | coef <- seq_len(ncol(design)) 168 | out <- propeller.anova(prop.list, design, coef=coef, robust=robust, 169 | trend=trend, sort=FALSE) 170 | out <- data.frame(BaselineProp=as.vector(baseline.props),out) 171 | o <- order(out$P.Value) 172 | out[o,] 173 | } 174 | 175 | } 176 | 177 | #' Extract metadata from \code{SingleCellExperiment} object 178 | #' 179 | #' This is an accessor function that extracts cluster, sample and group 180 | #' information for each cell. 
181 | #' 182 | #' @param x object of class \code{SingleCellExperiment} 183 | #' 184 | #' @return a dataframe containing clusters, sample and group 185 | #' 186 | #' @importFrom methods is 187 | #' @importFrom SingleCellExperiment colData 188 | #' 189 | #' @author Belinda Phipson 190 | #' 191 | .extractSCE <- function(x){ 192 | message("extracting sample information from SingleCellExperiment object") 193 | colnames(colData(x)) <- toupper(colnames(colData(x))) 194 | clusters <- factor(colData(x)$CLUSTER) 195 | sample <- factor(colData(x)$SAMPLE) 196 | group <- factor(colData(x)$GROUP) 197 | data.frame(clusters=clusters,sample=sample,group=group) 198 | } 199 | 200 | #' Extract metadata from \code{Seurat} object 201 | #' 202 | #' This is an accessor function that extracts cluster, sample and group 203 | #' information for each cell. 204 | #' 205 | #' @param x object of class \code{Seurat} 206 | #' 207 | #' @return a dataframe containing clusters, sample and group 208 | #' 209 | #' @importFrom Seurat Idents 210 | #' 211 | #' @author Belinda Phipson 212 | #' 213 | .extractSeurat <- function(x){ 214 | message("extracting sample information from Seurat object") 215 | colnames(x@meta.data) <- toupper(colnames(x@meta.data)) 216 | clusters <- factor(Idents(x)) 217 | sample <- factor(x$SAMPLE) 218 | group <- factor(x$GROUP) 219 | data.frame(clusters=clusters,sample=sample,group=group) 220 | } 221 | 222 | 223 | 224 | -------------------------------------------------------------------------------- /R/propeller.anova.R: -------------------------------------------------------------------------------- 1 | #' Performs F-tests for transformed cell type proportions 2 | #' 3 | #' This function is called by \code{propeller} and performs F-tests between 4 | #' multiple experimental groups or conditions (> 2) on transformed cell type 5 | #' proportions. 6 | #' 7 | #' In order to run this function, the user needs to run the 8 | #' \code{getTransformedProps} function first. 
The output from 9 | #' \code{getTransformedProps} is used as input. The \code{propeller.anova} 10 | #' function expects that the design matrix is not in the intercept format. 11 | #' The \code{coef} vector identifies the columns in the design matrix that 12 | #' correspond to the groups being tested. 13 | #' Note that additional confounding covariates can be accounted for as extra 14 | #' columns in the design matrix, but need to come after the group-specific 15 | #' columns. 16 | #' 17 | #' The \code{propeller.anova} function uses the \code{limma} functions 18 | #' \code{lmFit} and \code{eBayes} to extract F statistics and p-values. 19 | #' This has the additional advantage that empirical Bayes shrinkage of the 20 | #' variances is performed. 21 | #' 22 | #' @param prop.list a list object containing two matrices: 23 | #' \code{TransformedProps} and \code{Proportions} 24 | #' @param design a design matrix with rows corresponding to samples and columns 25 | #' to coefficients to be estimated 26 | #' @param coef a vector specifying which columns of the design matrix 27 | #' correspond to the groups to test 28 | #' @param robust logical, should robust variance estimation be used. Defaults to 29 | #' TRUE. 30 | #' @param trend logical, should a trend between means and variances be accounted 31 | #' for. Defaults to FALSE. 32 | #' @param sort logical, should the output be sorted by P-value. 
33 | #' 34 | #' @return produces a dataframe of results 35 | #' 36 | #' @importFrom stats p.adjust 37 | #' @importFrom limma lmFit 38 | #' @importFrom limma eBayes 39 | #' @export propeller.anova 40 | #' 41 | #' @author Belinda Phipson 42 | #' 43 | #'@seealso \code{\link{propeller}}, \code{\link{getTransformedProps}}, 44 | #' \code{\link{lmFit}}, \code{\link{eBayes}} 45 | #' 46 | #' @examples 47 | #' library(speckle) 48 | #' library(ggplot2) 49 | #' library(limma) 50 | #' 51 | #' # Make up some data 52 | #' 53 | #' # True cell type proportions for 4 samples 54 | #' p_s1 <- c(0.5,0.3,0.2) 55 | #' p_s2 <- c(0.6,0.3,0.1) 56 | #' p_s3 <- c(0.3,0.4,0.3) 57 | #' p_s4 <- c(0.4,0.3,0.3) 58 | #' p_s5 <- c(0.8,0.1,0.1) 59 | #' p_s6 <- c(0.75,0.2,0.05) 60 | #' 61 | #' # Total numbers of cells per sample 62 | #' numcells <- c(1000,1500,900,1200,1000,800) 63 | #' 64 | #' # Generate cell-level vector for sample info 65 | #' biorep <- rep(c("s1","s2","s3","s4","s5","s6"),numcells) 66 | #' length(biorep) 67 | #' 68 | #' # Numbers of cells for each of 3 clusters per sample 69 | #' n_s1 <- p_s1*numcells[1] 70 | #' n_s2 <- p_s2*numcells[2] 71 | #' n_s3 <- p_s3*numcells[3] 72 | #' n_s4 <- p_s4*numcells[4] 73 | #' n_s5 <- p_s5*numcells[5] 74 | #' n_s6 <- p_s6*numcells[6] 75 | #' 76 | #' cl_s1 <- rep(c("c0","c1","c2"),n_s1) 77 | #' cl_s2 <- rep(c("c0","c1","c2"),n_s2) 78 | #' cl_s3 <- rep(c("c0","c1","c2"),n_s3) 79 | #' cl_s4 <- rep(c("c0","c1","c2"),n_s4) 80 | #' cl_s5 <- rep(c("c0","c1","c2"),n_s5) 81 | #' cl_s6 <- rep(c("c0","c1","c2"),n_s6) 82 | #' 83 | #' # Generate cell-level vector for cluster info 84 | #' clust <- c(cl_s1,cl_s2,cl_s3,cl_s4,cl_s5,cl_s6) 85 | #' length(clust) 86 | #' 87 | #' prop.list <- getTransformedProps(clusters = clust, sample = biorep) 88 | #' 89 | #' # Assume s1 and s2 belong to group A, s3 and s4 belong to group B, s5 and 90 | #' # s6 belong to group C 91 | #' grp <- rep(c("A","B","C"), each=2) 92 | #' 93 | #' # Make sure design matrix does not have an 
intercept term 94 | #' design <- model.matrix(~0+grp) 95 | #' design 96 | #' 97 | #' propeller.anova(prop.list, design=design, coef=c(1,2,3), robust=TRUE, 98 | #' trend=FALSE, sort=TRUE) 99 | #' 100 | propeller.anova <- function(prop.list=prop.list, design=design, coef = coef, 101 | robust=robust, trend=trend, sort=sort) 102 | { 103 | prop.trans <- prop.list$TransformedProps 104 | prop <- prop.list$Proportions 105 | 106 | # get cell type mean proportions ignoring other variables 107 | # this assumes that the design matrix is not in Intercept format 108 | fit.prop <- lmFit(prop, design[,coef]) 109 | 110 | # Change design matrix to intercept format 111 | design[,1] <- 1 112 | colnames(design)[1] <- "Int" 113 | 114 | # Fit linear model taking into account all confounding variables 115 | fit <- lmFit(prop.trans,design) 116 | 117 | # Get F statistics corresponding to group information only 118 | # You have to remove the intercept term for this to work 119 | fit <- eBayes(fit[,coef[-1]], robust=robust, trend=trend) 120 | 121 | # Extract F p-value 122 | p.value <- fit$F.p.value 123 | # and perform FDR adjustment 124 | fdr <- p.adjust(fit$F.p.value, method="BH") 125 | 126 | out <- data.frame(PropMean=fit.prop$coefficients, Fstatistic= fit$F, 127 | P.Value=p.value, FDR=fdr) 128 | if(sort){ 129 | o <- order(out$P.Value) 130 | out[o,] 131 | } 132 | else out 133 | } 134 | -------------------------------------------------------------------------------- /R/propeller.ttest.R: -------------------------------------------------------------------------------- 1 | #' Performs t-tests of transformed cell type proportions 2 | #' 3 | #' This function is called by \code{propeller} and performs t-tests between two 4 | #' experimental groups or conditions on the transformed cell type proportions. 5 | #' 6 | #' In order to run this function, the user needs to run the 7 | #' \code{getTransformedProps} function first. The output from 8 | #' \code{getTransformedProps} is used as input. 
The \code{propeller.ttest} 9 | #' function expects that the design matrix is not in the intercept format 10 | #' and a contrast vector needs to be supplied. This contrast vector will 11 | #' identify the two groups to be tested. Note that additional confounding 12 | #' covariates can be accounted for as extra columns in the design matrix after 13 | #' the group-specific columns. 14 | #' 15 | #' The \code{propeller.ttest} function uses the \code{limma} functions 16 | #' \code{lmFit}, \code{contrasts.fit} and \code{eBayes}, which have the additional 17 | #' advantage that empirical Bayes shrinkage of the variances is performed. 18 | #' 19 | #' @param prop.list a list object containing two matrices: 20 | #' \code{TransformedProps} and \code{Proportions} 21 | #' @param design a design matrix with rows corresponding to samples and columns 22 | #' to coefficients to be estimated 23 | #' @param contrasts a vector specifying which columns of the design matrix 24 | #' correspond to the two groups to test 25 | #' @param robust logical, should robust variance estimation be used. Defaults to 26 | #' TRUE. 27 | #' @param trend logical, should a trend between means and variances be accounted 28 | #' for. Defaults to FALSE. 29 | #' @param sort logical, should the output be sorted by P-value. 
30 | #' 31 | #' @return produces a dataframe of results 32 | #' 33 | #' @importFrom stats p.adjust 34 | #' @importFrom limma lmFit 35 | #' @importFrom limma eBayes 36 | #' @importFrom limma contrasts.fit 37 | #' @export propeller.ttest 38 | #' 39 | #' @author Belinda Phipson 40 | #' 41 | #' @seealso \code{\link{propeller}}, \code{\link{getTransformedProps}}, 42 | #' \code{\link{lmFit}}, \code{\link{contrasts.fit}}, \code{\link{eBayes}} 43 | #' 44 | #' @examples 45 | #' library(speckle) 46 | #' library(ggplot2) 47 | #' library(limma) 48 | #' 49 | #' # Make up some data 50 | #' 51 | #' # True cell type proportions for 4 samples 52 | #' p_s1 <- c(0.5,0.3,0.2) 53 | #' p_s2 <- c(0.6,0.3,0.1) 54 | #' p_s3 <- c(0.3,0.4,0.3) 55 | #' p_s4 <- c(0.4,0.3,0.3) 56 | #' 57 | #' # Total numbers of cells per sample 58 | #' numcells <- c(1000,1500,900,1200) 59 | #' 60 | #' # Generate cell-level vector for sample info 61 | #' biorep <- rep(c("s1","s2","s3","s4"),numcells) 62 | #' length(biorep) 63 | #' 64 | #' # Numbers of cells for each of 3 clusters per sample 65 | #' n_s1 <- p_s1*numcells[1] 66 | #' n_s2 <- p_s2*numcells[2] 67 | #' n_s3 <- p_s3*numcells[3] 68 | #' n_s4 <- p_s4*numcells[4] 69 | #' 70 | #' cl_s1 <- rep(c("c0","c1","c2"),n_s1) 71 | #' cl_s2 <- rep(c("c0","c1","c2"),n_s2) 72 | #' cl_s3 <- rep(c("c0","c1","c2"),n_s3) 73 | #' cl_s4 <- rep(c("c0","c1","c2"),n_s4) 74 | #' 75 | #' # Generate cell-level vector for cluster info 76 | #' clust <- c(cl_s1,cl_s2,cl_s3,cl_s4) 77 | #' length(clust) 78 | #' 79 | #' prop.list <- getTransformedProps(clusters = clust, sample = biorep) 80 | #' 81 | #' # Assume s1 and s2 belong to group 1 and s3 and s4 belong to group 2 82 | #' grp <- rep(c("A","B"), each=2) 83 | #' 84 | #' design <- model.matrix(~0+grp) 85 | #' design 86 | #' 87 | #' # Compare Grp A to B 88 | #' contrasts <- c(1,-1) 89 | #' 90 | #' propeller.ttest(prop.list, design=design, contrasts=contrasts, robust=TRUE, 91 | #' trend=FALSE, sort=TRUE) 92 | #' 93 | #' # Pretend 
additional sex variable exists and we want to control for it 94 | #' # in the linear model 95 | #' sex <- rep(c(0,1),2) 96 | #' des.sex <- model.matrix(~0+grp+sex) 97 | #' des.sex 98 | #' 99 | #' # Compare Grp A to B 100 | #' cont.sex <- c(1,-1,0) 101 | #' 102 | #' propeller.ttest(prop.list, design=des.sex, contrasts=cont.sex, robust=TRUE, 103 | #' trend=FALSE, sort=TRUE) 104 | #' 105 | #' 106 | propeller.ttest <- function(prop.list=prop.list, design=design, 107 | contrasts=contrasts, robust=robust, trend=trend, 108 | sort=sort) 109 | { 110 | prop.trans <- prop.list$TransformedProps 111 | prop <- prop.list$Proportions 112 | 113 | fit <- lmFit(prop.trans, design) 114 | fit.cont <- contrasts.fit(fit, contrasts=contrasts) 115 | fit.cont <- eBayes(fit.cont, robust=robust, trend=trend) 116 | 117 | # Get mean cell type proportions and relative risk for output 118 | # If no confounding variable included in design matrix 119 | if(length(contrasts)==2){ 120 | fit.prop <- lmFit(prop, design) 121 | z <- apply(fit.prop$coefficients, 1, function(x) x^contrasts) 122 | RR <- apply(z, 2, prod) 123 | } 124 | # If confounding variables included in design matrix exclude them 125 | else{ 126 | new.des <- design[,contrasts!=0] 127 | fit.prop <- lmFit(prop,new.des) 128 | new.cont <- contrasts[contrasts!=0] 129 | z <- apply(fit.prop$coefficients, 1, function(x) x^new.cont) 130 | RR <- apply(z, 2, prod) 131 | } 132 | 133 | fdr <- p.adjust(fit.cont$p.value, method="BH") 134 | 135 | out <- data.frame(PropMean=fit.prop$coefficients, PropRatio=RR, 136 | Tstatistic=fit.cont$t[,1], P.Value=fit.cont$p.value[,1], 137 | FDR=fdr) 138 | if(sort){ 139 | o <- order(out$P.Value) 140 | out[o,] 141 | } 142 | else out 143 | } 144 | -------------------------------------------------------------------------------- /R/speckle-package.R: -------------------------------------------------------------------------------- 1 | #' @keywords internal 2 | "_PACKAGE" 3 | 4 | # The following block is used by usethis to 
automatically manage 5 | # roxygen namespace tags. Modify with care! 6 | ## usethis namespace: start 7 | ## usethis namespace: end 8 | NULL 9 | -------------------------------------------------------------------------------- /R/speckle_example_data.R: -------------------------------------------------------------------------------- 1 | #' Generate example data 2 | #' 3 | #' @return a dataframe containing cell-level information for sample, group and 4 | #' clusters 5 | #' 6 | #' @export 7 | #' 8 | #' @examples 9 | #' 10 | #' speckle_example_data() 11 | #' 12 | speckle_example_data <- function(){ 13 | # Make up some data with two groups, two biological replicates in each 14 | # group and three cell types 15 | 16 | # True cell type proportions for 4 samples 17 | p_s1 <- c(0.5,0.3,0.2) 18 | p_s2 <- c(0.6,0.3,0.1) 19 | p_s3 <- c(0.3,0.4,0.3) 20 | p_s4 <- c(0.4,0.3,0.3) 21 | 22 | # Total numbers of cells per sample 23 | numcells <- c(1000,1500,900,1200) 24 | 25 | # Generate cell-level vector for sample info 26 | biorep <- rep(c("s1","s2","s3","s4"),numcells) 27 | length(biorep) 28 | 29 | # Numbers of cells for each of the 3 clusters per sample 30 | n_s1 <- p_s1*numcells[1] 31 | n_s2 <- p_s2*numcells[2] 32 | n_s3 <- p_s3*numcells[3] 33 | n_s4 <- p_s4*numcells[4] 34 | 35 | # Assign cluster labels for 4 samples 36 | cl_s1 <- rep(c("c0","c1","c2"),n_s1) 37 | cl_s2 <- rep(c("c0","c1","c2"),n_s2) 38 | cl_s3 <- rep(c("c0","c1","c2"),n_s3) 39 | cl_s4 <- rep(c("c0","c1","c2"),n_s4) 40 | 41 | # Generate cell-level vector for cluster info 42 | clust <- c(cl_s1,cl_s2,cl_s3,cl_s4) 43 | length(clust) 44 | 45 | # Assume s1 and s2 belong to group 1 and s3 and s4 belong to group 2 46 | grp <- rep(c("grp1","grp2"),c(sum(numcells[1:2]),sum(numcells[3:4]))) 47 | 48 | data.frame(clusters=clust, samples=biorep, group=grp) 49 | } 50 | -------------------------------------------------------------------------------- /R/sysdata.rda: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Oshlack/speckle/9347bf07b5cdc49ecedc0042d3a007742db01691/R/sysdata.rda -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # speckle 3 | 4 | 5 | 6 | 7 | **Please note that the speckle package has moved to the following repository: 8 | https://github.com/phipsonlab. Version 0.0.3 of 9 | the package will remain here but will no longer be actively updated/maintained. 10 | This version of the package contains the propeller and classifySex functions.** 11 | 12 | ## Citation 13 | The propeller method has now been accepted for publication in *Bioinformatics*. 14 | Please use the following citation when using propeller: \ 15 | Belinda Phipson, Choon Boon Sim, Enzo R Porrello, Alex W Hewitt, Joseph Powell, Alicia Oshlack, propeller: testing for differences in cell type proportions in single cell data, *Bioinformatics*, 2022, btac582, [https://doi.org/10.1093/bioinformatics/btac582](https://doi.org/10.1093/bioinformatics/btac582) 16 | 17 | The preprint is available on [bioRxiv](https://www.biorxiv.org/content/10.1101/2021.11.28.470236v2.full). 18 | 19 | ## Introduction 20 | The speckle package currently contains functions to analyse differences in cell 21 | type proportions in single cell RNA-seq data, and to classify cells as male or 22 | female. As our research into specialised 23 | analyses of single cell data continues we anticipate that the package will be 24 | updated with new functions. 25 | 26 | The propeller function performs statistical tests for differences in cell 27 | type composition in single cell data. In order to test for differences in cell 28 | type proportions between multiple experimental conditions at least one of the 29 | groups must have some form of biological replication (i.e. at least two 30 | samples).
For a two group scenario, the absolute minimum sample size is thus 31 | three. Since there are many technical aspects which can affect cell type 32 | proportion estimates, having biological replication is essential for a 33 | meaningful analysis. 34 | 35 | The propeller function takes a SingleCellExperiment or Seurat object as input, 36 | extracts the relevant cell information, and tests whether the cell type 37 | proportions are statistically significantly different between experimental 38 | conditions/groups. The user can also explicitly pass the cluster, sample and 39 | experimental group information to propeller. P-values and false discovery rates 40 | are reported for each cell type. 41 | 42 | ## Installation 43 | 44 | If you would like to view the speckle vignette, you can install the released 45 | version of speckle from [GitHub](https://github.com/Oshlack/speckle) using the 46 | following commands: 47 | 48 | ``` r 49 | # devtools/remotes won't install Suggested packages from Bioconductor 50 | BiocManager::install(c("CellBench", "BiocStyle", "scater")) 51 | 52 | remotes::install_github("Oshlack/speckle", build_vignettes = TRUE, 53 | dependencies = TRUE) 54 | ``` 55 | 56 | In order to view the vignette for speckle use the following command: 57 | 58 | ``` r 59 | browseVignettes("speckle") 60 | ``` 61 | 62 | If you don't care to view the glorious vignette you can also install speckle as 63 | follows: 64 | 65 | ``` r 66 | library(devtools) 67 | devtools::install_github("Oshlack/speckle") 68 | ``` 69 | 70 | ## Example 71 | 72 | This is a basic example which shows you how to test for differences in cell 73 | type proportions in a two group experimental setup.
74 | 75 | ``` r 76 | library(speckle) 77 | library(limma) 78 | library(ggplot2) 79 | 80 | # Get some example data which has two groups, three cell types and two 81 | # biological replicates in each group 82 | cell_info <- speckle_example_data() 83 | head(cell_info) 84 | 85 | # Run propeller testing for cell type proportion differences between the two 86 | # groups 87 | propeller(clusters = cell_info$clusters, sample = cell_info$samples, 88 | group = cell_info$group) 89 | 90 | # Plot cell type proportions 91 | plotCellTypeProps(clusters=cell_info$clusters, sample=cell_info$samples) 92 | ``` 93 | Please note that this basic implementation is for when you are only modelling 94 | group information. When you have additional covariates that you would like to 95 | account for, please use the propeller.ttest() and propeller.anova() functions 96 | directly. Please read the vignette for examples on how to model a continuous 97 | variable, account for additional covariates and include a random effect in the 98 | analysis. 99 | 100 | 101 | -------------------------------------------------------------------------------- /man/classifySex.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/classifySex.R 3 | \name{classifySex} 4 | \alias{classifySex} 5 | \title{Predict sex of cells in scRNA-seq data} 6 | \usage{ 7 | classifySex(x, genome = NULL, qc = TRUE) 8 | } 9 | \arguments{ 10 | \item{x}{counts matrix, rows correspond to genes and columns correspond to 11 | cells. Row names must be gene symbols.} 12 | 13 | \item{genome}{the genome the data arises from. Current options are 14 | human: genome = "Hs" or mouse: genome = "Mm".} 15 | 16 | \item{qc}{logical, indicates whether to perform quality control or not. 17 | qc = TRUE will predict cells that pass quality control only and the filtered 18 | cells will not be classified. 
qc = FALSE will predict every cell except the 19 | cells with zero counts on *XIST/Xist* and the sum of the Y genes. Default is TRUE.} 20 | } 21 | \value{ 22 | a dataframe with predicted labels for each cell 23 | } 24 | \description{ 25 | This function will predict the sex for each cell in scRNA-seq data. The 26 | classifier is based on logistic regression models that have been trained 27 | on mouse and human single cell RNA-seq data. 28 | } 29 | \details{ 30 | For bulk RNA-seq, checking the sex of the samples for mouse and human 31 | experiments is trivial as we can simply check the expression of *Xist/XIST*. 32 | It is not as simple for single cell RNA-seq data as the number of counts 33 | measured per gene and per cell is often quite low. Simply relying on cut-offs 34 | on the expression of genes like *Xist* means that many cells are unable to 35 | be classified. Hence we have developed a classifier based on a combination of 36 | X- and Y-linked genes in order to accurately predict the sex of each cell. 37 | 38 | Cells with zero counts on Xist and the sum of the Y chromosome genes will 39 | not be classified as there is simply not enough information to accurately 40 | classify as Male/Female, and NAs will be returned. In addition, the user has 41 | the option to perform quality control on the data first, by specifying 42 | \code{qc=TRUE}, which will not classify cells that are deemed low-quality. 
43 | } 44 | \examples{ 45 | 46 | library(speckle) 47 | library(SingleCellExperiment) 48 | library(CellBench) 49 | library(org.Hs.eg.db) 50 | 51 | sc_data <- load_sc_data() 52 | sc_10x <- sc_data$sc_10x 53 | 54 | counts <- counts(sc_10x) 55 | ann <- select(org.Hs.eg.db, keys=rownames(sc_10x), 56 | columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL") 57 | m <- match(rownames(counts), ann$ENSEMBL) 58 | rownames(counts) <- ann$SYMBOL[m] 59 | 60 | sex <- classifySex(counts, genome="Hs") 61 | 62 | table(sex$prediction) 63 | boxplot(counts["XIST",]~sex$prediction) 64 | 65 | } 66 | \author{ 67 | Xinyi Jin 68 | } 69 | -------------------------------------------------------------------------------- /man/dot-extractSCE.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/propeller.R 3 | \name{.extractSCE} 4 | \alias{.extractSCE} 5 | \title{Extract metadata from \code{SingleCellExperiment} object} 6 | \usage{ 7 | .extractSCE(x) 8 | } 9 | \arguments{ 10 | \item{x}{object of class \code{SingleCellExperiment}} 11 | } 12 | \value{ 13 | a dataframe containing clusters, sample and group 14 | } 15 | \description{ 16 | This is an accessor function that extracts cluster, sample and group 17 | information for each cell. 
18 | } 19 | \author{ 20 | Belinda Phipson 21 | } 22 | -------------------------------------------------------------------------------- /man/dot-extractSeurat.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/propeller.R 3 | \name{.extractSeurat} 4 | \alias{.extractSeurat} 5 | \title{Extract metadata from \code{Seurat} object} 6 | \usage{ 7 | .extractSeurat(x) 8 | } 9 | \arguments{ 10 | \item{x}{object of class \code{Seurat}} 11 | } 12 | \value{ 13 | a dataframe containing clusters, sample and group 14 | } 15 | \description{ 16 | This is an accessor function that extracts cluster, sample and group 17 | information for each cell. 18 | } 19 | \author{ 20 | Belinda Phipson 21 | } 22 | -------------------------------------------------------------------------------- /man/estimateBetaParam.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/estimateBetaParam.R 3 | \name{estimateBetaParam} 4 | \alias{estimateBetaParam} 5 | \title{Estimate the parameters of a Beta distribution} 6 | \usage{ 7 | estimateBetaParam(x) 8 | } 9 | \arguments{ 10 | \item{x}{a vector of proportions.} 11 | } 12 | \value{ 13 | a list object with the estimate of alpha in \code{a} and beta in 14 | \code{b}. 15 | } 16 | \description{ 17 | This function estimates the two parameters of the Beta distribution, alpha 18 | and beta, given a vector of proportions. It uses the method of moments to 19 | do this. 
20 | } 21 | \examples{ 22 | # Generate proportions from a beta distribution 23 | props <- rbeta(1000, shape1=2, shape2=10) 24 | estimateBetaParam(props) 25 | 26 | } 27 | \author{ 28 | Belinda Phipson 29 | } 30 | -------------------------------------------------------------------------------- /man/estimateBetaParamsFromCounts.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/estimateBetaParamsFromCounts.R 3 | \name{estimateBetaParamsFromCounts} 4 | \alias{estimateBetaParamsFromCounts} 5 | \title{Estimate parameters of a Beta distribution from counts} 6 | \usage{ 7 | estimateBetaParamsFromCounts(x) 8 | } 9 | \arguments{ 10 | \item{x}{a matrix of counts} 11 | } 12 | \value{ 13 | outputs a list object with the following components 14 | \item{n }{Normalised library size} 15 | \item{alpha }{a vector of alpha parameters for the Beta distribution for 16 | each cell type} 17 | \item{beta }{vector of beta parameters for the Beta distribution for 18 | each cell type} 19 | \item{pi }{Estimate of the true proportion for each cell type} 20 | \item{dispersion }{Dispersion estimates for each cell type} 21 | \item{var }{Variance estimates for each cell type} 22 | } 23 | \description{ 24 | This function estimates the two parameters of the Beta distribution, alpha 25 | and beta for each cell type. The input is a matrix of cell type counts, 26 | where the rows are the cell types/clusters and the columns are the samples. 27 | } 28 | \details{ 29 | This function is called from the plotting function \code{plotCellTypeMeanVar} 30 | in order to estimate the variance for the Beta-Binomial distribution for 31 | each cell type. 
32 | } 33 | \examples{ 34 | data <- speckle_example_data() 35 | x <- table(data$clusters, data$samples) 36 | estimateBetaParamsFromCounts(x) 37 | 38 | 39 | } 40 | \author{ 41 | Belinda Phipson 42 | } 43 | -------------------------------------------------------------------------------- /man/getTransformedProps.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/getTransformedProps.R 3 | \name{getTransformedProps} 4 | \alias{getTransformedProps} 5 | \title{Calculates and transforms cell type proportions} 6 | \usage{ 7 | getTransformedProps(clusters = clusters, sample = sample, transform = NULL) 8 | } 9 | \arguments{ 10 | \item{clusters}{a factor specifying the cluster or cell type for every cell.} 11 | 12 | \item{sample}{a factor specifying the biological replicate for every cell.} 13 | 14 | \item{transform}{a character scalar specifying which transformation of the 15 | proportions to perform. Possible values include "asin" or "logit". Defaults 16 | to "asin".} 17 | } 18 | \value{ 19 | outputs a list object with the following components 20 | \item{Counts }{A matrix of cell type counts with 21 | the rows corresponding to the clusters/cell types and the columns 22 | corresponding to the biological replicates/samples.} 23 | \item{TransformedProps }{A matrix of transformed cell type proportions with 24 | the rows corresponding to the clusters/cell types and the columns 25 | corresponding to the biological replicates/samples.} 26 | \item{Proportions }{A matrix of cell type proportions with the rows 27 | corresponding to the clusters/cell types and the columns corresponding to 28 | the biological replicates/samples.} 29 | } 30 | \description{ 31 | Calculates cell type proportions based on clusters/cell types and sample 32 | information and performs a variance stabilising transformation on the 33 | proportions.
34 | } 35 | \details{ 36 | This function is called by the \code{propeller} function and calculates cell 37 | type proportions and performs an arcsin-square root transformation. 38 | } 39 | \examples{ 40 | 41 | library(speckle) 42 | library(ggplot2) 43 | library(limma) 44 | 45 | # Make up some data 46 | 47 | # True cell type proportions for 4 samples 48 | p_s1 <- c(0.5,0.3,0.2) 49 | p_s2 <- c(0.6,0.3,0.1) 50 | p_s3 <- c(0.3,0.4,0.3) 51 | p_s4 <- c(0.4,0.3,0.3) 52 | 53 | # Total numbers of cells per sample 54 | numcells <- c(1000,1500,900,1200) 55 | 56 | # Generate cell-level vector for sample info 57 | biorep <- rep(c("s1","s2","s3","s4"),numcells) 58 | length(biorep) 59 | 60 | # Numbers of cells for each of 3 clusters per sample 61 | n_s1 <- p_s1*numcells[1] 62 | n_s2 <- p_s2*numcells[2] 63 | n_s3 <- p_s3*numcells[3] 64 | n_s4 <- p_s4*numcells[4] 65 | 66 | cl_s1 <- rep(c("c0","c1","c2"),n_s1) 67 | cl_s2 <- rep(c("c0","c1","c2"),n_s2) 68 | cl_s3 <- rep(c("c0","c1","c2"),n_s3) 69 | cl_s4 <- rep(c("c0","c1","c2"),n_s4) 70 | 71 | # Generate cell-level vector for cluster info 72 | clust <- c(cl_s1,cl_s2,cl_s3,cl_s4) 73 | length(clust) 74 | 75 | getTransformedProps(clusters = clust, sample = biorep) 76 | 77 | } 78 | \seealso{ 79 | \code{\link{propeller}} 80 | } 81 | \author{ 82 | Belinda Phipson 83 | } 84 | -------------------------------------------------------------------------------- /man/ggplotColors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/ggplotColors.R 3 | \name{ggplotColors} 4 | \alias{ggplotColors} 5 | \title{Output a vector of colours based on the ggplot colour scheme} 6 | \usage{ 7 | ggplotColors(g) 8 | } 9 | \arguments{ 10 | \item{g}{the number of colours to be generated.} 11 | } 12 | \value{ 13 | a vector with the names of the colours. 
14 | } 15 | \description{ 16 | This function takes as input the number of colours the user would like, and 17 | outputs a vector of colours in the ggplot colour scheme. 18 | } 19 | \examples{ 20 | # Generate a palette of 6 colours 21 | cols <- ggplotColors(6) 22 | cols 23 | 24 | # Generate some count data 25 | y <- matrix(rnbinom(600, mu=100, size=1), ncol=6) 26 | 27 | par(mfrow=c(1,1)) 28 | boxplot(y, col=cols) 29 | 30 | } 31 | \author{ 32 | Belinda Phipson 33 | } 34 | -------------------------------------------------------------------------------- /man/normCounts.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/normCounts.R 3 | \name{normCounts} 4 | \alias{normCounts} 5 | \title{Normalise a counts matrix to the median library size} 6 | \usage{ 7 | normCounts(x, log = FALSE, prior.count = 0.5, lib.size = NULL) 8 | } 9 | \arguments{ 10 | \item{x}{a \code{DGEList} object or matrix of counts.} 11 | 12 | \item{log}{logical, indicates whether the output should be on the log2 scale 13 | or counts scale. Default is FALSE.} 14 | 15 | \item{prior.count}{The prior count to add if the data is log2 normalised. 16 | Default is a small count of 0.5.} 17 | 18 | \item{lib.size}{a vector of library sizes to be used during the normalisation 19 | step. Default is NULL and will be computed from the counts matrix.} 20 | } 21 | \value{ 22 | a matrix of normalised counts 23 | } 24 | \description{ 25 | This function takes a \code{DGEList} object or matrix of counts and 26 | normalises the counts to the median library size. This puts the normalised 27 | counts on a similar scale to the original counts. 28 | } 29 | \details{ 30 | If the input is a DGEList object, the normalisation factors in 31 | \code{norm.factors} are taken into account in the normalisation. 
The prior 32 | counts are added proportionally to the library size. 33 | } 34 | \examples{ 35 | # Simulate some data from a negative binomial distribution with mean equal 36 | # to 100 and dispersion set to 1. Simulate 1000 genes and 6 samples. 37 | y <- matrix(rnbinom(6000, mu = 100, size = 1), ncol = 6) 38 | 39 | # Normalise the counts 40 | norm.y <- normCounts(y) 41 | 42 | # Return log2 normalised counts 43 | lnorm.y <- normCounts(y, log=TRUE) 44 | 45 | # Return log2 normalised counts with prior.count = 2 46 | lnorm.y2 <- normCounts(y, log=TRUE, prior.count=2) 47 | 48 | par(mfrow=c(1,2)) 49 | boxplot(norm.y, main="Normalised counts") 50 | boxplot(lnorm.y, main="Log2-normalised counts") 51 | 52 | } 53 | \author{ 54 | Belinda Phipson 55 | } 56 | -------------------------------------------------------------------------------- /man/normSca.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/normSca.R 3 | \name{normSca} 4 | \alias{normSca} 5 | \title{Normalise and scale counts matrix} 6 | \usage{ 7 | normSca(x, lib.size = lib.size, log = TRUE, prior.count = 0.5) 8 | } 9 | \arguments{ 10 | \item{x}{a matrix of counts} 11 | 12 | \item{lib.size}{the library size} 13 | 14 | \item{log}{logical, indicates whether the output should be on the log2 scale 15 | or counts scale. Default is TRUE.} 16 | 17 | \item{prior.count}{The prior count to add if the data is log2 normalised. 18 | Default is a small count of 0.5.} 19 | } 20 | \value{ 21 | a dataframe of log-normalised and scaled counts 22 | } 23 | \description{ 24 | This function is called by the \code{preprocess} function and performs 25 | log-normalisation and scaling of a counts matrix.
26 | } 27 | \examples{ 28 | 29 | y <- matrix(rnbinom(1000, size=2, mu= 20),ncol=10) 30 | colnames(y)<- paste("Cell",1:10, sep="") 31 | row.names(y)<-paste("Gene",1:100,sep="") 32 | norm.data <- normSca(y,lib.size=colSums(y)) 33 | 34 | #Visualise the counts vs scaled data 35 | boxplot(y) 36 | boxplot(norm.data) 37 | 38 | } 39 | -------------------------------------------------------------------------------- /man/plotCellTypeMeanVar.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotCellTypeMeanVar.R 3 | \name{plotCellTypeMeanVar} 4 | \alias{plotCellTypeMeanVar} 5 | \title{Plot cell type counts means versus variances} 6 | \usage{ 7 | plotCellTypeMeanVar(x) 8 | } 9 | \arguments{ 10 | \item{x}{a matrix or table of counts} 11 | } 12 | \value{ 13 | a base R plot 14 | } 15 | \description{ 16 | This function returns a plot of the log10(mean) versus log10(variance) of 17 | the cell type counts. The function takes a matrix of cell type counts as 18 | input. The rows are the clusters/cell types and the columns are the samples. 19 | } 20 | \details{ 21 | The expected variance under a binomial distribution is shown in the solid 22 | line, and the points represent the observed variance for each cell type/row 23 | in the counts table. The expected variances under different model assumptions 24 | are shown in the different coloured lines. 25 | 26 | The mean and variance for each cell type are calculated across all samples.
27 | } 28 | \examples{ 29 | library(edgeR) 30 | # Generate some data 31 | # Total number of samples 32 | nsamp <- 10 33 | # True cell type proportions 34 | p <- c(0.05, 0.15, 0.35, 0.45) 35 | 36 | # Parameters for beta distribution 37 | a <- 40 38 | b <- a*(1-p)/p 39 | 40 | # Sample total cell counts per sample from negative binomial distribution 41 | numcells <- rnbinom(nsamp,size=20,mu=5000) 42 | true.p <- matrix(c(rbeta(nsamp,a,b[1]),rbeta(nsamp,a,b[2]), 43 | rbeta(nsamp,a,b[3]),rbeta(nsamp,a,b[4])),byrow=TRUE, ncol=nsamp) 44 | 45 | counts <- matrix(NA,ncol=nsamp, nrow=nrow(true.p)) 46 | rownames(counts) <- paste("c",0:(nrow(true.p)-1), sep="") 47 | for(j in 1:length(p)){ 48 | counts[j,] <- rbinom(nsamp, size=numcells, prob=true.p[j,]) 49 | } 50 | 51 | plotCellTypeMeanVar(counts) 52 | 53 | } 54 | \author{ 55 | Belinda Phipson 56 | } 57 | -------------------------------------------------------------------------------- /man/plotCellTypeProps.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotCellTypeProps.R 3 | \name{plotCellTypeProps} 4 | \alias{plotCellTypeProps} 5 | \title{Plot cell type proportions for each sample} 6 | \usage{ 7 | plotCellTypeProps(x = NULL, clusters = NULL, sample = NULL) 8 | } 9 | \arguments{ 10 | \item{x}{object of class \code{SingleCellExperiment} or \code{Seurat}} 11 | 12 | \item{clusters}{a factor specifying the cluster or cell type for every cell. 13 | For \code{SingleCellExperiment} objects this should correspond to a column 14 | called \code{clusters} in the \code{colData} assay. For \code{Seurat} 15 | objects this will be extracted by a call to \code{Idents(x)}.} 16 | 17 | \item{sample}{a factor specifying the biological replicate for each cell. 
18 | For \code{SingleCellExperiment} objects this should correspond to a column 19 | called \code{sample} in the \code{colData} assay and for \code{Seurat} 20 | objects this should correspond to \code{x$sample}.} 21 | } 22 | \value{ 23 | a ggplot2 object 24 | } 25 | \description{ 26 | This is a plotting function that shows the cell type composition for each 27 | sample as a stacked barplot. The \code{plotCellTypeProps} returns a 28 | \code{ggplot2} object enabling the user to make style changes as required. 29 | } 30 | \examples{ 31 | 32 | library(speckle) 33 | library(ggplot2) 34 | library(limma) 35 | 36 | # Generate some fake data from a multinomial distribution 37 | # Group A, 4 samples, 1000 cells in each sample 38 | countsA <- rmultinom(4, size=1000, prob=c(0.1,0.3,0.6)) 39 | colnames(countsA) <- paste("s",1:4,sep="") 40 | 41 | # Group B, 3 samples, 800 cells in each sample 42 | 43 | countsB <- rmultinom(3, size=800, prob=c(0.2,0.05,0.75)) 44 | colnames(countsB) <- paste("s",5:7,sep="") 45 | rownames(countsA) <- rownames(countsB) <- paste("c",0:2,sep="") 46 | 47 | allcounts <- cbind(countsA, countsB) 48 | sample <- c(rep(colnames(allcounts),allcounts[1,]), 49 | rep(colnames(allcounts),allcounts[2,]), 50 | rep(colnames(allcounts),allcounts[3,])) 51 | clust <- rep(rownames(allcounts),rowSums(allcounts)) 52 | 53 | plotCellTypeProps(clusters=clust, sample=sample) 54 | 55 | } 56 | \author{ 57 | Belinda Phipson 58 | } 59 | -------------------------------------------------------------------------------- /man/plotCellTypePropsMeanVar.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/plotCellTypePropsMeanVar.R 3 | \name{plotCellTypePropsMeanVar} 4 | \alias{plotCellTypePropsMeanVar} 5 | \title{Plot cell type proportions versus variances} 6 | \usage{ 7 | plotCellTypePropsMeanVar(x) 8 | } 9 | \arguments{ 10 | \item{x}{a matrix or table of counts} 11 | } 
12 | \value{ 13 | a base R plot 14 | } 15 | \description{ 16 | This function returns a plot of the log10(proportion) versus log10(variance) 17 | given a matrix of cell type counts. The rows are the clusters/cell types and 18 | the columns are the samples. 19 | } 20 | \details{ 21 | The expected variance under a binomial distribution is shown in the solid 22 | line, and the points represent the observed variance for each cell type/row 23 | in the counts table. The blue line shows the empirical Bayes variance 24 | that is used in \code{propeller}. 25 | 26 | The mean and variance for each cell type are calculated across all samples. 27 | } 28 | \examples{ 29 | library(limma) 30 | # Generate some data 31 | # Total number of samples 32 | nsamp <- 10 33 | # True cell type proportions 34 | p <- c(0.05, 0.15, 0.35, 0.45) 35 | 36 | # Parameters for beta distribution 37 | a <- 40 38 | b <- a*(1-p)/p 39 | 40 | # Sample total cell counts per sample from negative binomial distribution 41 | numcells <- rnbinom(nsamp,size=20,mu=5000) 42 | true.p <- matrix(c(rbeta(nsamp,a,b[1]),rbeta(nsamp,a,b[2]), 43 | rbeta(nsamp,a,b[3]),rbeta(nsamp,a,b[4])),byrow=TRUE, ncol=nsamp) 44 | 45 | counts <- matrix(NA,ncol=nsamp, nrow=nrow(true.p)) 46 | rownames(counts) <- paste("c",0:(nrow(true.p)-1), sep="") 47 | for(j in 1:length(p)){ 48 | counts[j,] <- rbinom(nsamp, size=numcells, prob=true.p[j,]) 49 | } 50 | 51 | plotCellTypePropsMeanVar(counts) 52 | 53 | } 54 | \author{ 55 | Belinda Phipson 56 | } 57 | -------------------------------------------------------------------------------- /man/preprocess.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/preprocess.R 3 | \name{preprocess} 4 | \alias{preprocess} 5 | \title{Pre-processing function for sex classification} 6 | \usage{ 7 | preprocess(x, genome = genome, qc = qc) 8 | } 9 | \arguments{ 10 | \item{x}{the counts matrix, rows are
genes and columns are cells. Row names 11 | must be gene symbols.} 12 | 13 | \item{genome}{the genome the data arises from. Current options are 14 | human: genome = "Hs" or mouse: genome = "Mm".} 15 | 16 | \item{qc}{logical, indicates whether to perform additional quality control on 17 | the cells. qc = TRUE will predict cells that pass quality control only and 18 | the filtered cells will not be classified. qc = FALSE will predict every cell 19 | except the cells with zero counts on *XIST/Xist* and the sum of the Y genes. 20 | Default is TRUE.} 21 | } 22 | \value{ 23 | outputs a list object with the following components 24 | \item{tcm.final }{A transposed count matrix where rows are cells and columns 25 | are the features used for classification.} 26 | \item{data.df }{The normalised and scaled \code{tcm.final} matrix.} 27 | \item{discarded.cells }{Character vector of cell IDs for the cells that are 28 | discarded when \code{qc=TRUE}.} 29 | \item{zero.cells }{Character vector of cell IDs for the cells that can not 30 | be classified as male/female due to zero counts on *Xist* and all the 31 | Y chromosome genes.} 32 | } 33 | \description{ 34 | The purpose of this function is to process a single cell counts matrix into 35 | the appropriate format for the \code{classifySex} function. 36 | } 37 | \details{ 38 | This function will filter out cells that are unable to be classified due to 39 | zero counts on *XIST/Xist* and all of the Y chromosome genes. If 40 | \code{qc=TRUE} additional cells are removed as identified by the 41 | \code{perCellQCMetrics} and \code{quickPerCellQC} functions from the 42 | \code{scuttle} package. The resulting counts matrix is then log-normalised 43 | and scaled. 
44 | } 45 | \examples{ 46 | 47 | library(speckle) 48 | library(SingleCellExperiment) 49 | library(CellBench) 50 | library(org.Hs.eg.db) 51 | 52 | # Get data from CellBench library 53 | sc_data <- load_sc_data() 54 | sc_10x <- sc_data$sc_10x 55 | 56 | # Get counts matrix in correct format with gene symbol as rownames 57 | # rather than ENSEMBL ID. 58 | counts <- counts(sc_10x) 59 | ann <- select(org.Hs.eg.db, keys=rownames(sc_10x), 60 | columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL") 61 | m <- match(rownames(counts), ann$ENSEMBL) 62 | rownames(counts) <- ann$SYMBOL[m] 63 | 64 | # Preprocess data 65 | pro.data <- preprocess(counts, genome="Hs", qc = TRUE) 66 | 67 | # Look at counts on XIST and superY.all 68 | plot(pro.data$tcm.final$XIST, pro.data$tcm.final$superY) 69 | 70 | # Cells that are identified as low quality 71 | pro.data$discarded.cells 72 | 73 | # Cells with zero counts on XIST and all Y genes 74 | pro.data$zero.cells 75 | 76 | } 77 | -------------------------------------------------------------------------------- /man/propeller.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/propeller.R 3 | \name{propeller} 4 | \alias{propeller} 5 | \title{Finding statistically significant differences in cell type proportions} 6 | \usage{ 7 | propeller( 8 | x = NULL, 9 | clusters = NULL, 10 | sample = NULL, 11 | group = NULL, 12 | trend = FALSE, 13 | robust = TRUE, 14 | transform = "logit" 15 | ) 16 | } 17 | \arguments{ 18 | \item{x}{object of class \code{SingleCellExperiment} or \code{Seurat}} 19 | 20 | \item{clusters}{a factor specifying the cluster or cell type for every cell. 21 | For \code{SingleCellExperiment} objects this should correspond to a column 22 | called \code{clusters} in the \code{colData} assay. 
For \code{Seurat} 23 | objects this will be extracted by a call to \code{Idents(x)}.} 24 | 25 | \item{sample}{a factor specifying the biological replicate for each cell. 26 | For \code{SingleCellExperiment} objects this should correspond to a column 27 | called \code{sample} in the \code{colData} assay and for \code{Seurat} 28 | objects this should correspond to \code{x$sample}.} 29 | 30 | \item{group}{a factor specifying the groups of interest for performing the 31 | differential proportions analysis. For \code{SingleCellExperiment} objects 32 | this should correspond to a column called \code{group} in the \code{colData} 33 | assay. For \code{Seurat} objects this should correspond to \code{x$group}.} 34 | 35 | \item{trend}{logical, if true fits a mean variance trend on the transformed 36 | proportions} 37 | 38 | \item{robust}{logical, if true performs robust empirical Bayes shrinkage of 39 | the variances} 40 | 41 | \item{transform}{a character scalar specifying which transformation of the 42 | proportions to perform. Possible values are "asin" or "logit". Defaults 43 | to "logit".} 44 | } 45 | \value{ 46 | produces a dataframe of results 47 | } 48 | \description{ 49 | Calculates cell type proportions, performs a variance stabilising 50 | transformation on the proportions and determines whether the cell type 51 | proportions differ significantly between groups using 52 | linear modelling. 53 | } 54 | \details{ 55 | This function will take a \code{SingleCellExperiment} or \code{Seurat} 56 | object and extract the \code{group}, \code{sample} and \code{clusters} cell 57 | information. The user can either state these factor vectors explicitly in 58 | the call to the \code{propeller} function, or internal functions will 59 | extract them from the relevant objects. The user must ensure that 60 | \code{group} and \code{sample} are columns in the metadata assays of the 61 | relevant objects (any combination of upper/lower case is acceptable). 
For 62 | \code{Seurat} objects the clusters are extracted using the \code{Idents} 63 | function. For \code{SingleCellExperiment} objects, \code{clusters} needs to 64 | be a column in the \code{colData} assay. 65 | 66 | The \code{propeller} function calculates cell type proportions for each 67 | biological replicate, performs a variance stabilising transformation on the 68 | matrix of proportions and fits a linear model for each cell type or cluster 69 | using the \code{limma} framework. There are two options for the 70 | transformation: arcsin square root or logit. Propeller tests whether there 71 | is a difference in the cell type proportions between multiple groups. 72 | If there are only 2 groups, a t-test is used to calculate p-values, and if 73 | there are more than 2 groups, an F-test (ANOVA) is used. Cell type 74 | proportions of 1 or 0 are accommodated. Benjamini and Hochberg false 75 | discovery rates are calculated to account for multiple testing of 76 | cell types/clusters. 77 | } 78 | \examples{ 79 | 80 | library(speckle) 81 | library(ggplot2) 82 | library(limma) 83 | 84 | # Make up some data 85 | # True cell type proportions for 4 samples 86 | p_s1 <- c(0.5,0.3,0.2) 87 | p_s2 <- c(0.6,0.3,0.1) 88 | p_s3 <- c(0.3,0.4,0.3) 89 | p_s4 <- c(0.4,0.3,0.3) 90 | 91 | # Total numbers of cells per sample 92 | numcells <- c(1000,1500,900,1200) 93 | 94 | # Generate cell-level vector for sample info 95 | biorep <- rep(c("s1","s2","s3","s4"),numcells) 96 | length(biorep) 97 | 98 | # Numbers of cells for each of the 3 clusters per sample 99 | n_s1 <- p_s1*numcells[1] 100 | n_s2 <- p_s2*numcells[2] 101 | n_s3 <- p_s3*numcells[3] 102 | n_s4 <- p_s4*numcells[4] 103 | 104 | # Assign cluster labels for 4 samples 105 | cl_s1 <- rep(c("c0","c1","c2"),n_s1) 106 | cl_s2 <- rep(c("c0","c1","c2"),n_s2) 107 | cl_s3 <- rep(c("c0","c1","c2"),n_s3) 108 | cl_s4 <- rep(c("c0","c1","c2"),n_s4) 109 | 110 | # Generate cell-level vector for cluster info 111 | clust <- 
c(cl_s1,cl_s2,cl_s3,cl_s4) 112 | length(clust) 113 | 114 | # Assume s1 and s2 belong to group 1 and s3 and s4 belong to group 2 115 | grp <- rep(c("grp1","grp2"),c(sum(numcells[1:2]),sum(numcells[3:4]))) 116 | 117 | propeller(clusters = clust, sample = biorep, group = grp, 118 | robust = FALSE, trend = FALSE, transform="asin") 119 | 120 | } 121 | \references{ 122 | Smyth, G.K. (2004). Linear models and empirical Bayes methods 123 | for assessing differential expression in microarray experiments. 124 | \emph{Statistical Applications in Genetics and Molecular Biology}, Volume 125 | \bold{3}, Article 3. 126 | 127 | Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery 128 | rate: a practical and powerful approach to multiple testing. \emph{Journal 129 | of the Royal Statistical Society, Series B}, \bold{57}, 289-300. 130 | } 131 | \seealso{ 132 | \code{\link{propeller.ttest}}, \code{\link{propeller.anova}}, 133 | \code{\link{lmFit}}, \code{\link{eBayes}}, 134 | \code{\link{getTransformedProps}} 135 | } 136 | \author{ 137 | Belinda Phipson 138 | } 139 | -------------------------------------------------------------------------------- /man/propeller.anova.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/propeller.anova.R 3 | \name{propeller.anova} 4 | \alias{propeller.anova} 5 | \title{Performs F-tests for transformed cell type proportions} 6 | \usage{ 7 | propeller.anova( 8 | prop.list = prop.list, 9 | design = design, 10 | coef = coef, 11 | robust = robust, 12 | trend = trend, 13 | sort = sort 14 | ) 15 | } 16 | \arguments{ 17 | \item{prop.list}{a list object containing two matrices: 18 | \code{TransformedProps} and \code{Proportions}} 19 | 20 | \item{design}{a design matrix with rows corresponding to samples and columns 21 | to coefficients to be estimated} 22 | 23 | \item{coef}{a vector specifying which columns of the design 
matrix 24 | correspond to the groups to test} 25 | 26 | \item{robust}{logical, should robust variance estimation be used. Defaults to 27 | TRUE.} 28 | 29 | \item{trend}{logical, should a trend between means and variances be accounted 30 | for. Defaults to FALSE.} 31 | 32 | \item{sort}{logical, should the output be sorted by P-value.} 33 | } 34 | \value{ 35 | produces a dataframe of results 36 | } 37 | \description{ 38 | This function is called by \code{propeller} and performs F-tests between 39 | multiple experimental groups or conditions (> 2) on transformed cell type 40 | proportions. 41 | } 42 | \details{ 43 | In order to run this function, the user needs to run the 44 | \code{getTransformedProps} function first. The output from 45 | \code{getTransformedProps} is used as input. The \code{propeller.anova} 46 | function expects that the design matrix is not in the intercept format. 47 | The \code{coef} vector identifies the columns in the design matrix that 48 | correspond to the groups being tested. 49 | Note that additional confounding covariates can be accounted for as extra 50 | columns in the design matrix, but need to come after the group-specific 51 | columns. 52 | 53 | The \code{propeller.anova} function uses the \code{limma} functions 54 | \code{lmFit} and \code{eBayes} to extract F statistics and p-values. 55 | This has the additional advantage that empirical Bayes shrinkage of the 56 | variances is performed. 
57 | } 58 | \examples{ 59 | library(speckle) 60 | library(ggplot2) 61 | library(limma) 62 | 63 | # Make up some data 64 | 65 | # True cell type proportions for 4 samples 66 | p_s1 <- c(0.5,0.3,0.2) 67 | p_s2 <- c(0.6,0.3,0.1) 68 | p_s3 <- c(0.3,0.4,0.3) 69 | p_s4 <- c(0.4,0.3,0.3) 70 | p_s5 <- c(0.8,0.1,0.1) 71 | p_s6 <- c(0.75,0.2,0.05) 72 | 73 | # Total numbers of cells per sample 74 | numcells <- c(1000,1500,900,1200,1000,800) 75 | 76 | # Generate cell-level vector for sample info 77 | biorep <- rep(c("s1","s2","s3","s4","s5","s6"),numcells) 78 | length(biorep) 79 | 80 | # Numbers of cells for each of 3 clusters per sample 81 | n_s1 <- p_s1*numcells[1] 82 | n_s2 <- p_s2*numcells[2] 83 | n_s3 <- p_s3*numcells[3] 84 | n_s4 <- p_s4*numcells[4] 85 | n_s5 <- p_s5*numcells[5] 86 | n_s6 <- p_s6*numcells[6] 87 | 88 | cl_s1 <- rep(c("c0","c1","c2"),n_s1) 89 | cl_s2 <- rep(c("c0","c1","c2"),n_s2) 90 | cl_s3 <- rep(c("c0","c1","c2"),n_s3) 91 | cl_s4 <- rep(c("c0","c1","c2"),n_s4) 92 | cl_s5 <- rep(c("c0","c1","c2"),n_s5) 93 | cl_s6 <- rep(c("c0","c1","c2"),n_s6) 94 | 95 | # Generate cell-level vector for cluster info 96 | clust <- c(cl_s1,cl_s2,cl_s3,cl_s4,cl_s5,cl_s6) 97 | length(clust) 98 | 99 | prop.list <- getTransformedProps(clusters = clust, sample = biorep) 100 | 101 | # Assume s1 and s2 belong to group A, s3 and s4 belong to group B, s5 and 102 | # s6 belong to group C 103 | grp <- rep(c("A","B","C"), each=2) 104 | 105 | # Make sure design matrix does not have an intercept term 106 | design <- model.matrix(~0+grp) 107 | design 108 | 109 | propeller.anova(prop.list, design=design, coef=c(1,2,3), robust=TRUE, 110 | trend=FALSE, sort=TRUE) 111 | 112 | } 113 | \seealso{ 114 | \code{\link{propeller}}, \code{\link{getTransformedProps}}, 115 | \code{\link{lmFit}}, \code{\link{eBayes}} 116 | } 117 | \author{ 118 | Belinda Phipson 119 | } 120 | -------------------------------------------------------------------------------- /man/propeller.ttest.Rd: 
-------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/propeller.ttest.R 3 | \name{propeller.ttest} 4 | \alias{propeller.ttest} 5 | \title{Performs t-tests of transformed cell type proportions} 6 | \usage{ 7 | propeller.ttest( 8 | prop.list = prop.list, 9 | design = design, 10 | contrasts = contrasts, 11 | robust = robust, 12 | trend = trend, 13 | sort = sort 14 | ) 15 | } 16 | \arguments{ 17 | \item{prop.list}{a list object containing two matrices: 18 | \code{TransformedProps} and \code{Proportions}} 19 | 20 | \item{design}{a design matrix with rows corresponding to samples and columns 21 | to coefficients to be estimated} 22 | 23 | \item{contrasts}{a vector specifying which columns of the design matrix 24 | correspond to the two groups to test} 25 | 26 | \item{robust}{logical, should robust variance estimation be used. Defaults to 27 | TRUE.} 28 | 29 | \item{trend}{logical, should a trend between means and variances be accounted 30 | for. Defaults to FALSE.} 31 | 32 | \item{sort}{logical, should the output be sorted by P-value.} 33 | } 34 | \value{ 35 | produces a dataframe of results 36 | } 37 | \description{ 38 | This function is called by \code{propeller} and performs t-tests between two 39 | experimental groups or conditions on the transformed cell type proportions. 40 | } 41 | \details{ 42 | In order to run this function, the user needs to run the 43 | \code{getTransformedProps} function first. The output from 44 | \code{getTransformedProps} is used as input. The \code{propeller.ttest} 45 | function expects that the design matrix is not in the intercept format 46 | and a contrast vector needs to be supplied. This contrast vector will 47 | identify the two groups to be tested. Note that additional confounding 48 | covariates can be accounted for as extra columns in the design matrix after 49 | the group-specific columns. 
50 | 51 | The \code{propeller.ttest} function uses the \code{limma} functions 52 | \code{lmFit}, \code{contrasts.fit} and \code{eBayes}, which has the additional 53 | advantage that empirical Bayes shrinkage of the variances is performed. 54 | } 55 | \examples{ 56 | library(speckle) 57 | library(ggplot2) 58 | library(limma) 59 | 60 | # Make up some data 61 | 62 | # True cell type proportions for 4 samples 63 | p_s1 <- c(0.5,0.3,0.2) 64 | p_s2 <- c(0.6,0.3,0.1) 65 | p_s3 <- c(0.3,0.4,0.3) 66 | p_s4 <- c(0.4,0.3,0.3) 67 | 68 | # Total numbers of cells per sample 69 | numcells <- c(1000,1500,900,1200) 70 | 71 | # Generate cell-level vector for sample info 72 | biorep <- rep(c("s1","s2","s3","s4"),numcells) 73 | length(biorep) 74 | 75 | # Numbers of cells for each of 3 clusters per sample 76 | n_s1 <- p_s1*numcells[1] 77 | n_s2 <- p_s2*numcells[2] 78 | n_s3 <- p_s3*numcells[3] 79 | n_s4 <- p_s4*numcells[4] 80 | 81 | cl_s1 <- rep(c("c0","c1","c2"),n_s1) 82 | cl_s2 <- rep(c("c0","c1","c2"),n_s2) 83 | cl_s3 <- rep(c("c0","c1","c2"),n_s3) 84 | cl_s4 <- rep(c("c0","c1","c2"),n_s4) 85 | 86 | # Generate cell-level vector for cluster info 87 | clust <- c(cl_s1,cl_s2,cl_s3,cl_s4) 88 | length(clust) 89 | 90 | prop.list <- getTransformedProps(clusters = clust, sample = biorep) 91 | 92 | # Assume s1 and s2 belong to group A and s3 and s4 belong to group B 93 | grp <- rep(c("A","B"), each=2) 94 | 95 | design <- model.matrix(~0+grp) 96 | design 97 | 98 | # Compare Grp A to B 99 | contrasts <- c(1,-1) 100 | 101 | propeller.ttest(prop.list, design=design, contrasts=contrasts, robust=TRUE, 102 | trend=FALSE, sort=TRUE) 103 | 104 | # Pretend additional sex variable exists and we want to control for it 105 | # in the linear model 106 | sex <- rep(c(0,1),2) 107 | des.sex <- model.matrix(~0+grp+sex) 108 | des.sex 109 | 110 | # Compare Grp A to B 111 | cont.sex <- c(1,-1,0) 112 | 113 | propeller.ttest(prop.list, design=des.sex, contrasts=cont.sex, robust=TRUE, 114 | trend=FALSE, sort=TRUE) 
115 | 116 | 117 | } 118 | \seealso{ 119 | \code{\link{propeller}}, \code{\link{getTransformedProps}}, 120 | \code{\link{lmFit}}, \code{\link{contrasts.fit}}, \code{\link{eBayes}} 121 | } 122 | \author{ 123 | Belinda Phipson 124 | } 125 | -------------------------------------------------------------------------------- /man/speckle-package.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/speckle-package.R 3 | \docType{package} 4 | \name{speckle-package} 5 | \alias{speckle} 6 | \alias{speckle-package} 7 | \title{speckle: Statistical methods for analysing single cell RNA-seq data} 8 | \description{ 9 | speckle contains functions for the analysis of single cell RNA-seq data. 10 | } 11 | \keyword{internal} 12 | -------------------------------------------------------------------------------- /man/speckle_example_data.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/speckle_example_data.R 3 | \name{speckle_example_data} 4 | \alias{speckle_example_data} 5 | \title{Generate example data} 6 | \usage{ 7 | speckle_example_data() 8 | } 9 | \value{ 10 | a dataframe containing cell-level information for sample, group and 11 | clusters 12 | } 13 | \description{ 14 | Generate example data 15 | } 16 | \examples{ 17 | 18 | speckle_example_data() 19 | 20 | } 21 | -------------------------------------------------------------------------------- /vignettes/speckle.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | author: "Belinda Phipson" 3 | title: "speckle: statistical methods for analysing single cell RNA-seq data" 4 | date: "`r BiocStyle::doc_date()`" 5 | package: "`r BiocStyle::pkg_ver('speckle')`" 6 | vignette: > 7 | %\VignetteEncoding{UTF-8} 8 | %\VignetteIndexEntry{speckle: statistical methods 
for analysing single cell RNA-seq data} 9 | %\VignetteEngine{knitr::rmarkdown} 10 | output: > 11 | BiocStyle::html_document 12 | html_document: 13 | fig_caption: yes 14 | fig_retina: FALSE 15 | keep_md: FALSE 16 | editor_options: 17 | chunk_output_type: console 18 | --- 19 | 20 | ```{r setup, include=FALSE} 21 | knitr::opts_chunk$set( 22 | echo = TRUE, 23 | message = FALSE, 24 | warning = FALSE 25 | ) 26 | ``` 27 | 28 | # Introduction 29 | 30 | The `r BiocStyle::Biocpkg("speckle")` package contains functions to analyse 31 | differences in cell type proportions in single cell RNA-seq data. As our 32 | research into specialised analyses of single cell data continues we anticipate 33 | that the package will be updated with new functions. 34 | 35 | The analysis of single cell RNA-seq data consists of a large number of steps, 36 | which can be iterative and also depend on the research question. There are many 37 | R packages that can do some or most of these steps. The analysis steps are 38 | described here briefly. 39 | 40 | Once the sequencing data has been summarised into counts over genes, quality 41 | control is performed to remove poor quality cells. Poor quality cells are often 42 | characterised as having very low total counts (library size) and very few genes 43 | detected. Lowly expressed and uninformative genes are filtered out, followed by 44 | appropriate normalisation. Dimensionality reduction and clustering of the cells 45 | is then performed. Cells that have similar transcriptional profiles cluster 46 | together, and these clusters (hopefully) correspond to something biologically 47 | relevant, such as different cell types. Differential expression between each 48 | cluster compared to all other clusters can highlight genes that are more highly 49 | expressed in each cluster. These marker genes help to determine the cell type 50 | each cluster corresponds to. 
Cell type identification is a process that often 51 | uses marker genes as well as a list of curated genes that are known to be 52 | expressed in each cell type. It is always helpful to visualise the data in a lot 53 | of different ways to aid in interpretation of the clusters using tSNE/UMAP 54 | plots, clustering trees and heatmaps of known marker genes. 55 | 56 | # Installation 57 | 58 | ```{r eval=FALSE} 59 | library(devtools) 60 | install_github("Oshlack/speckle") 61 | ``` 62 | 63 | # Finding significant differences in cell type proportions using propeller 64 | 65 | In order to determine whether there are statistically significant compositional 66 | differences between groups, there must be some form of biological replication in 67 | the experiment. This is so that we can estimate the variability of the cell type 68 | proportion estimates for each group. A classical statistical test for 69 | differences between two proportions is typically very sensitive to small changes 70 | and will almost always yield a significant p-value. Hence `propeller` is only 71 | suitable to use in single cell experiments where there are multiple groups and 72 | multiple biological replicates in at least one of the groups. The absolute 73 | minimum sample size is 2 in one group and 1 in the other group/s. Variance 74 | estimates are obtained from the group with more than 1 biological replicate, 75 | which assumes that the cell type proportion variances are similar 76 | between experimental conditions. 77 | 78 | The `propeller` test is performed after initial analysis of the single cell data 79 | has been done, i.e. after clustering and cell type assignment. The `propeller` 80 | function can take a `SingleCellExperiment` or `Seurat` object and extract the 81 | necessary information from the metadata. 
The basic model for `propeller` is that 82 | the cell type proportions for each sample are estimated based on the clustering 83 | information provided by the user or extracted from the relevant slots in the 84 | data objects. The proportions are then transformed using either an arcsin square 85 | root transformation, or logit transformation. For 86 | each cell type $i$, we fit a linear model with group as the explanatory variable 87 | using functions from the R Bioconductor package `r BiocStyle::Biocpkg("limma")`. 88 | Using `r BiocStyle::Biocpkg("limma")` to obtain p-values has the added benefit 89 | of performing empirical Bayes shrinkage of the variances. For every cell type 90 | we obtain a p-value that indicates whether that cell type proportion is 91 | statistically significantly different between two (or more) groups. 92 | 93 | # Load the libraries 94 | 95 | ```{r} 96 | library(speckle) 97 | library(SingleCellExperiment) 98 | library(CellBench) 99 | library(limma) 100 | library(ggplot2) 101 | library(scater) 102 | library(patchwork) 103 | library(edgeR) 104 | library(AnnotationDbi) 105 | library(org.Hs.eg.db) 106 | ``` 107 | 108 | 109 | # Loading data into R 110 | 111 | We are using single cell data from the `r BiocStyle::Biocpkg("CellBench")` 112 | package to illustrate how `propeller` works. This is an artificial dataset that 113 | is made up of an equal mixture of 3 different cell lines. There are three 114 | datasets corresponding to three different technologies: 10x genomics, CelSeq and 115 | DropSeq. 116 | 117 | ```{r} 118 | sc_data <- load_sc_data() 119 | ``` 120 | 121 | The way that `propeller` is designed to be used is in the context of a designed 122 | experiment where there are multiple biological replicates and multiple groups. 
123 | Comparing cell type proportions without biological replication should be done 124 | with caution as there will be a large degree of variability in the cell type 125 | proportions between samples due to technical factors (cell capture bias, 126 | sampling, clustering errors), as well as biological variability. The 127 | `r BiocStyle::Biocpkg("CellBench")` dataset does not have biological 128 | replication, so we will create several artificial biological replicates by 129 | bootstrapping the data. Bootstrapping has the advantage that it induces 130 | variability between bootstrap samples by sampling with replacement. Here we will 131 | treat the three technologies as the groups, and create artificial biological 132 | replicates within each group. Note that bootstrapping only induces sampling 133 | variability between our biological replicates, which will almost certainly be 134 | much smaller than the biological variability we would expect to see in a real 135 | dataset. 136 | 137 | The three single cell experiment objects in `sc_data` all have differing numbers 138 | of genes. The first step is to find all the common genes between all three 139 | experiments in order to create one large dataset. 140 | 141 | ```{r} 142 | commongenes1 <- rownames(sc_data$sc_dropseq)[rownames(sc_data$sc_dropseq) %in% 143 | rownames(sc_data$sc_celseq)] 144 | commongenes2 <- commongenes1[commongenes1 %in% rownames(sc_data$sc_10x)] 145 | 146 | sce_10x <- sc_data$sc_10x[commongenes2,] 147 | sce_celseq <- sc_data$sc_celseq[commongenes2,] 148 | sce_dropseq <- sc_data$sc_dropseq[commongenes2,] 149 | 150 | dim(sce_10x) 151 | dim(sce_celseq) 152 | dim(sce_dropseq) 153 | 154 | table(rownames(sce_10x) == rownames(sce_celseq)) 155 | table(rownames(sce_10x) == rownames(sce_dropseq)) 156 | ``` 157 | 158 | # Bootstrap additional samples 159 | 160 | This dataset does not have any biological replicates, so we will bootstrap 161 | additional samples and pretend that they are biological replicates. 
162 | Bootstrapping won't replicate true biological variation between samples, but we 163 | will ignore that for the purpose of demonstrating how `propeller` works. Note 164 | that we don't need to simulate gene expression measurements; `propeller` only 165 | uses cluster information, hence we simply bootstrap the column indices of the 166 | single cell count matrices. 167 | 168 | ```{r} 169 | i.10x <- seq_len(ncol(sce_10x)) 170 | i.celseq <- seq_len(ncol(sce_celseq)) 171 | i.dropseq <- seq_len(ncol(sce_dropseq)) 172 | 173 | set.seed(10) 174 | boot.10x <- sample(i.10x, replace=TRUE) 175 | boot.celseq <- sample(i.celseq, replace=TRUE) 176 | boot.dropseq <- sample(i.dropseq, replace=TRUE) 177 | 178 | sce_10x_rep2 <- sce_10x[,boot.10x] 179 | sce_celseq_rep2 <- sce_celseq[,boot.celseq] 180 | sce_dropseq_rep2 <- sce_dropseq[,boot.dropseq] 181 | ``` 182 | 183 | # Combine all SingleCellExperiment objects 184 | 185 | The `SingleCellExperiment` objects don't combine very easily, so I will create a 186 | new object manually, and retain only the information needed to run `propeller`. 
187 | 188 | ```{r} 189 | sample <- rep(c("S1","S2","S3","S4","S5","S6"), 190 | c(ncol(sce_10x),ncol(sce_10x_rep2),ncol(sce_celseq), 191 | ncol(sce_celseq_rep2), 192 | ncol(sce_dropseq),ncol(sce_dropseq_rep2))) 193 | cluster <- c(sce_10x$cell_line,sce_10x_rep2$cell_line,sce_celseq$cell_line, 194 | sce_celseq_rep2$cell_line,sce_dropseq$cell_line, 195 | sce_dropseq_rep2$cell_line) 196 | group <- rep(c("10x","celseq","dropseq"), 197 | c(2*ncol(sce_10x),2*ncol(sce_celseq),2*ncol(sce_dropseq))) 198 | 199 | allcounts <- cbind(counts(sce_10x),counts(sce_10x_rep2), 200 | counts(sce_celseq), counts(sce_celseq_rep2), 201 | counts(sce_dropseq), counts(sce_dropseq_rep2)) 202 | 203 | sce_all <- SingleCellExperiment(assays = list(counts = allcounts)) 204 | sce_all$sample <- sample 205 | sce_all$group <- group 206 | sce_all$cluster <- cluster 207 | ``` 208 | 209 | # Visualise the data 210 | 211 | Here I am going to use the Bioconductor package `r BiocStyle::Biocpkg("scater")` 212 | to visualise the data. The `r BiocStyle::Biocpkg("scater")` vignette goes 213 | quite deeply into quality 214 | control of the cells and the kinds of QC plots we like to look at. Here we will 215 | simply log-normalise the gene expression counts, perform dimensionality 216 | reduction (PCA) and generate PCA/TSNE/UMAP plots to visualise the relationships 217 | between the cells. 
218 | 219 | ```{r} 220 | sce_all <- scater::logNormCounts(sce_all) 221 | sce_all <- scater::runPCA(sce_all) 222 | sce_all <- scater::runUMAP(sce_all) 223 | ``` 224 | 225 | Plot PC1 vs PC2 colouring by cell line and technology: 226 | 227 | ```{r, fig.width=12, fig.height=6} 228 | pca1 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "cluster") + 229 | ggtitle("Cell line") 230 | pca2 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "group") + 231 | ggtitle("Technology") 232 | pca1 + pca2 233 | ``` 234 | 235 | Plot UMAP highlighting cell line and technology: 236 | 237 | ```{r, fig.width=12, fig.height=6} 238 | umap1 <- scater::plotReducedDim(sce_all, dimred = "UMAP", 239 | colour_by = "cluster") + 240 | ggtitle("Cell line") 241 | umap2 <- scater::plotReducedDim(sce_all, dimred = "UMAP", colour_by = "group") + 242 | ggtitle("Technology") 243 | umap1 + umap2 244 | ``` 245 | 246 | For this dataset UMAP is a little bit of overkill; the PCA plots show the 247 | relationships between the cells quite well. PC1 separates cells based on 248 | technology, and PC2 separates cells based on the cell line (clusters). From the 249 | PCA plots we can see that 10x is quite different to CelSeq and DropSeq, and the 250 | H2228 cell line is quite different to the remaining 2 cell lines. 251 | 252 | # Test for differences in cell line proportions in the three technologies 253 | 254 | In order to demonstrate `propeller` I will assume that the cell line information 255 | corresponds to clusters and all the analysis steps have been performed. Here we 256 | are interested in testing whether there are compositional differences between 257 | the three technologies: 10x, CelSeq and DropSeq. Since there are more than 2 258 | groups, `propeller` will perform an ANOVA to determine whether there is a 259 | significant shift in the cell type proportions between these three groups. 
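Before running the test, it may help to see what the proportions and the two variance stabilising transformations look like on a small example. The chunk below is only a minimal sketch in base R and is not the package's internal code. The `toy.*` objects are hypothetical and made up for illustration (the counts follow the proportions used in the `propeller` help page example), and the pseudo-count of 0.5 added before the logit is an assumption, chosen here so that proportions of exactly 0 or 1 would stay finite.

```{r}
# Toy cell counts (made up): 3 clusters (rows) x 4 samples (columns).
# matrix() fills column by column, so each row of numbers below is one sample.
toy.counts <- matrix(c(500, 300, 200,
                       900, 450, 150,
                       270, 360, 270,
                       480, 360, 360),
                     nrow = 3,
                     dimnames = list(c("c0", "c1", "c2"),
                                     c("s1", "s2", "s3", "s4")))

# Cell type proportions: divide every column by that sample's total cell count
toy.props <- sweep(toy.counts, 2, colSums(toy.counts), "/")

# Arcsin square root transformation of the proportions
toy.asin <- asin(sqrt(toy.props))

# Logit transformation; the 0.5 pseudo-count is an assumption for this sketch
toy.pseudo <- sweep(toy.counts + 0.5, 2, colSums(toy.counts + 0.5), "/")
toy.logit <- log(toy.pseudo / (1 - toy.pseudo))

toy.props
toy.asin
toy.logit
```

Each column of `toy.props` sums to 1, and a linear model is then fitted to one row (cell type) of the transformed matrix at a time.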
260 | 261 | The `propeller` function can take a `SingleCellExperiment` object or `Seurat` 262 | object as input and extract the three necessary pieces of information from the 263 | cell information stored in `colData`. The three essential pieces of information 264 | are 265 | 266 | * cluster 267 | * sample 268 | * group 269 | 270 | If these arguments are not explicitly passed to the `propeller` function, then 271 | these are extracted from the `SingleCellExperiment` or `Seurat` object. Upper 272 | or lower case is acceptable, but 273 | the variables need to be named exactly as stated in the list above. For a 274 | `Seurat` object, the cluster information is extracted from `Idents(x)`. 275 | 276 | The default of propeller is to perform the logit transformation: 277 | ```{r} 278 | # Perform logit transformation 279 | propeller(sce_all) 280 | ``` 281 | 282 | An alternative variance stabilising transformation is the arcsin square root 283 | transformation. 284 | 285 | ```{r} 286 | # Perform arcsin square root transformation 287 | propeller(sce_all, transform="asin") 288 | ``` 289 | 290 | The results from using the two different transforms are a little bit different, 291 | with the H1975 cell line being statistically significant using the arc sin 292 | square root transform, and not significant after using the logit transform. 293 | 294 | Another option for running `propeller` is for the user to supply the cluster, 295 | sample and group information explicitly to the `propeller` function. 296 | 297 | ```{r} 298 | propeller(clusters=sce_all$cluster, sample=sce_all$sample, group=sce_all$group) 299 | ``` 300 | 301 | The cell lines were mixed together in roughly equal proportions (~0.33) and 302 | hence we don't expect to see significant differences between the three clusters. 
303 | However, because bootstrapping the samples doesn't incorporate enough 304 | variability between the samples to mimic true biological variability, we can see 305 | that the H1975 cluster looks significantly different between the three 306 | technologies. The proportion of this cell line is closer to 0.4 for CelSeq and 307 | DropSeq, and 0.34 for the 10x data. 308 | 309 | # Visualise the results 310 | 311 | In the `r BiocStyle::Biocpkg("speckle")` package there is a plotting function 312 | `plotCellTypeProps` that takes a `Seurat` or `SingleCellExperiment` object, 313 | extracts sample and cluster information and outputs a barplot of cell type 314 | proportions between the samples. The user also has the option of supplying the 315 | cluster and sample cell information instead of an R object. The output is a 316 | `ggplot2` object that the user can then manipulate however they please. 317 | 318 | ```{r,fig.height=4,fig.width=7} 319 | plotCellTypeProps(sce_all) 320 | ``` 321 | 322 | Alternatively, we can obtain the cell type proportions and transformed 323 | proportions directly by running the `getTransformedProps` function which takes 324 | the cluster and sample information as input. The output from 325 | `getTransformedProps` is a list with the cell type counts, transformed 326 | proportions and proportions as elements. 327 | 328 | ```{r,fig.height=5,fig.width=7} 329 | props <- getTransformedProps(sce_all$cluster, sce_all$sample, transform="logit") 330 | barplot(props$Proportions, col = c("orange","purple","dark green"),legend=TRUE, 331 | ylab="Proportions") 332 | ``` 333 | 334 | Call me old-school, but I still like looking at stripcharts to visualise results 335 | and see whether the significant p-values make sense. 
336 | 337 | ```{r,fig.height=4,fig.width=10} 338 | par(mfrow=c(1,3)) 339 | for(i in 1:3){ 340 | stripchart(props$Proportions[i,]~rep(c("10x","celseq","dropseq"),each=2), 341 | vertical=TRUE, pch=16, method="jitter", 342 | col = c("orange","purple","dark green"),cex=2, ylab="Proportions") 343 | title(rownames(props$Proportions)[i]) 344 | } 345 | ``` 346 | 347 | If you are interested in seeing which models best fit the data in terms of the 348 | cell type variances, there are two plotting functions that can do this: 349 | `plotCellTypeMeanVar` and `plotCellTypePropsMeanVar`. For this particular 350 | dataset it isn't very informative because there are only three cell "types" 351 | and no biological variability. 352 | 353 | ```{r} 354 | par(mfrow=c(1,1)) 355 | plotCellTypeMeanVar(props$Counts) 356 | plotCellTypePropsMeanVar(props$Counts) 357 | ``` 358 | 359 | 360 | # Fitting linear models using the transformed proportions directly 361 | 362 | If you are like me, you won't feel very comfortable with a black-box approach 363 | where one function simply spits out a table of results. If you would like to 364 | have more control over your linear model and include extra covariates then you 365 | can fit a linear model in a more direct way using the transformed proportions 366 | that can be obtained by running the `getTransformedProps` function. 367 | 368 | We have already obtained the proportions and transformed proportions when we ran 369 | the `getTransformedProps` function above. This function outputs a list object 370 | with three elements: `Counts`, `TransformedProps` and `Proportions`. These are 371 | all matrices with clusters/cell types in the rows and samples in the columns. 372 | 373 | ```{r} 374 | names(props) 375 | 376 | props$TransformedProps 377 | ``` 378 | 379 | First we need to set up our sample information in much the same way we would if 380 | we were analysing bulk RNA-seq data. 
We can pretend that we have pairing 381 | information which corresponds to our original vs bootstrapped samples to make 382 | our model a little more complicated for demonstration purposes. 383 | 384 | ```{r} 385 | group <- rep(c("10x","celseq","dropseq"),each=2) 386 | pair <- rep(c(1,2),3) 387 | data.frame(group,pair) 388 | ``` 389 | 390 | We can set up a design matrix taking into account group and pairing information. 391 | Please note that the way that `propeller` has been designed is such that the 392 | group information is *always* first in the design matrix specification, and 393 | there is NO intercept. If you are new to design matrices and linear modelling, I 394 | would highly recommend reading the `r BiocStyle::Biocpkg("limma")` manual, which 395 | is incredibly extensive and covers a variety of different experimental set ups. 396 | 397 | ```{r} 398 | design <- model.matrix(~ 0 + group + pair) 399 | design 400 | ``` 401 | 402 | In our example, we have three groups, 10x, CelSeq and DropSeq. In order to 403 | perform an ANOVA to test for cell type composition differences between these 404 | 3 technologies, we can use the `propeller.anova` function. The `coef` argument 405 | tells the function which columns of the design matrix correspond to the groups 406 | we are interested in testing. Here we are interested in the first three columns. 407 | 408 | ```{r} 409 | propeller.anova(prop.list=props, design=design, coef = c(1,2,3), 410 | robust=TRUE, trend=FALSE, sort=TRUE) 411 | ``` 412 | 413 | Note that the p-values are smaller here than before because we have specified 414 | a pairing vector that states which samples were bootstrapped and which are the 415 | original samples. 416 | 417 | If we were interested in testing only 10x versus DropSeq we could alternatively 418 | use the `propeller.ttest` function and specify a contrast that tests for this 419 | comparison with our design matrix. 
420 | 421 | ```{r} 422 | design 423 | mycontr <- makeContrasts(group10x-groupdropseq, levels=design) 424 | propeller.ttest(props, design, contrasts = mycontr, robust=TRUE, trend=FALSE, 425 | sort=TRUE) 426 | ``` 427 | 428 | Finally note that the `robust` and `trend` parameters are parameters for the 429 | `eBayes` function in `r BiocStyle::Biocpkg("limma")`. When `robust = TRUE`, 430 | robust empirical Bayes shrinkage of the variances is performed which mitigates 431 | the effects of outlying observations. We set `trend = FALSE` as we don't expect 432 | a mean-variance trend after performing our variance stabilising transformation. 433 | There may also be an error when `trend` is set to TRUE because there are often 434 | not enough data points to estimate the trend. 435 | 436 | # More complex (or just different) experimental designs 437 | 438 | ## Fitting a continuous variable rather than groups 439 | 440 | Let us assume that we expect that the different technologies have a meaningful 441 | ordering to them, and we would like to find the cell types that are increasing 442 | or decreasing along this trend. In more complex scenarios beyond group 443 | comparisons I would recommend taking the transformed proportions from the 444 | `getTransformedProps` function and using the linear model fitting functions from 445 | the `r BiocStyle::Biocpkg("limma")` package directly. 446 | 447 | Let us assume that the ordering of the technologies is 10x -> celseq -> dropseq. 448 | Then we can recode them 1, 2, 3 and treat the technologies as a 449 | continuous variable. Obviously this scenario doesn't make much sense 450 | biologically, but we will continue for demonstration purposes. 
451 | 452 | ```{r} 453 | group 454 | dose <- rep(c(1,2,3), each=2) 455 | 456 | des.dose <- model.matrix(~dose) 457 | des.dose 458 | 459 | fit <- lmFit(props$TransformedProps,des.dose) 460 | fit <- eBayes(fit, robust=TRUE) 461 | topTable(fit) 462 | ``` 463 | 464 | Here the log fold changes are reported on the transformed data, so they are 465 | not as easy to interpret directly. The positive logFC indicates that the cell 466 | type proportions are increasing (for example for H1975), and a negative 467 | logFC indicates that the proportions are decreasing across the ordered 468 | technologies 10x -> celseq -> dropseq. 469 | 470 | You can get the estimates from the model on the proportions directly by fitting 471 | the same model to the proportions. Here the `logFC` is the slope of the trend 472 | line on the proportions, and the `AveExpr` is the average of the proportions 473 | across all technologies. 474 | 475 | ```{r} 476 | fit.prop <- lmFit(props$Proportions,des.dose) 477 | fit.prop <- eBayes(fit.prop, robust=TRUE) 478 | topTable(fit.prop) 479 | ``` 480 | 481 | You could plot the continuous variable `dose` against the proportions and add 482 | trend lines, for example. 483 | 484 | ```{r,fig.height=4,fig.width=10} 485 | fit.prop$coefficients 486 | 487 | par(mfrow=c(1,3)) 488 | for(i in 1:3){ 489 | plot(dose, props$Proportions[i,], main = rownames(props$Proportions)[i], 490 | pch=16, cex=2, ylab="Proportions", cex.lab=1.5, cex.axis=1.5, 491 | cex.main=2) 492 | abline(a=fit.prop$coefficients[i,1], b=fit.prop$coefficients[i,2], col=4, 493 | lwd=2) 494 | } 495 | ``` 496 | 497 | What I recommend in this instance is using the p-values from the analysis on the 498 | transformed data, and the reported statistics (i.e. the coefficients from the 499 | model) obtained from the analysis on the proportions for visualisation purposes. 
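
The recommendation above (p-values from the transformed scale, effect sizes from the proportion scale) can be sketched in a self-contained way. This is only an illustration under stated assumptions: the proportions in `toy_props` are made-up values, the helper `combine_fits` is a hypothetical name, and base R `lm` is used in place of limma's `lmFit`/`eBayes`, so there is no empirical Bayes moderation of the variances here.

```r
# Hypothetical toy proportions: 3 cell types (rows) x 6 samples (columns).
# Each column sums to 1. These values are invented for illustration only.
toy_props <- rbind(
  typeA = c(0.30, 0.32, 0.36, 0.38, 0.42, 0.44),
  typeB = c(0.35, 0.34, 0.33, 0.32, 0.30, 0.29),
  typeC = c(0.35, 0.34, 0.31, 0.30, 0.28, 0.27)
)
toy_dose <- rep(c(1, 2, 3), each = 2)

# Logit transform of the proportions
toy_logit <- log(toy_props / (1 - toy_props))

# For each cell type, take the p-value for the slope from the fit on the
# transformed scale and the slope itself from the fit on the proportions.
# NOTE: plain lm() per cell type, not limma's moderated statistics.
combine_fits <- function(props, trans, x) {
  res <- lapply(rownames(props), function(ct) {
    coef_trans <- summary(lm(trans[ct, ] ~ x))$coefficients
    coef_prop <- coef(lm(props[ct, ] ~ x))
    data.frame(cell_type = ct,
               slope_prop = unname(coef_prop["x"]),
               p_value = coef_trans["x", "Pr(>|t|)"])
  })
  do.call(rbind, res)
}

toy_summary <- combine_fits(toy_props, toy_logit, toy_dose)
toy_summary
```

With the real data, the analogous table would presumably pair the p-values from `topTable(fit)` with the slopes in `fit.prop$coefficients`, matched by cell type name.
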
500 | 501 | ## Including random effects 502 | 503 | If you have a random effect that you would like to account for in your analysis, 504 | for example repeated measures on the same individual, then you can use the 505 | `duplicateCorrelation` function from the `r BiocStyle::Biocpkg("limma")` package. 506 | 507 | For illustration purposes, let us assume that `pair` indicates samples taken 508 | from the same individual (or they could represent technical replicates), and we 509 | would like to account for this in our analysis 510 | using a random effect. Again, we fit our models on the transformed proportions 511 | in order to obtain the p-values. 512 | 513 | We will formulate the design matrix with an intercept for this example, and test 514 | the differences in technologies relative to 10x. The `block` parameter will be 515 | the `pair` variable. Note that the design matrix now does not include `pair` as 516 | a fixed effect. 517 | 518 | ```{r} 519 | des.tech <- model.matrix(~group) 520 | 521 | dupcor <- duplicateCorrelation(props$TransformedProps, design=des.tech, 522 | block=pair) 523 | dupcor 524 | ``` 525 | 526 | The consensus correlation is quite high (`r dupcor$consensus.correlation`), 527 | which we expect because we bootstrapped these additional samples. 528 | 529 | ```{r} 530 | # Fitting the linear model accounting for pair as a random effect 531 | fit1 <- lmFit(props$TransformedProps, design=des.tech, block=pair, 532 | correlation=dupcor$consensus) 533 | fit1 <- eBayes(fit1) 534 | summary(decideTests(fit1)) 535 | 536 | # Differences between celseq vs 10x 537 | topTable(fit1,coef=2) 538 | 539 | # Differences between dropseq vs 10x 540 | topTable(fit1, coef=3) 541 | ``` 542 | 543 | For celseq vs 10x, H1975 and H2228 are significantly different, with a greater 544 | proportion of H1975 545 | cells detected in celseq, and fewer H2228 cells. For dropseq vs 10x, there is a 546 | higher proportion of H1975 cells. 
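
To see why `coef=2` and `coef=3` correspond to celseq vs 10x and dropseq vs 10x, it may help to look at how a design matrix with an intercept is parameterised. The sketch below uses base R only; `toy_group`, `toy_design`, `toy_y` and `toy_coef` are hypothetical names, and the response values are invented rather than taken from the CellBench data.

```r
# Group factor with 10x as the first (reference) level
toy_group <- factor(rep(c("10x", "celseq", "dropseq"), each = 2),
                    levels = c("10x", "celseq", "dropseq"))

# With an intercept, column 1 is the reference level (10x) and columns 2 and 3
# are the differences of celseq and dropseq relative to 10x.
toy_design <- model.matrix(~ toy_group)
colnames(toy_design)

# Hypothetical transformed proportions for a single cell type
toy_y <- c(-0.60, -0.58, -0.40, -0.42, -0.35, -0.37)
toy_coef <- coef(lm(toy_y ~ toy_group))

# Coefficient 2 equals mean(celseq) - mean(10x);
# coefficient 3 equals mean(dropseq) - mean(10x).
all.equal(unname(toy_coef[2]), mean(toy_y[3:4]) - mean(toy_y[1:2]))
all.equal(unname(toy_coef[3]), mean(toy_y[5:6]) - mean(toy_y[1:2]))
```

This sketch ignores the `block` and `correlation` arguments. As far as I understand, these affect the standard errors rather than the meaning of the coefficients, and in a balanced layout like this one the estimates themselves should be essentially unchanged.
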
547 | 548 | If you want to do an ANOVA between the three groups: 549 | ```{r} 550 | topTable(fit1, coef=2:3) 551 | ``` 552 | 553 | Generally, you can perform any analysis on the transformed proportions that you 554 | would normally do when using limma (i.e. on roughly normally distributed data). 555 | For more complex random effects models with 2 or more random effects, you can 556 | use the `r BiocStyle::CRANpkg("lme4")` package. 557 | 558 | 559 | # Tips for clustering 560 | 561 | The experimental groups are likely to contribute large sources of variation in 562 | the data. In the `r BiocStyle::Biocpkg("CellBench")` data the technology effect 563 | is larger than the cell line effect. In order to cluster the data to produce 564 | meaningful cell types that will then feed into meaningful tests for proportions, 565 | the cell types should be represented in as many samples as possible. Users 566 | should consider using integration techniques on their single cell data 567 | prior to clustering, integrating on biological sample or perhaps experimental 568 | group. See methods such as Harmony, Liger and Seurat's integration technique 569 | for more information. 570 | 571 | Cell type label assignment should not be too refined such that every sample has 572 | many unique cell types. The `propeller` function can handle proportions of 0 and 573 | 1 without breaking, but it is not very meaningful if every cell type difference 574 | is statistically significant. Consider testing cell type categories that are 575 | broader for more meaningful results, perhaps by combining clusters that are 576 | highly similar. The refined clusters can always be explored in terms of gene 577 | expression differences later on. 578 | 579 | It may be appropriate to perform cell type assignment using classification 580 | methods rather than clustering. This allows 581 | the user to classify cells into known cell types, but you may run the risk of 582 | missing novel cell types. 
583 | A combination of approaches may be necessary depending on the dataset. 584 | Good luck. The more heterogeneous the dataset, the more tricky this becomes. 585 | 586 | # See also 587 | 588 | Another approach is to model the counts directly using statistical models that 589 | can take into account biological variability, such as negative binomial. 590 | You can read the 591 | [OSCA book chapter](http://bioconductor.org/books/release/OSCA/multi-sample-comparisons.html) 592 | on how to use the `r BiocStyle::Biocpkg("edgeR")` package to do this. 593 | 594 | # Classifying male and female cells from scRNA-seq data 595 | 596 | A common quality control check in bulk RNA-seq is to check the sex of the 597 | samples. The simplest method to do this is to check expression of *XIST*, which 598 | is the gene responsible for X-inactivation. It is highly expressed in females, 599 | and not expressed in males. In experiments where the sex of the samples has not 600 | been recorded, the variation due to sex can often be captured by the top 601 | principal component in an MDS or PCA plot. It is important to know if the 602 | samples are a mix of males and females and to take this information into account 603 | in downstream analysis. 604 | 605 | It turns out that for single cell data, it is not a simple matter to classify 606 | cells as male or female. The main reason for this is that the cells are 607 | much more lowly sequenced compared to bulk RNA-seq samples resulting in low or 608 | zero counts for many genes, including *XIST* and other X- and Y-chromosome 609 | genes. Thus simply trying to classify cells as male or female based on observed 610 | counts for *XIST* leaves many cells unable to be classified. 611 | 612 | There are a few reasons for wanting to classify male and female cells. 
First, it 613 | could form part of the analysis assessing quality of the cells, and if sex is 614 | not information that is easily available, it is an additional variable we can 615 | predict from the gene expression. This may then inform further analysis of the 616 | data, by allowing us to take sex into account in the analysis. 617 | 618 | We have built a classifier to predict the sex of each cell based on logistic 619 | regression for human and mouse cells. The input is the matrix of counts where 620 | the genes are the rows and the cells are the columns. The row names of the counts 621 | matrix need to be gene symbols. The `allcounts` data object contains the counts 622 | for all the cells used in the `propeller` function, but the row names are ENSEMBL 623 | IDs. The first step is thus converting the ENSEMBL IDs to gene symbols. 624 | 625 | ```{r} 626 | allcounts[1:5,1:5] 627 | nc <- normCounts(allcounts, log=TRUE) 628 | avgexp <- rowMeans(nc) 629 | o <- order(avgexp, decreasing = TRUE) 630 | allcounts2 <- allcounts[o,] 631 | allcounts2[1:5,1:5] 632 | ann <- AnnotationDbi::select(org.Hs.eg.db, keys=rownames(allcounts2), 633 | columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL") 634 | m <- match(rownames(allcounts2), ann$ENSEMBL) 635 | symbol <- ann$SYMBOL[m] 636 | table(duplicated(symbol)) 637 | allcounts2 <- allcounts2[!duplicated(symbol) & !is.na(symbol),] 638 | 639 | rownames(allcounts2) <- symbol[!duplicated(symbol) & !is.na(symbol)] 640 | 641 | table(duplicated(colnames(allcounts2))) 642 | colnames(allcounts2) <- paste(colnames(allcounts2),1:ncol(allcounts2), sep=".") 643 | table(duplicated(rownames(allcounts2))) 644 | ``` 645 | 646 | Now that the counts matrix is in the correct format we can call the 647 | `classifySex` function. 
648 | 649 | ```{r} 650 | sex <- classifySex(allcounts2, genome="Hs", qc=FALSE) 651 | table(sex$prediction) 652 | ``` 653 | 654 | The cell lines were all derived from females, so the sex of the cells is 655 | correctly classified as female. 656 | 657 | 658 | 659 | # Session Info 660 | 661 | ```{r} 662 | sessionInfo() 663 | ``` 664 | 665 | 666 | 667 | 668 | --------------------------------------------------------------------------------