├── .DS_Store
├── .gitignore
├── .readthedocs.yaml
├── README.md
├── docs
│ ├── .DS_Store
│ ├── Identifying cellular structure
│ │ ├── .DS_Store
│ │ ├── 3-1.ipynb
│ │ ├── 3-2.ipynb
│ │ ├── 3-3.ipynb
│ │ ├── 3-4.ipynb
│ │ └── 3-5.ipynb
│ ├── Introduction
│ │ ├── 1-1.md
│ │ ├── 1-2.md
│ │ ├── 1-4.ipynb
│ │ └── mdata_sub.h5mu
│ ├── error.md
│ ├── index.md
│ ├── others
│ │ ├── .DS_Store
│ │ ├── 7-1.md
│ │ └── ncbi_submitted.assets
│ │ │ ├── image-20230802030612620.png
│ │ │ ├── image-20230802030645455.png
│ │ │ ├── image-20230802030748770.png
│ │ │ ├── image-20230802030802855.png
│ │ │ ├── image-20230802030827209.png
│ │ │ ├── image-20230802030858128.png
│ │ │ ├── image-20230802031118413.png
│ │ │ ├── image-20230804130900926.png
│ │ │ ├── image-20230804130953860.png
│ │ │ ├── image-20230804131237199.png
│ │ │ ├── image-20230804131322335.png
│ │ │ ├── image-20230804131410657.png
│ │ │ ├── image-20230804131629730.png
│ │ │ ├── image-20230804131657927.png
│ │ │ ├── image-20230804145936324.png
│ │ │ ├── image-20230814184244061.png
│ │ │ ├── image-20230814184344231.png
│ │ │ ├── image-20230814184440738.png
│ │ │ ├── image-20230814184605700.png
│ │ │ ├── image-20230814184643354.png
│ │ │ ├── image-20230814200428123.png
│ │ │ ├── image-20230816012830474.png
│ │ │ ├── image-20230816012842851.png
│ │ │ ├── image-20230816012920228.png
│ │ │ ├── image-20230816013029023.png
│ │ │ ├── image-20230816013300087.png
│ │ │ ├── image-20230816013351757.png
│ │ │ ├── image-20230816013430633.png
│ │ │ ├── image-20230816015247633.png
│ │ │ ├── image-20230816015518442.png
│ │ │ ├── image-20230816015550553.png
│ │ │ ├── image-20230816130240947.png
│ │ │ ├── image-20230816130413781.png
│ │ │ ├── image-20230816131154766.png
│ │ │ ├── image-20230816131708566.png
│ │ │ ├── image-20230816131944726.png
│ │ │ ├── image-20230816132034270.png
│ │ │ ├── image-20230816132100337.png
│ │ │ ├── image-20230816132200585.png
│ │ │ └── image-20230816132431751.png
│ ├── overrides
│ │ └── main.html
│ ├── preprocess
│ │ ├── 2-1.ipynb
│ │ ├── 2-2.ipynb
│ │ ├── 2-3.ipynb
│ │ ├── 2-4.ipynb
│ │ └── 2-5.ipynb
│ ├── ref
│ │ └── ref.bib
│ ├── stages
│ │ ├── 4-1.ipynb
│ │ ├── 4-2.ipynb
│ │ └── 4-3.ipynb
│ └── trajectory
│ │ ├── 5-1.ipynb
│ │ └── 5-2.ipynb
├── mkdocs.yml
└── requirements.txt
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/.DS_Store
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | data
2 | .ipynb_checkpoints
3 | *.h5
4 | *.h5ad
5 | *.loom
6 | *.txt
7 | *temp*
--------------------------------------------------------------------------------
/.readthedocs.yaml:
--------------------------------------------------------------------------------
1 | # .readthedocs.yaml
2 | # Read the Docs configuration file
3 | # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
4 |
5 | # Required
6 | version: 2
7 |
8 | # Set the version of Python and other tools you might need
9 | build:
10 | os: ubuntu-20.04
11 | tools:
12 | python: "3.8"
13 | # You can also specify other tool versions:
14 | # nodejs: "16"
15 | # rust: "1.55"
16 | # golang: "1.17"
17 |
18 | mkdocs:
19 | configuration: mkdocs.yml
20 |
21 | # If using Sphinx, optionally build your docs in additional formats such as PDF
22 | # formats:
23 | # - pdf
24 |
25 | # Optionally declare the Python requirements required to build your docs
26 | python:
27 | install:
28 | - requirements: requirements.txt
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
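1 | # Single-Cell Sequencing: A Best-Practice Analysis Pipeline
2 |
3 | - Author: starlitnightly
4 | - Date: 2023.07.14
5 |
6 | Our tutorial is available on [Read the Docs](https://single-cell-tutorial.readthedocs.io/zh/latest/): https://single-cell-tutorial.readthedocs.io/zh/latest/
7 |
8 | !!! note Preface
9 | I have been working on single-cell analysis for some time. Most Chinese-language tutorials analyze data in R; few use Python, and those that do are often direct translations of the scanpy tutorials, which may already be somewhat dated. Here we draw on [Single-cell best practices](https://www.sc-best-practices.org/preamble.html) and hope to offer practitioners in China a complete tutorial and analysis guide.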
10 |
11 |
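12 | ## Introduction
13 |
14 | The human body is a complex machine that depends heavily on cells, the basic units of life. Cells fall into different types and can change type during development, disease, or regeneration. This cellular heterogeneity shows in morphology, function, and gene expression profiles. Strong perturbations can disrupt cell types, affect the whole system, and even cause severe diseases such as cancer [Macaulay et al., 2017]. Understanding how cells behave in their normal state and under perturbation is therefore essential to improving our understanding of the whole cellular system.
15 |
16 | This enormous task can be approached in different ways; the most promising is to profile cells individually. So far, the transcriptome of each cell has mainly been measured with a process called single-cell RNA sequencing. Recent advances in single-cell genomics make it possible to combine transcriptomic information with spatial, chromatin-accessibility, or protein information. These advances can reveal complex regulatory mechanisms, but they also add complexity for the data analyst.
17 |
18 | Today, data analysts face a vast landscape of more than 1,000 computational single-cell analysis methods. Navigating this range of tools to produce reliable, state-of-the-art results is increasingly challenging.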
19 |
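20 | ## What this book covers
21 |
22 | This book aims to teach newcomers and professionals alike the best practices of single-cell sequencing analysis in Python. It walks through common analysis steps, from preprocessing to visualization and statistical evaluation, and goes further where needed. Working through the book should enable you to analyze unimodal and multimodal single-cell sequencing data independently. The guidelines and recommendations are meant to teach not only how to run a single-cell analysis but how to run it correctly. Wherever possible, our recommendations are based on external benchmarks and evaluations. Finally, we see this book as a practical resource for single-cell data analysts that can be updated easily when recommendations change.
23 |
24 | ## What this book does not cover
25 |
26 | This book does not cover the fundamentals of biology or computer science, including programming. It is also not a complete collection of every analysis tool designed for a given task. We emphasize tools that have been externally validated to work best on the data at hand, or methods that the community regards as best practice. Where external validation is not possible, we recommend workflows based only on our own extensive experience.
27 |
28 | ## Structure of the book
29 |
30 | Each chapter corresponds to a different stage of a typical single-cell data analysis project. An analysis workflow usually follows the order of the chapters, with some flexibility depending on the downstream goals. Every chapter contains many references, and we encourage readers to consult the original sources behind our statements. Although we try to provide the necessary background where possible, our summaries cannot always capture the full reasoning behind a recommendation.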
31 |
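32 | ## Prerequisites
33 |
34 | Bioinformatics is a challenging field for newcomers because it requires some understanding of both biology and computer science. Single-cell analysis is more demanding still, because it combines many subfields and the datasets are usually large. This book cannot cover every prerequisite for computational single-cell analysis, so we give a rough overview of the relevant topics below. The following links may improve your experience with the book:
35 |
36 | Basic Python programming. You should be familiar with control flow (loops, conditionals, and so on), basic data structures (lists, dictionaries, sets), and the core functionality of the most common libraries, such as Pandas and NumPy. If you are new to programming and Python, we strongly recommend the Python MOOCs by Song Tian of Beijing Institute of Technology: [Python basics](https://www.icourse163.org/course/BIT-268001) and [Python data processing and visualization](https://www.icourse163.org/course/BIT-1001870002).
37 |
38 | Knowing the basics of the AnnData and scanpy packages helps but is not strictly required. The introduction to AnnData in this book is enough to follow along, and the book presents workflows that use scanpy. We cannot, however, cover all of scanpy's functionality. If you are new to scanpy, we strongly recommend working through the [scanpy tutorials](https://scanpy.readthedocs.io/en/stable/tutorials.html) and consulting the [scanpy API reference](https://scanpy.readthedocs.io/en/stable/api.html) as needed.
39 |
40 | If you are interested in multimodal data analysis, we recommend learning the basics of muon and MuData. This book introduces MuData in more detail but covers muon only briefly, much as it does for AnnData and scanpy. The [muon tutorials](https://muon-tutorials.readthedocs.io/en/latest/) are a good starting point for multimodal analysis with muon.
41 |
42 | Basic biology. Although we outline how the data are generated, we do not cover the fundamentals of DNA, RNA, and proteins. If you are entirely new to molecular biology, we recommend "Molecular Biology of the Cell" by Bruce Alberts et al.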
43 |
44 | ## License
45 |
46 |
47 |
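48 | This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. We thank Single-cell best practices once more for its contribution to single-cell tutorials; this book builds on Single-cell best practices combined with the author's own analysis experience.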
--------------------------------------------------------------------------------
/docs/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/.DS_Store
--------------------------------------------------------------------------------
/docs/Identifying cellular structure/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/Identifying cellular structure/.DS_Store
--------------------------------------------------------------------------------
/docs/Introduction/1-1.md:
--------------------------------------------------------------------------------
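1 | # 1-1. Prior Art
2 |
3 | ## 1. The Bioconductor OSCA and OSTA books
4 |
5 | "Orchestrating Single-Cell Analysis with Bioconductor" (Bioconductor OSCA) [@Alvarez2020] is an online book that teaches common single-cell RNA-Seq analysis workflows using the R-based Bioconductor [@pa:Huber2015] ecosystem. The paper of the same name [@pa:Huber2015] gives an overview of single-cell analysis with Bioconductor, while the book is an online companion that goes into more detail and provides many code examples.
6 |
7 | The book is very thorough on basic single-cell RNA-Seq analysis, with detailed explanations and many example workflows. It does not, however, cover other single-cell omics such as scATAC-seq. Spatial transcriptomics is covered in the complementary book "Orchestrating Spatially-Resolved Transcriptomics Analysis with Bioconductor" (Bioconductor OSTA, https://lmweber.org/OSTA-book/). Because these books are written for the Bioconductor ecosystem, they only use tools available on Bioconductor. As the books themselves point out, this does not necessarily lead to the best analysis. We consider them especially useful for readers with basic R skills and a strong biology background who want to learn single-cell and spatial transcriptomics analysis with Bioconductor.
8 |
9 | ## 2. Current best practices in single-cell RNA-seq analysis: a tutorial
10 |
11 | "Current best practices in single-cell RNA-seq analysis: a tutorial" [@pa:Lücken2019] is a paper by Malte Lücken and Fabian Theis that presents best-practice methods for single-cell RNA-Seq analysis. Its distinctive contribution is that it does not merely review possible analysis steps but gives best-practice recommendations based on independent benchmarks. Where no such benchmark exists, it gives general recommendations instead. The paper comes with an example analysis of the mouse intestinal epithelium dataset from Haber et al. [@pa:Haber2017].
12 |
13 | Compared with Bioconductor OSCA, the paper and its example analysis are not biased toward a particular set of tools and cover a broader range of topics. Like the OSCA paper and book, however, Lücken and Theis do not cover more recent topics such as RNA velocity, spatial transcriptomics, or multi-omics. We strongly recommend the paper as an introduction to and overview of the field, and as a first set of best-practice recommendations.
14 |
15 | The chapters of this book are based on the latest best practices and offer an up-to-date view of the field. The analysis workflows are also explained in more detail, to give readers the background they need to run the methods. In general, we recommend reading the chapters of this book closely rather than studying the associated case study.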
16 |
--------------------------------------------------------------------------------
/docs/Introduction/1-2.md:
--------------------------------------------------------------------------------
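1 | # 1-2. Single-Cell RNA Sequencing
2 |
3 | This chapter briefly introduces the most widely used single-cell ribonucleic acid (RNA) sequencing technologies and the basic molecular biology concepts behind them. Multimodal and spatial sequencing are out of scope here and are covered in the corresponding advanced chapters. Every sequencing technology has its own strengths and limitations, and data analysts must understand them to be aware of possible biases in the data.
4 |
5 | ## 1. The building blocks of life
6 |
7 | Life is the property that distinguishes living organisms from dead or inanimate entities. Most definitions of the term "life" share one common entity: the cell. Cells form open systems that maintain homeostasis, have a metabolism, grow, adapt to their environment, reproduce, respond to stimuli, and organize themselves. Cells are therefore the basic building blocks of life, first discovered in 1665 by the English scientist Robert Hooke. Hooke examined a thin slice of cork with a very primitive microscope and was surprised to find that it looked like a honeycomb. He called these tiny units "cells".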
8 |
9 |
10 |
11 | Figure: diagram of a cell.
12 |
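13 | In 1839, Matthias Jakob Schleiden and Theodor Schwann first described cell theory. It states that all living organisms are made up of cells. Cells act as functional units and arise from other cells, which makes them the basic unit of reproduction.
14 |
15 | Since the early definition of cell theory, researchers have found that energy flows within cells, that hereditary information is passed from cell to cell in the form of DNA, and that all cells have almost the same chemical composition. There are two general types of cells, eukaryotic and prokaryotic. Eukaryotic cells contain a nucleus, whose nuclear envelope encloses the chromosomes; prokaryotic cells have only a nucleoid region and no nucleus. The nucleus holds the cell's genomic deoxyribonucleic acid (DNA) and gives eukaryotes their name: nucleus is Latin for kernel or seed. Eukaryotes are organisms made up of a single cell (unicellular) or many cells (multicellular), whereas prokaryotes are single-celled organisms. Eukaryotic cells differ from prokaryotic cells in being highly compartmentalized: membrane-bound organelles carry out highly specialized functions and provide essential support for the cell.
16 |
17 | A eukaryotic cell has, on average, about 10,000 times the volume of a prokaryotic cell, with a rich mix of organelles and a cytoskeleton made of microtubules, microfilaments, and intermediate filaments. The DNA replication machinery reads the hereditary information stored in the DNA in the nucleus to replicate it and keep the life cycle going. Eukaryotic DNA is divided into several linear bundles called chromosomes, which are separated by the microtubule spindle during nuclear division. Understanding the hereditary information hidden in DNA is key to understanding many evolutionary and disease-related processes. Sequencing is the process of decoding the order of DNA nucleotides, and it is mainly used to reveal the genetic information carried by a particular DNA segment, a complete genome, or even a complex microbiome. DNA sequencing lets researchers determine the location and function of genes and regulatory elements in DNA molecules and genomes, and reveals genetic features such as open reading frames (ORFs) or CpG islands, which indicate promoter regions. Evolutionary analysis, in which homologous DNA sequences from different organisms are compared, is another very common application. DNA sequencing can also be used to study the association between mutations and diseases, and sometimes disease resistance, which is regarded as one of its most useful applications.
18 |
19 | A very common example is sickle cell disease, a group of blood disorders caused by an abnormality in hemoglobin, the oxygen-carrying protein of red blood cells. It leads to serious health problems, including pain, anemia, swelling of the hands and feet, bacterial infections, and stroke. Sickle cell disease is caused by inheriting two abnormal copies of the β-globin gene (HBB), one from each parent; this gene encodes a subunit of hemoglobin. The defect is a single-nucleotide mutation in which a GAG codon of the β-globin gene becomes GTG. As a result, the amino acid glutamate is replaced by valine at position 6 (E6V substitution), which causes the disease described above. Unfortunately, such a "simple" association between a single-nucleotide mutation and a disease cannot always be found, because most diseases arise from complex regulatory processes.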
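
The substitution described above can be checked with a few lines of Python. This is only an illustrative sketch: the two-entry codon table and the helper `point_mutation` are assumptions made for this example and are not part of the tutorial's code.

```python
# Minimal codon table covering only the two codons discussed in the text.
CODON_TO_AA = {"GAG": "Glu", "GTG": "Val"}


def point_mutation(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at `position` (0-based) replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"
# The sickle-cell mutation swaps the middle base A for T.
mutant_codon = point_mutation(normal_codon, 1, "T")

print(normal_codon, "->", mutant_codon)
print(CODON_TO_AA[normal_codon], "->", CODON_TO_AA[mutant_codon])
```

A single base change in the middle of the codon is enough to swap glutamate for valine, which is why this mutation is a standard example of a single-nucleotide variant with a direct effect on the protein.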
20 |
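21 | ## 2. A brief history of sequencing
22 |
23 | ### 2.1 First-generation sequencing
24 |
25 | Although DNA was first isolated by Friedrich Miescher in 1869, it took the scientific community more than 100 years to develop high-throughput sequencing technologies.
26 | In 1953, Watson, Crick, and Franklin discovered the structure of DNA, and in 1965 Robert Holley sequenced the first tRNA.
27 | Seven years later, in 1972, Walter Fiers was the first to sequence a complete gene (the coat protein of bacteriophage MS2). He used RNases to digest the viral RNA, isolated the oligonucleotides, and finally separated them by electrophoresis and chromatography.
28 | In parallel, Frederick Sanger developed a DNA sequencing method based on radiolabeled, partially digested fragments, known as the "chain termination method" and more commonly as "Sanger sequencing". Although Sanger sequencing is still widely used today, it has drawbacks, including a lack of automation and long run times.
29 | In 1987, Leroy Hood and Michael Hunkapiller developed the ABI 370, an instrument that automates the Sanger sequencing process. Its most important innovation was the automated labeling of DNA fragments with fluorescent dyes instead of radioactive molecules. This change not only made the method safer to perform but also allowed the resulting data to be analyzed by computer.
30 |
31 | Strengths:
32 |
33 | - Sanger sequencing is simple and affordable.
34 | - When done correctly, the error rate is very low (<0.001%).
35 |
36 | Limitations:
37 |
38 | - Sanger sequencing can only sequence short DNA fragments of about 300 to 1000 base pairs (bp).
39 | - The quality of a Sanger read is usually poor in the first 15 to 40 bases, because this is where the primer binds.
40 | - Read quality degrades after 700 to 900 bases.
41 | - If the sequenced DNA fragment was cloned, some of the cloning vector sequence may appear in the final read.
42 | - The cost per sequenced base is higher for Sanger sequencing than for second- or third-generation sequencing.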
43 |
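44 | ### 2.2 Second-generation sequencing
45 |
46 | Nine years later, in 1996, Mostafa Ronaghi, Mathias Uhlén, and Pål Nyrén introduced a new DNA sequencing technique called pyrosequencing, which opened the era of second-generation sequencing.
47 | Second-generation sequencing, also known as next-generation sequencing (NGS), was made possible mainly by further automation in the laboratory, the use of computers, and the miniaturization of reactions.
48 | Pyrosequencing measures the light emitted when pyrophosphate is released during synthesis.
49 | This process is also commonly called "sequencing by synthesis".
50 | Two years later, Shankar Balasubramanian and David Klenerman developed and adapted sequencing by synthesis at the company Solexa, in a method that uses fluorescent dyes. Solexa's technology became the basis of Illumina sequencers, which dominate the market.
51 | The Roche 454 sequencer, developed in 2005, was the first to fully automate the pyrosequencing process in a single machine.
52 | Many other platforms followed, such as "sequencing by ligation" on the SOLiD system (2007) and Life Technologies' Ion Torrent (2011), which uses sequencing by synthesis and detects the hydrogen ions released as new DNA is synthesized.
53 |
54 | Strengths:
55 |
56 | - Second-generation sequencing is usually the cheapest option in terms of the chemicals required.
57 | - Sparse material can be used as input.
58 | - High sensitivity, allowing detection of low-frequency variants and comprehensive genome coverage.
59 | - High capacity, with sample multiplexing.
60 | - Thousands of genes can be sequenced at the same time.
61 |
62 | Limitations:
63 |
64 | - Sequencers are expensive and often have to be shared with colleagues.
65 | - Second-generation sequencers are large, stationary machines that are unsuitable for fieldwork.
66 | - Second-generation sequencing typically yields many short reads, which are hard to use for assembling novel genomes.
69 | - The quality of the results depends on the reference genome.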
70 |
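71 | ### 2.3 Third-generation sequencing
72 |
73 | Third-generation sequencing, which is now sometimes also grouped under next-generation sequencing, brought two innovations to the market.
74 | The first is long-read sequencing: obtaining nucleotide fragments longer than those produced by the usual Illumina short-read sequencers (about 75 to 300 base pairs, depending on the instrument). This matters most when assembling novel genomes for which no reference genome is available. The second is real-time sequencing. Combined with portable sequencers, which are small and need no further complex equipment for the chemistry, sequencing is now "field-ready": samples can be sequenced even far from laboratory facilities.
75 |
76 | Pacific Biosciences (PacBio) introduced zero-mode waveguide (ZMW) sequencing in 2010. It uses so-called nanoholes, each containing a single DNA polymerase. This allows the incorporation of any single nucleotide to be observed directly by a detector below the nanohole. Each type of nucleotide is labeled with a specific fluorescent dye, which emits a fluorescent signal during incorporation that is then recorded as the sequence readout. Reads from PacBio sequencers are typically 8 to 15 kilobases (kb) long and can reach 70 kb.
77 |
78 | Oxford Nanopore Technologies introduced the GridION in 2012. The GridION and its successors, the MinION and the Flongle, are portable DNA and RNA sequencers that can produce reads longer than 2 Mb. Notably, such a sequencing device can even fit in the palm of a hand. Oxford Nanopore sequencers identify the nucleotide sequence by observing changes in electrical current as nucleic acids pass through a protein nanopore.
79 |
80 | Strengths:
81 |
82 | - Long reads make it possible to assemble large novel genomes.
83 | - Sequencers are portable and suitable for fieldwork.
84 | - Epigenetic modifications of DNA and RNA sequences can be detected directly.
85 | - Third-generation sequencers are fast.
86 |
87 | Limitations:
88 |
89 | - Some third-generation sequencers have higher error rates than second-generation sequencers.
90 | - Reagents are usually more expensive than for second-generation sequencing.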
91 |
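92 | ## 3. Overview of the NGS workflow
93 |
94 | Although many NGS technologies exist, the general steps for sequencing DNA (and therefore reverse-transcribed RNA) are largely the same. The main differences lie in the chemistry of each sequencing technology.
95 |
96 | 1. **Sample and library preparation**: First, a so-called library is prepared by fragmenting the DNA sample and ligating the fragments to adapter molecules. The adapters take part in hybridizing the library fragments to the matrix and provide a priming site.
97 |
98 | 2. **Amplification and sequencing**: In the second step, the library is converted into single-stranded molecules. An amplification step (for example, polymerase chain reaction) creates clusters of DNA molecules. During a single sequencing run, every cluster undergoes its own reaction.
99 |
100 | 3. **Data output and analysis**: The output of a sequencing experiment depends on the sequencing technology and chemistry. Some sequencers produce fluorescence signals, which are stored in specific output files, while others produce electrical signals, which are stored in corresponding file formats. The amount of raw data is usually very large and requires complex, computationally intensive processing. This is discussed further in the section on raw data processing.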
101 |
102 |
103 |
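104 | ## 4. RNA sequencing
105 |
106 | So far we have only covered DNA sequencing. However, knowing only an organism's DNA sequence and the location of its regulatory elements tells us very little about the dynamic, real-time workings of a cell.
107 | For example, one gene can encode several proteins by combining different splice sites and exons of the same mRNA precursor. Such alternative splicing events occur naturally and frequently in eukaryotes; a variant can nevertheless lead to non-functional enzymes and induce disease states. This is where RNA sequencing (RNA-Seq) comes in.
108 | RNA-Seq largely follows the DNA sequencing protocols but includes a reverse transcription step, in which complementary DNA (cDNA) is synthesized from the RNA template. Sequencing RNA gives scientists a snapshot of a cell, tissue, or organism at the time of sequencing, in the form of gene expression.
109 | This information can be used to detect changes across disease states, in response to treatment, under different environmental conditions, between genotypes, and in other experimental designs.
110 | Modern RNA sequencing samples transcripts without bias, unlike, for example, microarray-based assays or {term}`RT-qPCR`, which require probes designed specifically for the regions of interest.
111 | The resulting gene expression profiles can also reveal gene isoforms, gene fusions, single-nucleotide variants, and many other interesting features.
112 | Modern RNA sequencing is not limited by prior knowledge and captures both known and novel features, producing rich datasets suitable for exploratory data analysis.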
113 |
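114 | ## 5. Single-cell RNA sequencing
115 |
116 | ### 5.1 Overview
117 |
118 | RNA can be sequenced in two main ways: by sequencing the pooled RNA from the cells of interest (bulk sequencing), or by sequencing the transcriptome of each cell separately (single-cell sequencing). Pooling the RNA of all cells is usually cheaper and easier than the experimentally complex single-cell approach. Bulk RNA-Seq yields an expression profile averaged over cells, which is usually easier to analyze but hides some of the complexity, such as the heterogeneity of cellular expression profiles, that may help answer the question of interest.
119 |
120 | Some drugs or perturbations may affect only specific cell types or the interactions between cell types. In oncology, for example, rare drug-resistant tumor cells can cause relapse, and they are hard to identify with simple bulk RNA-Seq even in cultured cells.
121 |
122 | Uncovering such relationships requires studying gene expression at the single-cell level. Single-cell RNA-Seq (scRNA-Seq) comes with caveats, however. First, single-cell experiments are usually more expensive and harder to carry out properly. Second, because of the higher resolution, downstream analysis becomes more complex and it is easier to draw wrong conclusions.
123 |
124 | In general, a single-cell experiment follows the same steps as a bulk RNA-Seq experiment (see above), with some adaptations. Like bulk sequencing, single-cell sequencing requires lysis, reverse transcription, amplification, and finally sequencing.
125 | In addition, single-cell sequencing requires cells to be isolated and physically separated into smaller reaction chambers, or labeled in some other way, so that the resulting transcriptomes can later be mapped back to their cells of origin.
126 | These are therefore also the steps where most single-cell sequencing methods differ: single-cell isolation, transcript amplification, and, depending on the sequencing machine, sequencing. Before explaining how the different sequencing methods work, we discuss transcript quantification in more detail below.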
127 |
128 |
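129 | ### 5.2 Transcript quantification
130 |
131 | Transcript quantification is the process of counting sequenced transcripts against the genes they come from. These counts end up in a count table. The computation is described in more detail in the next chapter. There are two main approaches to transcript quantification: full-length and tag-based.
132 |
133 | Full-length protocols try to cover the whole transcript evenly with sequencing reads, whereas tag-based protocols capture only the 5' or 3' end of the transcript. The quantification approach strongly influences which genes are captured, so analysts must know which one was used. Full-length sequencing is only available for plate-based protocols (see below), and library preparation is similar to bulk RNA sequencing. Even coverage of the transcript is not always achieved, so specific regions of the gene body can still be over- or under-represented. The main advantage of full-length protocols is that they can detect splice variants.
134 |
135 | Tag-based protocols sequence only the 3' or 5' end of the transcript. The cost is that reads do not (necessarily) cover the full gene length, which makes it hard to align reads unambiguously to one transcript and to distinguish between isoforms {cite}`Archer2016`. In return, tag-based protocols allow the use of unique molecular identifiers (UMIs), which are very useful for resolving biases introduced during transcript amplification.
136 |
137 | Transcript amplification is a critical step in any RNA sequencing run, ensuring that transcripts are abundant enough for quality control and sequencing. In this step, copies of identical fragments of the original molecules are usually made by polymerase chain reaction (PCR). Because copies and original molecules are indistinguishable, determining the number of original molecules in a sample becomes difficult. Using UMIs is a common solution for quantifying the original, non-duplicated molecules.
138 |
139 | UMIs act as molecular barcodes and are sometimes also called random barcodes. They consist of short random nucleotide sequences that are added to every molecule in the sample as a unique tag. UMIs must be added during library generation, before the amplification step. Accurate identification of PCR duplicates matters for downstream analysis, in order to rule out or understand amplification bias {cite}`Aird2011`.
140 |
141 | Amplification bias refers to RNA/cDNA sequences that are amplified preferentially; they are sequenced more often and therefore receive higher counts. This can harm any gene expression analysis, because genes with little activity may suddenly appear to be highly expressed. It applies especially to sequences amplified late in the PCR step, when the error rate may already be higher than in earlier PCR cycles.
142 |
143 | Although such sequences can be detected and removed computationally by discarding reads with identical alignment coordinates, it is generally advisable to design experiments with UMIs whenever possible. UMIs also allow gene counts to be normalized without loss of accuracy {cite}`Kivioja2012`.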
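
To make the idea of UMI-based counting concrete, here is a minimal sketch in plain Python. It assumes reads have already been assigned to a cell barcode, a UMI, and a gene; the read tuples and the function name `count_molecules` are made up for illustration, and real pipelines additionally correct sequencing errors within UMIs, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical aligned reads: (cell barcode, UMI, gene). All values are made up.
reads = [
    ("CELL_A", "AACG", "GeneX"),
    ("CELL_A", "AACG", "GeneX"),  # PCR duplicate of the read above
    ("CELL_A", "TTGC", "GeneX"),
    ("CELL_A", "AACG", "GeneY"),  # same UMI, different gene: separate molecule
    ("CELL_B", "AACG", "GeneX"),  # same UMI, different cell: separate molecule
]


def count_molecules(reads):
    """Count distinct UMIs per (cell, gene), collapsing PCR duplicates."""
    umis = defaultdict(set)
    for cell, umi, gene in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(value) for key, value in umis.items()}


counts = count_molecules(reads)
for (cell, gene), n in sorted(counts.items()):
    print(cell, gene, n)
```

Five reads collapse to four molecules here, because the two identical reads for `GeneX` in `CELL_A` share a UMI and are treated as copies of one original molecule.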
144 |
145 |
146 |
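147 | ### 5.3 Single-cell sequencing protocols
148 |
149 | There are currently three types of single-cell sequencing protocols, grouped mainly by how they isolate cells: microfluidic device-based strategies, in which cells are encapsulated in hydrogel droplets; well plate-based protocols, in which cells are physically separated into wells; and the commercial Fluidigm C1 microfluidic chip solution, which loads and separates cells into small reaction chambers. The three approaches differ in transcript recovery, in the number of cells sequenced, and in many other respects.
150 | In the following subsections we briefly discuss how they work, their strengths and weaknesses, and the protocol-specific biases that data analysts should be aware of.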
151 |
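152 | #### 5.3.1 Microfluidic device-based protocols
153 |
154 | Microfluidic device-based single-cell strategies trap cells in hydrogel droplets, which separates them into single-cell reaction chambers. The most widely used protocols are inDrop{cite}`Klein2015`, Drop-seq{cite}`exp:Macosko2015`, and the commercial 10x Genomics Chromium{cite}`exp:Zheng2017`, which can generate thousands of such droplets per second. This massively parallel process produces a very large number of droplets at relatively low cost. Although the three protocols differ in detail, the nanoliter-sized droplets are always designed to capture a bead and a cell together.
155 |
156 | Encapsulation uses specialized microbeads that carry primers containing a PCR handle, a cell barcode, a 4-8 bp unique molecular identifier (UMI, described above), and a poly-T tail. When the cell is lysed, its mRNA is released immediately and captured by the barcoded oligonucleotides attached to the bead. The droplets are then collected and broken to release the single-cell transcriptomes attached to microparticles (STAMPs). Reverse transcription and PCR follow, to capture and amplify the transcripts.
157 |
158 | Finally, tagmentation cuts the transcripts at random and attaches sequencing adapters. This yields a sequencing library that is ready for sequencing, as described above. In microfluidic device-based protocols, only about 10% of a cell's transcripts are recovered{cite}`Islam2014`. Notably, this low coverage is sufficient to identify cell types robustly.
159 |
160 | All three microfluidic device-based methods introduce specific biases. The bead material differs between protocols. Drop-seq uses beads made of brittle resin, so beads are encapsulated following a Poisson distribution, whereas inDrop and 10x Genomics beads are deformable, giving bead occupancies above 80%{cite}`Zhang2019`.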
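
The effect of Poisson loading can be illustrated with a short calculation. This is a rough sketch only: the mean of 0.1 beads per droplet is an assumed value chosen for illustration, not a figure from the cited comparison.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events for a Poisson mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam = 0.1  # assumed mean number of beads per droplet (illustrative value only)

p_empty = poisson_pmf(0, lam)          # droplets without a bead
p_single = poisson_pmf(1, lam)         # droplets with exactly one bead
p_multi = 1.0 - p_empty - p_single     # droplets with two or more beads

print(f"no bead: {p_empty:.3f}, one bead: {p_single:.3f}, several: {p_multi:.3f}")
```

Under this assumption most droplets contain no bead at all, which is the price of keeping droplets with several beads rare. Deformable beads avoid this trade-off because they can be packed and released one at a time.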
161 |
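162 | Capture efficiency may also be affected by the surface-tethered primers used in Drop-seq. inDrop releases its primers by photocleavage, while 10x Genomics dissolves the beads. This difference also determines where reverse transcription takes place. In Drop-seq, reverse transcription happens after the beads are released from the droplets, whereas in the inDrop and 10x Genomics protocols it happens inside the droplets{cite}`Zhang2019`.
163 |
164 | A 2019 comparison by Zhang et al. found that 10x Genomics outperformed inDrop and Drop-seq in bead quality. The cell barcodes of the first two systems contained obvious mismatches. Moreover, the proportion of reads from valid barcodes was 75% for 10x Genomics, compared with 25% for inDrop and 30% for Drop-seq{cite}`Zhang2019`.
165 |
166 | 10x Genomics showed a similar advantage in sensitivity. In their comparison, 10x Genomics captured on average about 17,000 transcripts from about 3,000 genes per cell, Drop-seq about 8,000 transcripts from about 2,500 genes, and inDrop about 2,700 transcripts from about 1,250 genes{cite}`Zhang2019`.
167 |
168 | The resulting data show clear protocol-specific biases. 10x Genomics tends to capture and amplify shorter genes and genes with higher GC content, whereas Drop-seq favors genes with lower GC content.
169 |
170 | Although 10x Genomics outperformed the other protocols in every respect, its cost per cell is also about twice as high. Apart from the beads, Drop-seq is open source, and the protocol is easier to adapt if needed. inDrop is fully open source, and even the beads can be manufactured and modified in the lab. inDrop is therefore the most flexible of the three protocols.
171 |
172 | Strengths:
173 |
174 | - Large numbers of cells can be sequenced cost-effectively, to identify the overall composition of a tissue and characterize rare cell types.
175 | - UMIs can be incorporated.
176 |
177 | Limitations:
178 |
179 | - Transcript detection rate is low compared with other methods.
180 | - Only the 3' end is captured rather than the whole transcript, because the cell barcode and PCR handle are added only to the end of the transcript.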
181 |
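182 | #### 5.3.2 Well plate-based protocols
183 |
184 | Well plate-based protocols physically separate cells into the individual wells of a microwell plate. The first step is to sort the cells, either by fluorescence-activated cell sorting (FACS), for example on specific cell-surface markers, or by micropipetting. The selected cells are then placed in individual wells containing lysis buffer, and reverse transcription follows. This allows several hundred cells to be analyzed in one experiment, with 5,000 to 10,000 genes captured per cell.
185 |
186 | Well plate-based sequencing protocols include SMART-seq2, MARS-seq, QUARTZ-seq, and SCRB-seq, among others. In general, these protocols differ in their multiplexing capability. MARS-seq, for example, allows three levels of barcodes (molecule, cell, and plate) for robust multiplexing. SMART-seq2, by contrast, does not allow early multiplexing, which limits the number of cells. In a systematic comparison in 2020, Mereu et al. found that QUARTZ-seq2 captures more genes per cell than SMART-seq2, MARS-seq, and 10x Genomics Chromium.
187 |
188 | Strengths:
189 |
190 | - High transcript detection rate; more of the transcripts in each cell are captured.
191 | - Several barcodes can be added at the cell and plate level to increase multiplexing capacity.
192 |
193 | Limitations:
194 |
195 | - Relatively few cells are sequenced.
196 | - Higher cost.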
197 |
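198 | #### 5.3.3 Fluidigm C1
199 |
200 | Fluidigm C1 is a commercial solution for single-cell sequencing. It uses an integrated microfluidic chip to load single cells into separate reaction chambers, and it bundles the reverse transcription and amplification steps with the instrument.
201 | Compared with microfluidic device-based protocols, Fluidigm C1 offers higher transcriptome coverage per cell and a higher transcript detection rate. However, it costs more and requires a dedicated instrument and chips.
202 |
203 | Strengths:
204 |
205 | - High transcript detection rate.
206 | - High transcriptome coverage.
207 |
208 | Limitations:
209 |
210 | - High cost.
211 | - Requires a dedicated instrument and chips.
212 |
213 | This concludes the brief overview of the commonly used single-cell sequencing protocols. Each has its own strengths and limitations, and the right choice depends on the research goal, the budget, and the resources of the laboratory.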
214 |
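215 | ### 5.4 Data analysis
216 |
217 | Analyzing single-cell RNA sequencing data is a complex process that involves several steps and algorithms. A typical workflow looks like this:
218 |
219 | 1. **Quality control**: First, check the quality of the sequencing data, including the quality scores of the reads, remove low-quality reads, and look for possible sequencing biases.
220 |
221 | 2. **Preprocessing**: Preprocess the raw data by removing low-quality cells, removing ambient RNA contamination, correcting sequencing biases, normalizing transcript counts, and so on.
222 |
223 | 3. **Clustering**: Use a clustering algorithm to group the cells into clusters, each representing a cell type or subtype.
224 |
225 | 4. **Subtype analysis**: Within each cell type or cluster, analyze subtypes and states further, including identifying signature genes, differential expression analysis, and finding subtype markers.
226 |
227 | 5. **Cell state transition analysis**: Order cells along a time axis to understand cell state transitions and developmental processes.
228 |
229 | 6. **Cell-cell interaction analysis**: Analyze the interactions between cells, the relationships between clusters, and the shifts between cell states.
230 |
231 | 7. **Functional annotation and pathway analysis**: Annotate the genes of the different cell types or subtypes functionally and run pathway analysis to understand their roles and relationships in biological processes.
232 |
233 | This is only the general workflow. An actual analysis may differ depending on the research question and the characteristics of the data. Each step offers several algorithms and tools to choose from, and researchers should pick suitable methods based on their needs and their data.
234 |
235 | This chapter is only an overview of the general principles and workflows of NGS and single-cell sequencing. Both are broad, fast-moving fields in which new techniques and methods keep being introduced and refined. For more detailed and specific information, consult the relevant literature and specialist resources.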
--------------------------------------------------------------------------------
/docs/Introduction/1-4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "attachments": {},
5 | "cell_type": "markdown",
6 | "metadata": {},
7 | "source": [
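8 | "# 1-4 Analysis frameworks and tools\n",
9 | "\n",
10 | "In the previous analysis we obtained the count matrix of our samples (cells x genes). We now need to mine it for information about cells and genes. The data are usually large, and plain Pandas or NumPy structures cannot hold information along several dimensions at once, which places new demands on the analysis tools. There are currently three main toolsets in the single-cell field:\n",
11 | "\n",
12 | "- Bioconductor: a bioinformatics ecosystem implemented in R\n",
13 | "- Seurat: a single-cell analysis ecosystem implemented in R\n",
14 | "- scverse: a single-cell analysis ecosystem implemented in Python\n",
15 | "\n",
16 | "Bioconductor is a project that develops, supports, and shares free open-source software, with a focus on rigorous and reproducible analysis of data from many different biological assays, including single-cell assays. A homogeneous developer and user experience and extensive documentation with user-friendly vignettes are Bioconductor's greatest strengths. Seurat is a well-regarded R package designed specifically for single-cell data. It provides tools for every step of the analysis, including multimodal and spatial data, and is known for its well-written vignettes and large user base. However, both R options run into scalability problems on very large datasets (more than 500,000 cells), which prompted the Python community to develop the scverse ecosystem. scverse is an organization and ecosystem dedicated to foundational tools in the life sciences, with an initial focus on single-cell data. Scalability, extensibility, and strong interoperability with existing Python data and machine-learning tools are among its advantages.\n",
17 | "\n",
18 | "In this tutorial we also introduce the omicverse framework, which integrates a large number of RNA-seq processing algorithms and harmonizes the data formats they use, improving compatibility and user-friendliness. omicverse can also draw publication-quality figures of the kind seen in Cell, Nature, and Science. With omicverse we can take on more exploratory tasks in the single-cell field.\n",
19 | "\n",
20 | "In this chapter we introduce the data structures of the scverse ecosystem, AnnData and MuData. Both are basic data structures for single-cell analysis in Python.\n",
21 | "\n",
22 | "## 1. Storing data with AnnData\n",
23 | "\n",
24 | "Earlier we learned the Pandas data format and its operations. AnnData builds on that format and adds further layers of meaning. In Pandas, data are indexed by `index` and `columns`, but any information about the index and columns themselves has to be stored in separate files. For example, with a count matrix of shape (cells x genes), cells form the index and genes form the columns. For cells, we care about the cell type of each cell and the number of genes it expresses; for genes, we care about the gene ID, gene name, chromosomal position (chr:10000-20000), and so on. We therefore need a data format that holds information along more dimensions.\n",
25 | "\n",
26 | "This is the AnnData format. As the figure shows, it is built around the count matrix and extends outward into two annotation layers, obs and var, each of which can hold further data. The count matrix can also be stacked in layers, so that we can store different versions of it, such as normalized and raw counts. Finally, uns can hold any other data you want to keep.\n",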
27 | "\n",
28 | "[Figure: schematic of the AnnData data structure]\n",
29 | "\n",
30 | "\n",
31 | "\n",
32 | "### 1.1 Installation\n",
33 | "\n",
34 | "AnnData is available on PyPI and Conda and can be installed with either of the following:\n"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": null,
40 | "metadata": {
41 | "vscode": {
42 | "languageId": "python"
43 | }
44 | },
45 | "outputs": [],
46 | "source": [
47 | "%pip install anndata\n",
48 | "%conda install -c conda-forge anndata"
49 | ]
50 | },
51 | {
52 | "attachments": {},
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
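56 | "### 1.2 Initializing an AnnData object\n",
57 | "\n",
58 | "This section is inspired by AnnData's \"Getting started\" tutorial: https://anndata-tutorials.readthedocs.io/en/latest/getting-started.html\n",
59 | "\n",
60 | "Let us create a simple AnnData object with a sparse count matrix, which could represent gene expression counts, for example. First, we import the required packages."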
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 1,
66 | "metadata": {
67 | "vscode": {
68 | "languageId": "python"
69 | }
70 | },
71 | "outputs": [],
72 | "source": [
73 | "import numpy as np\n",
74 | "import pandas as pd\n",
75 | "import anndata as ad\n",
76 | "from scipy.sparse import csr_matrix"
77 | ]
78 | },
79 | {
80 | "attachments": {},
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "Next, we initialize the AnnData object with random Poisson-distributed data. By convention, an AnnData instance is named adata."
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 2,
90 | "metadata": {
91 | "vscode": {
92 | "languageId": "python"
93 | }
94 | },
95 | "outputs": [
96 | {
97 | "data": {
98 | "text/plain": [
99 | "AnnData object with n_obs × n_vars = 100 × 2000"
100 | ]
101 | },
102 | "execution_count": 2,
103 | "metadata": {},
104 | "output_type": "execute_result"
105 | }
106 | ],
107 | "source": [
108 | "counts = csr_matrix(np.random.poisson(1, size=(100, 2000)), dtype=np.float32)\n",
109 | "adata = ad.AnnData(counts)\n",
110 | "adata"
111 | ]
112 | },
113 | {
114 | "attachments": {},
115 | "cell_type": "markdown",
116 | "metadata": {},
117 | "source": [
118 | "The resulting AnnData object has 100 obs and 2000 var, corresponding to 100 cells and 2000 genes. The count matrix is accessible as adata.X."
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 3,
124 | "metadata": {
125 | "vscode": {
126 | "languageId": "python"
127 | }
128 | },
129 | "outputs": [
130 | {
131 | "data": {
132 | "text/plain": [
133 | "<100x2000 sparse matrix of type '<class 'numpy.float32'>'\n",
134 | "\twith 126647 stored elements in Compressed Sparse Row format>"
135 | ]
136 | },
137 | "execution_count": 3,
138 | "metadata": {},
139 | "output_type": "execute_result"
140 | }
141 | ],
142 | "source": [
143 | "adata.X"
144 | ]
145 | },
146 | {
147 | "attachments": {},
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "We can index obs and var through .obs_names and .var_names. You can also use obs.index and var.index; the two are equivalent."
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 4,
157 | "metadata": {
158 | "vscode": {
159 | "languageId": "python"
160 | }
161 | },
162 | "outputs": [
163 | {
164 | "name": "stdout",
165 | "output_type": "stream",
166 | "text": [
167 | "Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',\n",
168 | " 'Cell_7', 'Cell_8', 'Cell_9'],\n",
169 | " dtype='object')\n"
170 | ]
171 | }
172 | ],
173 | "source": [
174 | "adata.obs_names = [f\"Cell_{i:d}\" for i in range(adata.n_obs)]\n",
175 | "adata.var_names = [f\"Gene_{i:d}\" for i in range(adata.n_vars)]\n",
176 | "print(adata.obs_names[:10])"
177 | ]
178 | },
179 | {
180 | "attachments": {},
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
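184 | "### 1.3 Adding aligned metadata\n",
185 | "\n",
186 | "#### 1.3.1 obs and var\n",
187 | "\n",
188 | "Metadata is a concept used widely in data analysis. When we describe a patient, for example, we record age, height, weight, sex, and so on; this kind of information is collectively called metadata.\n",
189 | "\n",
190 | "Our AnnData object is now initialized and contains the count matrix together with gene IDs and cell IDs. Next we add more specific information about the cells, such as cell type. Cell-level information goes into `.obs`, and gene-level information goes into `.var`."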
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 5,
196 | "metadata": {
197 | "vscode": {
198 | "languageId": "python"
199 | }
200 | },
201 | "outputs": [
202 | {
203 | "data": {
204 | "text/html": [
205 | "<div>\n",
206 | "<table border=\"1\" class=\"dataframe\">\n",
207 | "  <thead>\n",
208 | "    <tr><th></th><th>cell_type</th></tr>\n",
209 | "  </thead>\n",
210 | "  <tbody>\n",
211 | "    <tr><th>Cell_0</th><td>B</td></tr>\n",
212 | "    <tr><th>Cell_1</th><td>Monocyte</td></tr>\n",
213 | "    <tr><th>Cell_2</th><td>Monocyte</td></tr>\n",
214 | "    <tr><th>Cell_3</th><td>T</td></tr>\n",
215 | "    <tr><th>Cell_4</th><td>T</td></tr>\n",
216 | "    <tr><th>...</th><td>...</td></tr>\n",
217 | "    <tr><th>Cell_95</th><td>Monocyte</td></tr>\n",
218 | "    <tr><th>Cell_96</th><td>B</td></tr>\n",
219 | "    <tr><th>Cell_97</th><td>B</td></tr>\n",
220 | "    <tr><th>Cell_98</th><td>B</td></tr>\n",
221 | "    <tr><th>Cell_99</th><td>Monocyte</td></tr>\n",
222 | "  </tbody>\n",
223 | "</table>\n",
224 | "<p>100 rows × 1 columns</p>\n",
225 | "</div>"
275 | ],
276 | "text/plain": [
277 | " cell_type\n",
278 | "Cell_0 B\n",
279 | "Cell_1 Monocyte\n",
280 | "Cell_2 Monocyte\n",
281 | "Cell_3 T\n",
282 | "Cell_4 T\n",
283 | "... ...\n",
284 | "Cell_95 Monocyte\n",
285 | "Cell_96 B\n",
286 | "Cell_97 B\n",
287 | "Cell_98 B\n",
288 | "Cell_99 Monocyte\n",
289 | "\n",
290 | "[100 rows x 1 columns]"
291 | ]
292 | },
293 | "execution_count": 5,
294 | "metadata": {},
295 | "output_type": "execute_result"
296 | }
297 | ],
298 | "source": [
299 | "# Randomly generate a cell type for every cell and store the result in ct\n",
300 | "ct = np.random.choice([\"B\", \"T\", \"Monocyte\"], size=(adata.n_obs,))\n",
301 | "# Assign the generated cell types to a new column of adata.obs\n",
302 | "adata.obs[\"cell_type\"] = pd.Categorical(ct) # Categoricals are preferred for efficiency\n",
303 | "adata.obs"
304 | ]
305 | },
306 | {
307 | "attachments": {},
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "If we inspect the AnnData object again, we see that it has been updated as well: obs now contains cell_type."
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 6,
317 | "metadata": {
318 | "vscode": {
319 | "languageId": "python"
320 | }
321 | },
322 | "outputs": [
323 | {
324 | "data": {
325 | "text/plain": [
326 | "AnnData object with n_obs × n_vars = 100 × 2000\n",
327 | " obs: 'cell_type'"
328 | ]
329 | },
330 | "execution_count": 6,
331 | "metadata": {},
332 | "output_type": "execute_result"
333 | }
334 | ],
335 | "source": [
336 | "adata"
337 | ]
338 | },
339 | {
340 | "attachments": {},
341 | "cell_type": "markdown",
342 | "metadata": {},
343 | "source": [
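344 | "#### 1.3.2 Subsetting\n",
345 | "\n",
346 | "Sometimes we only want to study one particular kind of cell, which can be seen as a subset of all cells. Slicing works much like slicing a `pandas` DataFrame.\n",
347 | "\n",
348 | "Here we select all B cells."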
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": 7,
354 | "metadata": {
355 | "vscode": {
356 | "languageId": "python"
357 | }
358 | },
359 | "outputs": [
360 | {
361 | "data": {
362 | "text/plain": [
363 | "View of AnnData object with n_obs × n_vars = 32 × 2000\n",
364 | " obs: 'cell_type'"
365 | ]
366 | },
367 | "execution_count": 7,
368 | "metadata": {},
369 | "output_type": "execute_result"
370 | }
371 | ],
372 | "source": [
373 | "bdata = adata[adata.obs.cell_type == \"B\"]\n",
374 | "bdata"
375 | ]
376 | },
377 | {
378 | "attachments": {},
379 | "cell_type": "markdown",
380 | "metadata": {},
381 | "source": [
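382 | "### 1.4 Observation/variable-level matrices\n",
383 | "\n",
384 | "In `.obs` or `.var` we can store metadata such as cell type, but that metadata is one-dimensional. What if a feature is multidimensional, such as an embedding vector for each cell? In that case we can store the data in `.obsm` or `.varm`, which hold multidimensional metadata.\n",
385 | "\n",
386 | "Let us start with a randomly generated matrix that we can interpret as a UMAP embedding of the cells, and a second random matrix that stands for some multidimensional metadata about the genes."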
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": 8,
392 | "metadata": {
393 | "vscode": {
394 | "languageId": "python"
395 | }
396 | },
397 | "outputs": [
398 | {
399 | "data": {
400 | "text/plain": [
401 | "AxisArrays with keys: X_umap"
402 | ]
403 | },
404 | "execution_count": 8,
405 | "metadata": {},
406 | "output_type": "execute_result"
407 | }
408 | ],
409 | "source": [
410 | "adata.obsm[\"X_umap\"] = np.random.normal(0, 1, size=(adata.n_obs, 2))\n",
411 | "adata.varm[\"gene_stuff\"] = np.random.normal(0, 1, size=(adata.n_vars, 5))\n",
412 | "adata.obsm"
413 | ]
414 | },
415 | {
416 | "attachments": {},
417 | "cell_type": "markdown",
418 | "metadata": {},
419 | "source": [
420 | "Again, the AnnData object has been updated."
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": 9,
426 | "metadata": {
427 | "vscode": {
428 | "languageId": "python"
429 | }
430 | },
431 | "outputs": [
432 | {
433 | "data": {
434 | "text/plain": [
435 | "AnnData object with n_obs × n_vars = 100 × 2000\n",
436 | " obs: 'cell_type'\n",
437 | " obsm: 'X_umap'\n",
438 | " varm: 'gene_stuff'"
439 | ]
440 | },
441 | "execution_count": 9,
442 | "metadata": {},
443 | "output_type": "execute_result"
444 | }
445 | ],
446 | "source": [
447 | "adata"
448 | ]
449 | },
450 | {
451 | "attachments": {},
452 | "cell_type": "markdown",
453 | "metadata": {},
454 | "source": [
455 | "Some additional notes on `.obsm/.varm`:\n",
456 | "- Multidimensional metadata can be a `pandas` `DataFrame`, a `numpy` `ndarray`, or a `scipy` sparse matrix\n",
457 | "- This metadata is readily recognized by `scanpy` for plotting"
458 | ]
459 | },
460 | {
461 | "attachments": {},
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
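465 | "### 1.5 Unstructured metadata\n",
466 | "\n",
467 | "As mentioned above, AnnData has .uns, which accepts any unstructured metadata. This can be anything, such as a list or dictionary with general information that is useful for the analysis. Try to use this slot only for data that cannot be stored efficiently in the other slots."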
468 | ]
469 | },
470 | {
471 | "cell_type": "code",
472 | "execution_count": 10,
473 | "metadata": {
474 | "vscode": {
475 | "languageId": "python"
476 | }
477 | },
478 | "outputs": [
479 | {
480 | "data": {
481 | "text/plain": [
482 | "OverloadedDict, wrapping:\n",
483 | "\tOrderedDict([('random', [1, 2, 3])])\n",
484 | "With overloaded keys:\n",
485 | "\t['neighbors']."
486 | ]
487 | },
488 | "execution_count": 10,
489 | "metadata": {},
490 | "output_type": "execute_result"
491 | }
492 | ],
493 | "source": [
494 | "adata.uns[\"random\"] = [1, 2, 3]\n",
495 | "adata.uns"
496 | ]
497 | },
498 | {
499 | "attachments": {},
500 | "cell_type": "markdown",
501 | "metadata": {},
502 | "source": [
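503 | "### 1.6 Layers\n",
504 | "\n",
505 | "Finally, we may have several versions of the core data, for example one normalized and one not. These can be stored in different layers of the AnnData object.\n",
506 | "\n",
507 | "As an example, let us log-transform the raw data and store the result in layers."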
508 | ]
509 | },
510 | {
511 | "cell_type": "code",
512 | "execution_count": 11,
513 | "metadata": {
514 | "vscode": {
515 | "languageId": "python"
516 | }
517 | },
518 | "outputs": [
519 | {
520 | "data": {
521 | "text/plain": [
522 | "AnnData object with n_obs × n_vars = 100 × 2000\n",
523 | " obs: 'cell_type'\n",
524 | " uns: 'random'\n",
525 | " obsm: 'X_umap'\n",
526 | " varm: 'gene_stuff'\n",
527 | " layers: 'log_transformed'"
528 | ]
529 | },
530 | "execution_count": 11,
531 | "metadata": {},
532 | "output_type": "execute_result"
533 | }
534 | ],
535 | "source": [
536 | "adata.layers[\"log_transformed\"] = np.log1p(adata.X)\n",
537 | "adata"
538 | ]
539 | },
540 | {
541 | "attachments": {},
542 | "cell_type": "markdown",
543 | "metadata": {},
544 | "source": [
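545 | "The original matrix X is unchanged and still accessible. We can verify this by comparing X with the new layer log_transformed: the check below returns False because the two matrices differ, which confirms that X was not overwritten."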
546 | ]
547 | },
548 | {
549 | "cell_type": "code",
550 | "execution_count": 12,
551 | "metadata": {
552 | "vscode": {
553 | "languageId": "python"
554 | }
555 | },
556 | "outputs": [
557 | {
558 | "data": {
559 | "text/plain": [
560 | "False"
561 | ]
562 | },
563 | "execution_count": 12,
564 | "metadata": {},
565 | "output_type": "execute_result"
566 | }
567 | ],
568 | "source": [
569 | "(adata.X != adata.layers[\"log_transformed\"]).nnz == 0"
570 | ]
571 | },
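The comparison above can be reproduced with plain NumPy. The following is a minimal sketch, assuming a small dense Poisson count matrix as a stand-in for `adata.X` and an ordinary dictionary as a stand-in for `adata.layers` (both are illustrative, not AnnData objects). It shows that `log1p` leaves zeros at zero and changes every non-zero count, and that the raw counts can be recovered from the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for adata.X: a small dense count matrix (assumption for illustration).
counts = rng.poisson(1.0, size=(5, 4)).astype(np.float32)
# Stand-in for adata.layers: a plain dict keyed by layer name.
layers = {"log_transformed": np.log1p(counts)}

# log1p(0) == 0, so only the non-zero entries differ from the raw counts.
changed = counts != layers["log_transformed"]
n_changed = int(changed.sum())
n_nonzero = int(np.count_nonzero(counts))
print(n_changed == n_nonzero)

# The raw counts are untouched and can also be recovered from the layer.
recovered = np.expm1(layers["log_transformed"])
print(np.allclose(recovered, counts))
```

Keeping the transformed values in a separate slot is what lets the raw counts stay available for methods that need them.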
572 | {
573 | "attachments": {},
574 | "cell_type": "markdown",
575 | "metadata": {},
576 | "source": [
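577 | "### 1.7 Exporting to DataFrames\n",
578 | "\n",
579 | "Sometimes we need to export the count matrix as a pandas DataFrame in order to load it into other tools. AnnData makes this easy: we can export any layer, and if no layer is specified, `.X` is exported."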
580 | ]
581 | },
582 | {
583 | "cell_type": "code",
584 | "execution_count": 14,
585 | "metadata": {
586 | "vscode": {
587 | "languageId": "python"
588 | }
589 | },
590 | "outputs": [
591 | {
592 | "data": {
761 | "text/plain": [
762 | " Gene_0 Gene_1 Gene_2 Gene_3 Gene_4 Gene_5 Gene_6 \\\n",
763 | "Cell_0 0.000000 1.386294 0.693147 0.693147 0.693147 0.000000 0.693147 \n",
764 | "Cell_1 0.000000 0.000000 1.609438 0.693147 0.000000 0.693147 0.693147 \n",
765 | "Cell_2 1.098612 0.693147 0.000000 0.693147 0.693147 1.386294 1.098612 \n",
766 | "Cell_3 0.693147 0.000000 0.000000 0.693147 0.693147 0.000000 0.693147 \n",
767 | "Cell_4 0.693147 0.000000 1.386294 0.693147 1.098612 1.098612 0.000000 \n",
768 | "\n",
769 | " Gene_7 Gene_8 Gene_9 ... Gene_1990 Gene_1991 Gene_1992 \\\n",
770 | "Cell_0 0.000000 0.693147 1.098612 ... 0.000000 1.386294 0.693147 \n",
771 | "Cell_1 0.000000 1.098612 1.098612 ... 0.000000 0.693147 0.693147 \n",
772 | "Cell_2 0.693147 0.000000 1.098612 ... 1.609438 0.000000 1.098612 \n",
773 | "Cell_3 0.693147 0.693147 0.693147 ... 0.693147 0.693147 0.693147 \n",
774 | "Cell_4 0.000000 0.693147 0.693147 ... 1.386294 0.000000 0.000000 \n",
775 | "\n",
776 | " Gene_1993 Gene_1994 Gene_1995 Gene_1996 Gene_1997 Gene_1998 \\\n",
777 | "Cell_0 0.000000 0.693147 1.386294 0.693147 1.098612 0.693147 \n",
778 | "Cell_1 1.609438 0.000000 0.000000 0.693147 0.000000 0.693147 \n",
779 | "Cell_2 0.693147 0.000000 0.000000 1.098612 1.098612 0.693147 \n",
780 | "Cell_3 0.693147 0.693147 0.000000 0.000000 0.000000 0.000000 \n",
781 | "Cell_4 0.000000 1.098612 0.000000 0.693147 0.000000 0.693147 \n",
782 | "\n",
783 | " Gene_1999 \n",
784 | "Cell_0 0.000000 \n",
785 | "Cell_1 0.693147 \n",
786 | "Cell_2 0.000000 \n",
787 | "Cell_3 0.693147 \n",
788 | "Cell_4 0.000000 \n",
789 | "\n",
790 | "[5 rows x 2000 columns]"
791 | ]
792 | },
793 | "execution_count": 14,
794 | "metadata": {},
795 | "output_type": "execute_result"
796 | }
797 | ],
798 | "source": [
799 | "adata.to_df(layer=\"log_transformed\").head()"
800 | ]
801 | },
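Conceptually, the exported DataFrame is the chosen matrix labelled with the observation names as the index and the variable names as the columns. The sketch below builds such a frame by hand with pandas. It is an illustration under that assumption, not the implementation of `to_df`, and the tiny matrix and the names are made up.

```python
import numpy as np
import pandas as pd

# Made-up log-transformed values for 3 cells and 2 genes.
X = np.log1p(np.array([[0.0, 3.0], [1.0, 0.0], [2.0, 1.0]]))
obs_names = [f"Cell_{i}" for i in range(X.shape[0])]
var_names = [f"Gene_{j}" for j in range(X.shape[1])]

# Rows are cells, columns are genes, as in the exported frame above.
df = pd.DataFrame(X, index=obs_names, columns=var_names)
print(df.head())
```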
802 | {
803 | "attachments": {},
804 | "cell_type": "markdown",
805 | "metadata": {},
806 | "source": [
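807 | "### 1.8 Reading and writing AnnData\n",
808 | "\n",
809 | "AnnData objects can be saved to disk in hierarchical array stores such as HDF5 or Zarr, so that the on-disk structure resembles the in-memory one. AnnData comes with its own HDF5-based persistent file format, `h5ad`. String columns with a small number of categories that are not yet categorical are converted to categoricals automatically. We now save the AnnData object in the `h5ad` format."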
810 | ]
811 | },
812 | {
813 | "cell_type": "code",
814 | "execution_count": 18,
815 | "metadata": {
816 | "vscode": {
817 | "languageId": "python"
818 | }
819 | },
820 | "outputs": [],
821 | "source": [
822 | "adata.write(\"../../data/my_results.h5ad\", compression=\"gzip\")"
823 | ]
824 | },
825 | {
826 | "cell_type": "code",
827 | "execution_count": 19,
828 | "metadata": {
829 | "vscode": {
830 | "languageId": "python"
831 | }
832 | },
833 | "outputs": [
834 | {
835 | "name": "stdout",
836 | "output_type": "stream",
837 | "text": [
838 | "X Group\n",
839 | "layers Group\n",
840 | "obs Group\n",
841 | "obsm Group\n",
842 | "obsp Group\n",
843 | "uns Group\n",
844 | "var Group\n",
845 | "varm Group\n",
846 | "varp Group\n"
847 | ]
848 | }
849 | ],
850 | "source": [
851 | "!h5ls '../../data/my_results.h5ad'"
852 | ]
853 | },
854 | {
855 | "attachments": {},
856 | "cell_type": "markdown",
857 | "metadata": {},
858 | "source": [
859 | "...and then read it back."
860 | ]
861 | },
862 | {
863 | "cell_type": "code",
864 | "execution_count": 21,
865 | "metadata": {
866 | "vscode": {
867 | "languageId": "python"
868 | }
869 | },
870 | "outputs": [
871 | {
872 | "data": {
873 | "text/plain": [
874 | "AnnData object with n_obs × n_vars = 100 × 2000\n",
875 | " obs: 'cell_type'\n",
876 | " uns: 'random'\n",
877 | " obsm: 'X_umap'\n",
878 | " varm: 'gene_stuff'\n",
879 | " layers: 'log_transformed'"
880 | ]
881 | },
882 | "execution_count": 21,
883 | "metadata": {},
884 | "output_type": "execute_result"
885 | }
886 | ],
887 | "source": [
888 | "adata_new = ad.read_h5ad(\"../../data/my_results.h5ad\")\n",
889 | "adata_new"
890 | ]
891 | },
892 | {
893 | "attachments": {},
894 | "cell_type": "markdown",
895 | "metadata": {},
896 | "source": [
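897 | "### 1.9 Data access\n",
898 | "\n",
899 | "#### 1.9.1 Views and copies\n",
900 | "\n",
901 | "Let us look at another metadata use case. Imagine that the observations come from a multi-year study in which samples were taken from different subjects at different sites, and that each cell carries this information. We would typically receive it in some format and then store it in a DataFrame:\n",
902 | "\n",
903 | "- time_yr: the time point (in years) at which the cell was sampled\n",
904 | "- subject_id: the subject the cell was taken from\n",
905 | "- instrument_type: the type of instrument used to measure the cell\n",
906 | "- site: the site where the sample was collected"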
907 | ]
908 | },
909 | {
910 | "cell_type": "code",
911 | "execution_count": 22,
912 | "metadata": {
913 | "vscode": {
914 | "languageId": "python"
915 | }
916 | },
917 | "outputs": [],
918 | "source": [
919 | "obs_meta = pd.DataFrame(\n",
920 | " {\n",
921 | " \"time_yr\": np.random.choice([0, 2, 4, 8], adata.n_obs),\n",
922 | " \"subject_id\": np.random.choice(\n",
923 | " [\"subject 1\", \"subject 2\", \"subject 4\", \"subject 8\"], adata.n_obs\n",
924 | " ),\n",
925 | " \"instrument_type\": np.random.choice([\"type a\", \"type b\"], adata.n_obs),\n",
926 | " \"site\": np.random.choice([\"site x\", \"site y\"], adata.n_obs),\n",
927 | " },\n",
928 | " index=adata.obs.index, # these are the same IDs of observations as above!\n",
929 | ")"
930 | ]
931 | },
932 | {
933 | "attachments": {},
934 | "cell_type": "markdown",
935 | "metadata": {},
936 | "source": [
937 | "We use `obs_meta` in place of the `.obs` of the original AnnData object. To avoid confusion, we create a new `AnnData` object."
938 | ]
939 | },
940 | {
941 | "cell_type": "code",
942 | "execution_count": 23,
943 | "metadata": {
944 | "vscode": {
945 | "languageId": "python"
946 | }
947 | },
948 | "outputs": [
949 | {
950 | "data": {
951 | "text/plain": [
952 | "AnnData object with n_obs × n_vars = 100 × 2000\n",
953 | " obs: 'time_yr', 'subject_id', 'instrument_type', 'site'"
954 | ]
955 | },
956 | "execution_count": 23,
957 | "metadata": {},
958 | "output_type": "execute_result"
959 | }
960 | ],
961 | "source": [
962 | "adata = ad.AnnData(adata.X, obs=obs_meta, var=adata.var)\n",
963 | "adata"
964 | ]
965 | },
966 | {
967 | "attachments": {},
968 | "cell_type": "markdown",
969 | "metadata": {},
970 | "source": [
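971 | "Note that, as in `numpy`, slicing an AnnData object returns a view of the original rather than a copy, which avoids extra memory use.\n",
972 | "\n",
973 | "When an attribute of a view is modified, AnnData copies the view internally and turns it into an actual object. We can also call `.copy()` explicitly to get an independent object. This is usually unnecessary, but some functions, such as model training in `scvi-tools`, expect an actual AnnData object rather than a view.\n",
974 | "\n",
975 | "We slice AnnData objects with `[]`. Unlike `pandas`, there are no separate accessors: integers inside `[]` behave like `.iloc`, while strings and boolean masks behave like `.loc`. For example, the following selects the first five cells and the genes `Gene_1` and `Gene_3`."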
976 | ]
977 | },
978 | {
979 | "cell_type": "code",
980 | "execution_count": 24,
981 | "metadata": {
982 | "vscode": {
983 | "languageId": "python"
984 | }
985 | },
986 | "outputs": [
987 | {
988 | "data": {
989 | "text/plain": [
990 | "View of AnnData object with n_obs × n_vars = 5 × 2\n",
991 | " obs: 'time_yr', 'subject_id', 'instrument_type', 'site'"
992 | ]
993 | },
994 | "execution_count": 24,
995 | "metadata": {},
996 | "output_type": "execute_result"
997 | }
998 | ],
999 | "source": [
1000 | "adata_view = adata[:5, [\"Gene_1\", \"Gene_3\"]]\n",
1001 | "adata_view"
1002 | ]
1003 | },
1004 | {
1005 | "attachments": {},
1006 | "cell_type": "markdown",
1007 | "metadata": {},
1008 | "source": [
1009 | "The result is reported as `View of AnnData object with n_obs × n_vars = 5 × 2`, while the original AnnData object has not been modified, which is consistent with `pandas`."
1010 | ]
1011 | },
1012 | {
1013 | "cell_type": "code",
1014 | "execution_count": 25,
1015 | "metadata": {
1016 | "vscode": {
1017 | "languageId": "python"
1018 | }
1019 | },
1020 | "outputs": [
1021 | {
1022 | "data": {
1023 | "text/plain": [
1024 | "AnnData object with n_obs × n_vars = 100 × 2000\n",
1025 | " obs: 'time_yr', 'subject_id', 'instrument_type', 'site'"
1026 | ]
1027 | },
1028 | "execution_count": 25,
1029 | "metadata": {},
1030 | "output_type": "execute_result"
1031 | }
1032 | ],
1033 | "source": [
1034 | "adata"
1035 | ]
1036 | },
1037 | {
1038 | "attachments": {},
1039 | "cell_type": "markdown",
1040 | "metadata": {},
1041 | "source": [
1042 | "If we want an AnnData object that holds its own data in memory, we must call `.copy()`."
1043 | ]
1044 | },
1045 | {
1046 | "cell_type": "code",
1047 | "execution_count": 26,
1048 | "metadata": {
1049 | "vscode": {
1050 | "languageId": "python"
1051 | }
1052 | },
1053 | "outputs": [
1054 | {
1055 | "data": {
1056 | "text/plain": [
1057 | "AnnData object with n_obs × n_vars = 5 × 2\n",
1058 | " obs: 'time_yr', 'subject_id', 'instrument_type', 'site'"
1059 | ]
1060 | },
1061 | "execution_count": 26,
1062 | "metadata": {},
1063 | "output_type": "execute_result"
1064 | }
1065 | ],
1066 | "source": [
1067 | "adata_subset = adata[:5, [\"Gene_1\", \"Gene_3\"]].copy()\n",
1068 | "adata_subset"
1069 | ]
1070 | },
1071 | {
1072 | "attachments": {},
1073 | "cell_type": "markdown",
1074 | "metadata": {},
1075 | "source": [
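1076 | "This distinction matters because modifying `adata_subset` never changes the original `adata`, whereas assigning through a slice of `adata` writes into the original object, which can cause subtle logic errors.\n",
1077 | "\n",
1078 | "For example, we can set the first three elements of a column through a view. Here we set the expression of `Gene_1` in the first three cells to 0."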
1079 | ]
1080 | },
1081 | {
1082 | "cell_type": "code",
1083 | "execution_count": 27,
1084 | "metadata": {
1085 | "vscode": {
1086 | "languageId": "python"
1087 | }
1088 | },
1089 | "outputs": [
1090 | {
1091 | "name": "stdout",
1092 | "output_type": "stream",
1093 | "text": [
1094 | "[[3.0], [0.0], [1.0]]\n",
1095 | "[[0.0], [0.0], [0.0]]\n"
1096 | ]
1097 | }
1098 | ],
1099 | "source": [
1100 | "print(adata[:3, \"Gene_1\"].X.toarray().tolist())\n",
1101 | "adata[:3, \"Gene_1\"].X = [0, 0, 0]\n",
1102 | "print(adata[:3, \"Gene_1\"].X.toarray().tolist())"
1103 | ]
1104 | },
1105 | {
1106 | "attachments": {},
1107 | "cell_type": "markdown",
1108 | "metadata": {},
1109 | "source": [
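1110 | "Although we stressed the difference between `adata_view` and `adata_subset` above, modifying an attribute of `adata_view` (here `.obs`) makes AnnData call `.copy()` automatically: the view is turned into an actual object with its own data in memory, and an `ImplicitModificationWarning` is raised."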
1111 | ]
1112 | },
1113 | {
1114 | "cell_type": "code",
1115 | "execution_count": 29,
1116 | "metadata": {
1117 | "vscode": {
1118 | "languageId": "python"
1119 | }
1120 | },
1121 | "outputs": [
1122 | {
1123 | "name": "stderr",
1124 | "output_type": "stream",
1125 | "text": [
1126 | "/var/folders/4m/2xw3_2s503s9r616083n7w440000gn/T/ipykernel_2195/3248193034.py:1: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
1127 | " adata_view.obs[\"foo\"] = range(5)\n"
1128 | ]
1129 | }
1130 | ],
1131 | "source": [
1132 | "adata_view.obs[\"foo\"] = range(5)"
1133 | ]
1134 | },
1135 | {
1136 | "attachments": {},
1137 | "cell_type": "markdown",
1138 | "metadata": {},
1139 | "source": [
1140 | "`adata_view` now stores actual data and is no longer just a reference to `adata`."
1141 | ]
1142 | },
1143 | {
1144 | "cell_type": "code",
1145 | "execution_count": 31,
1146 | "metadata": {
1147 | "vscode": {
1148 | "languageId": "python"
1149 | }
1150 | },
1151 | "outputs": [
1152 | {
1153 | "data": {
1154 | "text/plain": [
1155 | "AnnData object with n_obs × n_vars = 5 × 2\n",
1156 | " obs: 'time_yr', 'subject_id', 'instrument_type', 'site', 'foo'"
1157 | ]
1158 | },
1159 | "execution_count": 31,
1160 | "metadata": {},
1161 | "output_type": "execute_result"
1162 | }
1163 | ],
1164 | "source": [
1165 | "adata_view"
1166 | ]
1167 | },
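Since the text compares AnnData views to `numpy`, the view and copy distinction can be sketched with NumPy alone. This is an analogy under stated assumptions, not AnnData code: a small array `X` stands in for the data matrix. One difference is worth keeping in mind: a NumPy view always writes through to its parent, whereas a stored AnnData view is converted into an independent object when one of its attributes is modified, as shown above.

```python
import numpy as np

# Stand-in for a data matrix (illustrative values only).
X = np.arange(12, dtype=float).reshape(4, 3)

view = X[:2, :]            # basic slicing returns a view on X's memory
subset = X[:2, :].copy()   # an explicit copy owns its own memory

shares_view = np.shares_memory(X, view)
shares_subset = np.shares_memory(X, subset)
print(shares_view, shares_subset)

subset[0, 0] = -1.0        # changes only the copy
view[0, 1] = -2.0          # writes through to X

print(X[0, 0], X[0, 1])
```

Calling `.copy()` costs memory, so it is worth doing only when an independent object is actually needed.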
1168 | {
1169 | "attachments": {},
1170 | "cell_type": "markdown",
1171 | "metadata": {},
1172 | "source": [
1173 | "We can also slice an AnnData object with a `pandas` boolean mask."
1174 | ]
1175 | },
1176 | {
1177 | "cell_type": "code",
1178 | "execution_count": 32,
1179 | "metadata": {
1180 | "vscode": {
1181 | "languageId": "python"
1182 | }
1183 | },
1184 | "outputs": [
1185 | {
1186 | "data": {
1252 | "text/plain": [
1253 | " time_yr subject_id instrument_type site\n",
1254 | "Cell_0 2 subject 8 type a site x\n",
1255 | "Cell_7 4 subject 1 type b site y\n",
1256 | "Cell_8 4 subject 2 type a site x\n",
1257 | "Cell_11 4 subject 1 type a site x\n",
1258 | "Cell_16 2 subject 1 type b site x"
1259 | ]
1260 | },
1261 | "execution_count": 32,
1262 | "metadata": {},
1263 | "output_type": "execute_result"
1264 | }
1265 | ],
1266 | "source": [
1267 | "adata[adata.obs.time_yr.isin([2, 4])].obs.head()"
1268 | ]
1269 | },
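Boolean slicing keeps the metadata and the data matrix aligned: the same mask selects rows from both. Below is a minimal NumPy sketch of that idea, assuming a made-up `time_yr` vector and a small matrix `X` as stand-ins for `adata.obs.time_yr` and `adata.X`.

```python
import numpy as np

# Made-up per-cell metadata and a matching 6 x 3 matrix.
time_yr = np.array([0, 2, 4, 8, 2, 0])
X = np.arange(18).reshape(6, 3)

# True where the cell was sampled at year 2 or year 4.
mask = np.isin(time_yr, [2, 4])

# The same mask is applied to the metadata and to the matrix rows.
time_sel = time_yr[mask]
X_sel = X[mask]
print(mask.tolist())
print(X_sel.shape)
```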
1270 | {
1271 | "attachments": {},
1272 | "cell_type": "markdown",
1273 | "metadata": {},
1274 | "source": [
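1275 | "### 1.10 Reading large files\n",
1276 | "\n",
1277 | "If a single h5ad file is very large, you can read it into memory partially, using either backed mode or the currently experimental [read_elem](https://anndata.readthedocs.io/en/latest/generated/anndata.experimental.read_elem.html) API."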
1278 | ]
1279 | },
1280 | {
1281 | "cell_type": "code",
1282 | "execution_count": 33,
1283 | "metadata": {
1284 | "vscode": {
1285 | "languageId": "python"
1286 | }
1287 | },
1288 | "outputs": [],
1289 | "source": [
1290 | "adata = ad.read_h5ad(\"../../data/my_results.h5ad\", backed=\"r\")"
1291 | ]
1292 | },
1293 | {
1294 | "cell_type": "code",
1295 | "execution_count": 34,
1296 | "metadata": {
1297 | "vscode": {
1298 | "languageId": "python"
1299 | }
1300 | },
1301 | "outputs": [
1302 | {
1303 | "data": {
1304 | "text/plain": [
1305 | "True"
1306 | ]
1307 | },
1308 | "execution_count": 34,
1309 | "metadata": {},
1310 | "output_type": "execute_result"
1311 | }
1312 | ],
1313 | "source": [
1314 | "adata.isbacked"
1315 | ]
1316 | },
1317 | {
1318 | "attachments": {},
1319 | "cell_type": "markdown",
1320 | "metadata": {},
1321 | "source": [
1322 | "If you do this, remember that the AnnData object keeps an open connection to the file it was read from, much like a file that we have opened ourselves."
1323 | ]
1324 | },
1325 | {
1326 | "cell_type": "code",
1327 | "execution_count": 35,
1328 | "metadata": {
1329 | "vscode": {
1330 | "languageId": "python"
1331 | }
1332 | },
1333 | "outputs": [
1334 | {
1335 | "data": {
1336 | "text/plain": [
1337 | "PosixPath('../../data/my_results.h5ad')"
1338 | ]
1339 | },
1340 | "execution_count": 35,
1341 | "metadata": {},
1342 | "output_type": "execute_result"
1343 | }
1344 | ],
1345 | "source": [
1346 | "adata.filename"
1347 | ]
1348 | },
1349 | {
1350 | "attachments": {},
1351 | "cell_type": "markdown",
1352 | "metadata": {},
1353 | "source": [
1354 | "Because we opened it in read-only mode, we cannot damage anything. To continue with this tutorial, we still need to close it explicitly."
1355 | ]
1356 | },
1357 | {
1358 | "cell_type": "code",
1359 | "execution_count": 36,
1360 | "metadata": {
1361 | "vscode": {
1362 | "languageId": "python"
1363 | }
1364 | },
1365 | "outputs": [],
1366 | "source": [
1367 | "adata.file.close()"
1368 | ]
1369 | },
1370 | {
1371 | "attachments": {},
1372 | "cell_type": "markdown",
1373 | "metadata": {},
1374 | "source": [
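1375 | "## 2. Storing multimodal data with MuData\n",
1376 | "\n",
1377 | "AnnData is designed primarily for storing and manipulating unimodal data. However, single-cell multimodal assays such as CITE-Seq measure RNA and surface proteins at the same time and therefore produce multimodal data.\n",
1378 | "\n",
1379 | "Such data need a more advanced form of storage, and this is where MuData comes in. MuData builds on top of AnnData to store and manipulate multimodal data."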
1385 | ]
1386 | },
1387 | {
1388 | "attachments": {},
1389 | "cell_type": "markdown",
1390 | "metadata": {},
1391 | "source": [
1392 | "### 2.1 Installation\n",
1393 | "\n",
1394 | "MuData is available on PyPI and Conda and can be installed with either of the following:"
1395 | ]
1396 | },
1397 | {
1398 | "cell_type": "code",
1399 | "execution_count": null,
1400 | "metadata": {
1401 | "vscode": {
1402 | "languageId": "python"
1403 | }
1404 | },
1405 | "outputs": [],
1406 | "source": [
1407 | "%pip install mudata\n",
1408 | "# Alternative (run only one of the two): %conda install -c conda-forge mudata"
1409 | ]
1410 | },
1411 | {
1412 | "attachments": {},
1413 | "cell_type": "markdown",
1414 | "metadata": {},
1415 | "source": [
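1416 | "The main idea behind MuData: a MuData object contains references to the individual AnnData objects of the unimodal data, while the MuData object itself also stores multimodal annotations. We can therefore access an AnnData object directly to transform unimodal data and store the results in that AnnData object's annotations, and we can also run joint computations across modalities and store their results in the global MuData object.\n",
1417 | "\n",
1418 | "Technically, this is implemented by a MuData object holding a dictionary of AnnData objects, one per modality, in its `.mod` (= modality) attribute. Like AnnData objects themselves, MuData objects also have attributes such as `.obs` (annotations of observations, i.e. samples or cells) and `.var` (annotations of variables, i.e. features), as well as `.obsm` and `.varm` for their multidimensional annotations, such as embeddings."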
1419 | ]
1420 | },
1421 | {
1422 | "attachments": {},
1423 | "cell_type": "markdown",
1424 | "metadata": {},
1425 | "source": [
1426 | "### 2.2 Initializing a MuData object\n",
1427 | "\n",
1428 | "We start by importing MuData from the mudata package."
1429 | ]
1430 | },
1431 | {
1432 | "cell_type": "code",
1433 | "execution_count": 37,
1434 | "metadata": {
1435 | "vscode": {
1436 | "languageId": "python"
1437 | }
1438 | },
1439 | "outputs": [
1440 | {
1441 | "data": {
1442 | "text/plain": [
1443 | "(1000, 100)"
1444 | ]
1445 | },
1446 | "execution_count": 37,
1447 | "metadata": {},
1448 | "output_type": "execute_result"
1449 | }
1450 | ],
1451 | "source": [
1452 | "import mudata as md\n",
1453 | "# To create a MuData object, we first simulate some data\n",
1454 | "n, d, k = 1000, 100, 10\n",
1455 | "\n",
1456 | "z = np.random.normal(loc=np.arange(k), scale=np.arange(k) * 2, size=(n, k))\n",
1457 | "w = np.random.normal(size=(d, k))\n",
1458 | "y = np.dot(z, w.T)\n",
1459 | "y.shape"
1460 | ]
1461 | },
1462 | {
1463 | "attachments": {},
1464 | "cell_type": "markdown",
1465 | "metadata": {},
1466 | "source": [
1467 | "To create a MuData object, we first need several unimodal AnnData objects. We therefore create two AnnData objects that share the same obs but have different var."
1468 | ]
1469 | },
1470 | {
1471 | "cell_type": "code",
1472 | "execution_count": 38,
1473 | "metadata": {
1474 | "vscode": {
1475 | "languageId": "python"
1476 | }
1477 | },
1478 | "outputs": [
1479 | {
1480 | "name": "stderr",
1481 | "output_type": "stream",
1482 | "text": [
1483 | "/var/folders/4m/2xw3_2s503s9r616083n7w440000gn/T/ipykernel_2195/3341965559.py:1: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n",
1484 | " adata = ad.AnnData(y)\n"
1485 | ]
1486 | },
1487 | {
1488 | "data": {
1489 | "text/plain": [
1490 | "AnnData object with n_obs × n_vars = 1000 × 100"
1491 | ]
1492 | },
1493 | "execution_count": 38,
1494 | "metadata": {},
1495 | "output_type": "execute_result"
1496 | }
1497 | ],
1498 | "source": [
1499 | "adata = ad.AnnData(y)\n",
1500 | "adata.obs_names = [f\"obs_{i+1}\" for i in range(n)]\n",
1501 | "adata.var_names = [f\"var_{j+1}\" for j in range(d)]\n",
1502 | "adata"
1503 | ]
1504 | },
1505 | {
1506 | "cell_type": "code",
1507 | "execution_count": 39,
1508 | "metadata": {
1509 | "vscode": {
1510 | "languageId": "python"
1511 | }
1512 | },
1513 | "outputs": [
1514 | {
1515 | "name": "stderr",
1516 | "output_type": "stream",
1517 | "text": [
1518 | "/var/folders/4m/2xw3_2s503s9r616083n7w440000gn/T/ipykernel_2195/255684217.py:5: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n",
1519 | " adata2 = ad.AnnData(y2)\n"
1520 | ]
1521 | },
1522 | {
1523 | "data": {
1524 | "text/plain": [
1525 | "AnnData object with n_obs × n_vars = 1000 × 50"
1526 | ]
1527 | },
1528 | "execution_count": 39,
1529 | "metadata": {},
1530 | "output_type": "execute_result"
1531 | }
1532 | ],
1533 | "source": [
1534 | "d2 = 50\n",
1535 | "w2 = np.random.normal(size=(d2, k))\n",
1536 | "y2 = np.dot(z, w2.T)\n",
1537 | "\n",
1538 | "adata2 = ad.AnnData(y2)\n",
1539 | "adata2.obs_names = [f\"obs_{i+1}\" for i in range(n)]\n",
1540 | "adata2.var_names = [f\"var2_{j+1}\" for j in range(d2)]\n",
1541 | "adata2"
1542 | ]
1543 | },
1544 | {
1545 | "attachments": {},
1546 | "cell_type": "markdown",
1547 | "metadata": {},
1548 | "source": [
1549 | "These two AnnData objects (the two modalities) can then be wrapped into a single MuData object. Here we name the first modality A and the second modality B."
1550 | ]
1551 | },
1552 | {
1553 | "cell_type": "code",
1554 | "execution_count": 40,
1555 | "metadata": {
1556 | "vscode": {
1557 | "languageId": "python"
1558 | }
1559 | },
1560 | "outputs": [
1561 | {
1562 | "data": {
1569 | "text/plain": [
1570 | "MuData object with n_obs × n_vars = 1000 × 150\n",
1571 | " 2 modalities\n",
1572 | " A:\t1000 x 100\n",
1573 | " B:\t1000 x 50"
1574 | ]
1575 | },
1576 | "execution_count": 40,
1577 | "metadata": {},
1578 | "output_type": "execute_result"
1579 | }
1580 | ],
1581 | "source": [
1582 | "mdata = md.MuData({\"A\": adata, \"B\": adata2})\n",
1583 | "mdata"
1584 | ]
1585 | },
1586 | {
1587 | "attachments": {},
1588 | "cell_type": "markdown",
1589 | "metadata": {},
1590 | "source": [
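1591 | "The obs and var of a MuData object are global: obs that share the same name (`.obs_names`) in different modalities are treated as the same obs, whereas var names (`.var_names`) are expected to be unique. In the object description above, `mdata` has 1000 obs and 150 = 100 + 50 var.\n",
1592 | "\n",
1593 | "### 2.3 MuData attributes\n",
1594 | "\n",
1595 | "A MuData object carries the same kinds of annotations described earlier for AnnData objects, such as `.obs` and `.var`, and adds `.mod` as an accessor to the individual modalities.\n",
1596 | "\n",
1597 | "Modalities are stored in a collection that is accessible through the `.mod` attribute of the MuData object, with modality names as keys and AnnData objects as values."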
1598 | ]
1599 | },
1600 | {
1601 | "cell_type": "code",
1602 | "execution_count": 41,
1603 | "metadata": {
1604 | "vscode": {
1605 | "languageId": "python"
1606 | }
1607 | },
1608 | "outputs": [
1609 | {
1610 | "data": {
1611 | "text/plain": [
1612 | "['A', 'B']"
1613 | ]
1614 | },
1615 | "execution_count": 41,
1616 | "metadata": {},
1617 | "output_type": "execute_result"
1618 | }
1619 | ],
1620 | "source": [
1621 | "list(mdata.mod.keys())"
1622 | ]
1623 | },
1624 | {
1625 | "attachments": {},
1626 | "cell_type": "markdown",
1627 | "metadata": {},
1628 | "source": [
1629 | "An individual modality can be accessed by name through the `.mod` attribute, or through the MuData object itself as a shorthand."
1630 | ]
1631 | },
1632 | {
1633 | "cell_type": "code",
1634 | "execution_count": 42,
1635 | "metadata": {
1636 | "vscode": {
1637 | "languageId": "python"
1638 | }
1639 | },
1640 | "outputs": [
1641 | {
1642 | "name": "stdout",
1643 | "output_type": "stream",
1644 | "text": [
1645 | "AnnData object with n_obs × n_vars = 1000 × 100\n",
1646 | "AnnData object with n_obs × n_vars = 1000 × 100\n"
1647 | ]
1648 | }
1649 | ],
1650 | "source": [
1651 | "print(mdata.mod[\"A\"])\n",
1652 | "print(mdata[\"A\"])"
1653 | ]
1654 | },
1655 | {
1656 | "attachments": {},
1657 | "cell_type": "markdown",
1658 | "metadata": {},
1659 | "source": [
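1660 | "Annotations of samples (cells) are accessible through the `.obs` attribute, which by default includes copies of the columns of the `.obs` data frames of the individual modalities. The same holds for `.var`, which contains the annotations of variables (features). The obs columns copied from individual modalities carry the modality name as a prefix, for example `rna:n_genes`, and the same is true for var columns. However, if columns with the same name exist in `.var` of several modalities (for example `n_cells`), these columns are merged across modalities and no prefix is added. When these slots change in the AnnData object of a modality, for example when new columns are added or samples (cells) are filtered out, the changes have to be fetched with the `.update()` method (see below).\n",
1661 | "\n",
1662 | "Multidimensional annotations of samples (cells) are accessible through the `.obsm` attribute. These can be, for example, UMAP coordinates learned jointly on all modalities.\n",
1663 | "\n",
1664 | "The shape of a MuData object is a pair of numbers derived from the shapes of the individual modalities: the number of distinct obs across modalities and the total number of var."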
1665 | ]
1666 | },
1667 | {
1668 | "cell_type": "code",
1669 | "execution_count": 43,
1670 | "metadata": {
1671 | "vscode": {
1672 | "languageId": "python"
1673 | }
1674 | },
1675 | "outputs": [
1676 | {
1677 | "name": "stdout",
1678 | "output_type": "stream",
1679 | "text": [
1680 | "(1000, 150)\n",
1681 | "1000\n",
1682 | "150\n"
1683 | ]
1684 | }
1685 | ],
1686 | "source": [
1687 | "print(mdata.shape)\n",
1688 | "print(mdata.n_obs)\n",
1689 | "print(mdata.n_vars)"
1690 | ]
1691 | },
1692 | {
1693 | "attachments": {},
1694 | "cell_type": "markdown",
1695 | "metadata": {},
1696 | "source": [
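1697 | "By default, each var is counted as belonging to a single modality, whereas obs with the same name are counted as the same obs. In other words, one cell has measurements in several modalities, and the var are the features of the individual modalities."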
1698 | ]
1699 | },
1700 | {
1701 | "cell_type": "code",
1702 | "execution_count": 44,
1703 | "metadata": {
1704 | "vscode": {
1705 | "languageId": "python"
1706 | }
1707 | },
1708 | "outputs": [
1709 | {
1710 | "data": {
1711 | "text/plain": [
1712 | "[(1000, 100), (1000, 50)]"
1713 | ]
1714 | },
1715 | "execution_count": 44,
1716 | "metadata": {},
1717 | "output_type": "execute_result"
1718 | }
1719 | ],
1720 | "source": [
1721 | "[adata.shape for adata in mdata.mod.values()]"
1722 | ]
1723 | },
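The way the global shape follows from the modalities can be sketched in plain Python. This is a conceptual model under stated assumptions, not MuData's implementation: each modality is represented by a hypothetical dictionary that holds only its obs and var names, mirroring the names used for `adata` and `adata2` above.

```python
# Hypothetical stand-ins for the two modalities: names only, no data.
obs_names = [f"obs_{i + 1}" for i in range(1000)]
mods = {
    "A": {"obs": obs_names, "var": [f"var_{j + 1}" for j in range(100)]},
    "B": {"obs": obs_names, "var": [f"var2_{j + 1}" for j in range(50)]},
}

# obs with the same name are the same obs: take the ordered union.
global_obs = list(dict.fromkeys(n for m in mods.values() for n in m["obs"]))
# var belong to exactly one modality: concatenate them.
global_var = [n for m in mods.values() for n in m["var"]]

shape = (len(global_obs), len(global_var))
print(shape)  # (1000, 150)
```

If the two modalities shared only some of their obs, the first number would be the size of the union rather than 1000.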
1724 | {
1725 | "attachments": {},
1726 | "cell_type": "markdown",
1727 | "metadata": {},
1728 | "source": [
1729 | "If the shape or the names within a modality change, for example when `adata2` is modified, we must call `MuData.update()` to propagate the change to the `MuData` object."
1730 | ]
1731 | },
1732 | {
1733 | "cell_type": "code",
1734 | "execution_count": 45,
1735 | "metadata": {
1736 | "vscode": {
1737 | "languageId": "python"
1738 | }
1739 | },
1740 | "outputs": [],
1741 | "source": [
1742 | "adata2.var_names = [\"var_ad2_\" + e.split(\"_\")[1] for e in adata2.var_names]"
1743 | ]
1744 | },
1745 | {
1746 | "cell_type": "code",
1747 | "execution_count": 46,
1748 | "metadata": {
1749 | "vscode": {
1750 | "languageId": "python"
1751 | }
1752 | },
1753 | "outputs": [
1754 | {
1755 | "name": "stdout",
1756 | "output_type": "stream",
1757 | "text": [
1758 | "Outdated variables names: ..., var2_48, var2_49, var2_50\n",
1759 | "Updated variables names: ..., var_ad2_48, var_ad2_49, var_ad2_50\n"
1760 | ]
1761 | }
1762 | ],
1763 | "source": [
1764 | "print(\"Outdated variables names: ...,\", \", \".join(mdata.var_names[-3:]))\n",
1765 | "mdata.update()\n",
1766 | "print(\"Updated variables names: ...,\", \", \".join(mdata.var_names[-3:]))"
1767 | ]
1768 | },
1769 | {
1770 | "attachments": {},
1771 | "cell_type": "markdown",
1772 | "metadata": {},
1773 | "source": [
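1774 | "This shows that MuData stores references to the original objects: when we change unstructured data of an original object, the change is visible through the MuData object without calling `update`."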
1775 | ]
1776 | },
1777 | {
1778 | "cell_type": "code",
1779 | "execution_count": 47,
1780 | "metadata": {
1781 | "vscode": {
1782 | "languageId": "python"
1783 | }
1784 | },
1785 | "outputs": [],
1786 | "source": [
1787 | "# Add some unstructured data to the original object\n",
1788 | "adata.uns[\"misc\"] = {\"adata\": True}"
1789 | ]
1790 | },
1791 | {
1792 | "cell_type": "code",
1793 | "execution_count": 48,
1794 | "metadata": {
1795 | "vscode": {
1796 | "languageId": "python"
1797 | }
1798 | },
1799 | "outputs": [
1800 | {
1801 | "data": {
1802 | "text/plain": [
1803 | "{'adata': True}"
1804 | ]
1805 | },
1806 | "execution_count": 48,
1807 | "metadata": {},
1808 | "output_type": "execute_result"
1809 | }
1810 | ],
1811 | "source": [
1812 | "# Access modality A via the .mod attribute\n",
1813 | "mdata.mod[\"A\"].uns[\"misc\"]"
1814 | ]
1815 | },
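The two behaviours above (renamed variables need `update()`, while unstructured data is visible at once) are what one would expect from a container that holds references to its modalities but caches the global names. The toy class below is a hypothetical illustration of that pattern in plain Python. `ToyMultiContainer` and its dictionary-based modalities are made up for this sketch and are not part of the mudata API.

```python
class ToyMultiContainer:
    """Hypothetical stand-in for a MuData-like container (not the mudata API)."""

    def __init__(self, mods):
        self.mod = mods  # keeps references to the modality dicts, not copies
        self.update()

    def update(self):
        # Rebuild the cached global variable names from the modalities.
        self.var_names = [v for m in self.mod.values() for v in m["var_names"]]


mod_a = {"var_names": ["var_1", "var_2"], "uns": {}}
mod_b = {"var_names": ["var2_1"], "uns": {}}
toy = ToyMultiContainer({"A": mod_a, "B": mod_b})

mod_b["var_names"] = ["var_ad2_1"]  # rename a variable inside modality B
stale = list(toy.var_names)         # the cache still holds the old name
toy.update()
fresh = list(toy.var_names)         # the cache now reflects the rename

mod_a["uns"]["misc"] = {"adata": True}  # same object, so visible at once
print(stale[-1], fresh[-1], toy.mod["A"]["uns"]["misc"])
```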
1816 | {
1817 | "attachments": {},
1818 | "cell_type": "markdown",
1819 | "metadata": {},
1820 | "source": [
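1821 | "### 2.4 Mappings\n",
1822 | "\n",
1823 | "When a MuData object is created, mappings between the global obs and var and each modality are created as well. They are stored as boolean vectors in `.obsm` and `.varm` under the modality name. Since modalities A and B share all of their obs, every value in both obs mappings is `True`."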
1824 | ]
1825 | },
1826 | {
1827 | "cell_type": "code",
1828 | "execution_count": 50,
1829 | "metadata": {
1830 | "vscode": {
1831 | "languageId": "python"
1832 | }
1833 | },
1834 | "outputs": [
1835 | {
1836 | "data": {
1837 | "text/plain": [
1838 | "True"
1839 | ]
1840 | },
1841 | "execution_count": 50,
1842 | "metadata": {},
1843 | "output_type": "execute_result"
1844 | }
1845 | ],
1846 | "source": [
1847 | "np.sum(mdata.obsm[\"A\"]) == np.sum(mdata.obsm[\"B\"]) == n"
1848 | ]
1849 | },
1850 | {
1851 | "attachments": {},
1852 | "cell_type": "markdown",
1853 | "metadata": {},
1854 | "source": [
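1855 | "For var, however, the mappings are vectors of length 150. Modality A has 100 `True` values followed by 50 `False` values, while modality B has 100 `False` values followed by 50 `True` values."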
1856 | ]
1857 | },
1858 | {
1859 | "cell_type": "code",
1860 | "execution_count": 51,
1861 | "metadata": {
1862 | "vscode": {
1863 | "languageId": "python"
1864 | }
1865 | },
1866 | "outputs": [
1867 | {
1868 | "data": {
1869 | "text/plain": [
1870 | "array([ True, True, True, True, True, True, True, True, True,\n",
1871 | " True, True, True, True, True, True, True, True, True,\n",
1872 | " True, True, True, True, True, True, True, True, True,\n",
1873 | " True, True, True, True, True, True, True, True, True,\n",
1874 | " True, True, True, True, True, True, True, True, True,\n",
1875 | " True, True, True, True, True, True, True, True, True,\n",
1876 | " True, True, True, True, True, True, True, True, True,\n",
1877 | " True, True, True, True, True, True, True, True, True,\n",
1878 | " True, True, True, True, True, True, True, True, True,\n",
1879 | " True, True, True, True, True, True, True, True, True,\n",
1880 | " True, True, True, True, True, True, True, True, True,\n",
1881 | " True, False, False, False, False, False, False, False, False,\n",
1882 | " False, False, False, False, False, False, False, False, False,\n",
1883 | " False, False, False, False, False, False, False, False, False,\n",
1884 | " False, False, False, False, False, False, False, False, False,\n",
1885 | " False, False, False, False, False, False, False, False, False,\n",
1886 | " False, False, False, False, False, False])"
1887 | ]
1888 | },
1889 | "execution_count": 51,
1890 | "metadata": {},
1891 | "output_type": "execute_result"
1892 | }
1893 | ],
1894 | "source": [
1895 | "mdata.varm[\"A\"]"
1896 | ]
1897 | },
1898 | {
1899 | "attachments": {},
1900 | "cell_type": "markdown",
1901 | "metadata": {},
1902 | "source": [
1903 | "### 2.5 Views\n",
1904 | "\n",
1905 | "As with AnnData objects, slicing a MuData object returns a view of the original data."
1906 | ]
1907 | },
1908 | {
1909 | "cell_type": "code",
1910 | "execution_count": 52,
1911 | "metadata": {
1912 | "vscode": {
1913 | "languageId": "python"
1914 | }
1915 | },
1916 | "outputs": [
1917 | {
1918 | "name": "stdout",
1919 | "output_type": "stream",
1920 | "text": [
1921 | "True\n",
1922 | "True\n"
1923 | ]
1924 | }
1925 | ],
1926 | "source": [
1927 | "view = mdata[:100, :1000]\n",
1928 | "print(view.is_view)\n",
1929 | "print(view[\"A\"].is_view)"
1930 | ]
1931 | },
1932 | {
1933 | "attachments": {},
1934 | "cell_type": "markdown",
1935 | "metadata": {},
1936 | "source": [
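1937 | "Subsetting a MuData object is special because the slice is applied across modalities: a slicing operation on a set of obs_names and/or var_names is performed on every modality, not only on the global multimodal annotation. This behaviour keeps workflows memory-efficient, which matters most when working with large datasets. If the object is to be modified, however, create a copy of it; the copy is no longer a view and does not depend on the original object."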
1938 | ]
1939 | },
1940 | {
1941 | "cell_type": "code",
1942 | "execution_count": 53,
1943 | "metadata": {
1944 | "vscode": {
1945 | "languageId": "python"
1946 | }
1947 | },
1948 | "outputs": [
1949 | {
1950 | "data": {
1951 | "text/plain": [
1952 | "False"
1953 | ]
1954 | },
1955 | "execution_count": 53,
1956 | "metadata": {},
1957 | "output_type": "execute_result"
1958 | }
1959 | ],
1960 | "source": [
1961 | "mdata_sub = view.copy()\n",
1962 | "mdata_sub.is_view"
1963 | ]
1964 | },
1965 | {
1966 | "attachments": {},
1967 | "cell_type": "markdown",
1968 | "metadata": {},
1969 | "source": [
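1970 | "### 2.6 Reading and writing MuData objects\n",
1971 | "\n",
1972 | "Like AnnData objects, MuData objects are designed to be serialized to HDF5-based .h5mu files. All modalities are stored under their respective names in the /mod group of the .h5mu file. Each individual modality, e.g. /mod/A, is stored in the same way as it would be stored in an .h5ad file. MuData objects can be read and written as follows:"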
1973 | ]
1974 | },
1975 | {
1976 | "cell_type": "code",
1977 | "execution_count": 54,
1978 | "metadata": {
1979 | "vscode": {
1980 | "languageId": "python"
1981 | }
1982 | },
1983 | "outputs": [
1984 | {
1985 | "data": {
1986 | "text/html": [
1987 | "MuData object with n_obs × n_vars = 1000 × 150 backed at '../../data/my_mudata.h5mu'\n",
1988 | " 2 modalities\n",
1989 | " A:\t1000 x 100\n",
1990 | " uns:\t'misc'\n",
1991 | " B:\t1000 x 50"
1992 | ],
1993 | "text/plain": [
1994 | "MuData object with n_obs × n_vars = 1000 × 150 backed at '../../data/my_mudata.h5mu'\n",
1995 | " 2 modalities\n",
1996 | " A:\t1000 x 100\n",
1997 | " uns:\t'misc'\n",
1998 | " B:\t1000 x 50"
1999 | ]
2000 | },
2001 | "execution_count": 54,
2002 | "metadata": {},
2003 | "output_type": "execute_result"
2004 | }
2005 | ],
2006 | "source": [
2007 | "mdata.write(\"../../data/my_mudata.h5mu\")\n",
2008 | "mdata_r = md.read(\"../../data/my_mudata.h5mu\", backed=True)\n",
2009 | "mdata_r"
2010 | ]
2011 | },
2012 | {
2013 | "attachments": {},
2014 | "cell_type": "markdown",
2015 | "metadata": {},
2016 | "source": [
2017 | "Individual modalities inside the MuData object are backed as well"
2018 | ]
2019 | },
2020 | {
2021 | "cell_type": "code",
2022 | "execution_count": 55,
2023 | "metadata": {
2024 | "vscode": {
2025 | "languageId": "python"
2026 | }
2027 | },
2028 | "outputs": [
2029 | {
2030 | "data": {
2031 | "text/plain": [
2032 | "True"
2033 | ]
2034 | },
2035 | "execution_count": 55,
2036 | "metadata": {},
2037 | "output_type": "execute_result"
2038 | }
2039 | ],
2040 | "source": [
2041 | "mdata_r[\"A\"].isbacked"
2042 | ]
2043 | },
2044 | {
2045 | "attachments": {},
2046 | "cell_type": "markdown",
2047 | "metadata": {},
2048 | "source": [
2049 | "If the original object is backed, a filename must be passed to the `.copy()` call, and the resulting object will be backed at the new location."
2050 | ]
2051 | },
2052 | {
2053 | "cell_type": "code",
2054 | "execution_count": 56,
2055 | "metadata": {
2056 | "vscode": {
2057 | "languageId": "python"
2058 | }
2059 | },
2060 | "outputs": [
2061 | {
2062 | "name": "stdout",
2063 | "output_type": "stream",
2064 | "text": [
2065 | "False\n",
2066 | "True\n"
2067 | ]
2068 | }
2069 | ],
2070 | "source": [
2071 | "mdata_sub = mdata_r.copy(\"mdata_sub.h5mu\")\n",
2072 | "print(mdata_sub.is_view)\n",
2073 | "print(mdata_sub.isbacked)"
2074 | ]
2075 | },
2076 | {
2077 | "cell_type": "code",
2078 | "execution_count": null,
2079 | "metadata": {
2080 | "vscode": {
2081 | "languageId": "python"
2082 | }
2083 | },
2084 | "outputs": [],
2085 | "source": []
2086 | }
2087 | ],
2088 | "metadata": {
2089 | "kernelspec": {
2090 | "display_name": "Python 3 (ipykernel)",
2091 | "language": "python",
2092 | "name": "python3"
2093 | },
2094 | "orig_nbformat": 4
2095 | },
2096 | "nbformat": 4,
2097 | "nbformat_minor": 2
2098 | }
2099 |
--------------------------------------------------------------------------------
/docs/Introduction/mdata_sub.h5mu:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/Introduction/mdata_sub.h5mu
--------------------------------------------------------------------------------
/docs/error.md:
--------------------------------------------------------------------------------
1 |
2 | # Error messages
3 |
4 | ## This chapter is still being completed
--------------------------------------------------------------------------------
/docs/index.md:
--------------------------------------------------------------------------------
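1 | # Single-Cell Sequencing: The Best Analysis Pipeline
2 |
3 | - Author: starlitnightly
4 | - Date: 2023.07.14
5 |
6 | !!! note "Prologue"
7 | I have been doing single-cell analysis for a while now. Most Chinese-language tutorials use R for the analysis; few use Python, and those that do are often direct translations of the scanpy tutorials, which may by now be somewhat dated. Here we draw on [Single cell best practice](https://www.sc-best-practices.org/preamble.html), hoping to offer practitioners in China a complete tutorial guide and analysis.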
8 |
9 |
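10 | ## Introduction
11 |
12 | The human body is a complex machine that relies heavily on the basic units of life: cells. Cells can be divided into different types, and they can even undergo transitions during development, disease, or regeneration. This cellular heterogeneity shows up in morphology, function, and gene expression profiles. Strong perturbations can dysregulate cell types, affecting the whole system and even causing severe diseases such as cancer [Macaulay et al., 2017]. Understanding how cells behave in their normal state and under perturbation is therefore essential for improving our understanding of entire cellular systems.
13 |
14 | This enormous task can be approached in different ways, the most promising of which is to profile cells at the individual level. To date, the transcriptome of each cell has mainly been measured with a process called single-cell RNA sequencing. With recent advances in single-cell genomics, it is now possible to combine transcriptome information with spatial, chromatin accessibility, or protein information. These advances not only help reveal complex regulatory mechanisms but also add complexity for data analysts.
15 |
16 | Today, data analysts face a vast landscape of analysis tools containing more than 1000 computational single-cell analysis methods. Navigating this broad range of tools to produce reliable, state-of-the-art results is becoming increasingly challenging.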
17 |
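18 | ## What this book covers
19 |
20 | The goal of this book is to teach newcomers and professionals alike the best practices of single-cell sequencing analysis in Python. It walks you through a range of common analysis steps, from preprocessing to visualization and statistical evaluation, and beyond. Reading the whole book should enable you to analyze unimodal and multimodal single-cell sequencing data independently. The guidelines and recommendations in this book aim to teach not only how to run a single-cell analysis but how to run it correctly. Wherever possible, our recommendations are based on external benchmarks and evaluations. Finally, we see this book as a practical resource for single-cell data analysts, one that can be updated easily when recommendations change.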
21 |
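22 | ## What this book does not cover
23 |
24 | This book does not cover the fundamentals of biology or computer science, including programming. Nor is it a complete collection of all analysis tools designed for a given task. We place particular emphasis on tools that have been externally validated to perform best on the data at hand, or on community-validated best practices. Where external validation is not possible, we recommend workflows only on the basis of our own extensive experience.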
25 |
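26 | ## Structure of the book
27 |
28 | Each chapter corresponds to a different stage of a typical single-cell data analysis project. An analysis workflow generally follows the order of the chapters, although there is some flexibility depending on the downstream analysis goals. Every chapter contains numerous references, and we encourage readers to consult the original sources of our statements. Although we try to provide the necessary background wherever possible, our summaries cannot always capture the full rationale behind our recommendations.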
29 |
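30 | ## Prerequisites
31 |
32 | Bioinformatics is a challenging field for newcomers because it requires some understanding of both biology and computer science. Single-cell analysis is even more demanding, since it combines many subfields and the datasets are usually large. This book cannot cover every prerequisite for computational single-cell analysis, so we suggest getting a rough overview of the topics below. The following links may improve your learning experience with this book:
33 |
34 | Basic Python programming. You should be familiar with control flow (loops, conditionals, and so on), basic data structures (lists, dictionaries, sets), and the core functionality of the most commonly used libraries such as Pandas and NumPy. If you are new to programming and Python, we strongly recommend the Python MOOCs by Song Tian (嵩天) of Beijing Institute of Technology, namely [Python fundamentals](https://www.icourse163.org/course/BIT-268001) and [Python data processing and visualization](https://www.icourse163.org/course/BIT-1001870002).
35 |
36 | Knowing the basics of the AnnData and scanpy packages helps but is not strictly required. The introduction to AnnData in this book is enough to follow along, and the book introduces workflows that use scanpy. We cannot, however, cover all of scanpy's functionality over the course of the book. If you are new to scanpy, we strongly recommend working through the [scanpy tutorials](https://scanpy.readthedocs.io/en/stable/tutorials.html) and occasionally consulting the [scanpy API](https://scanpy.readthedocs.io/en/stable/api.html) reference.
37 |
38 | If you are interested in multimodal data analysis, we recommend learning the basics of muon and MuData. This book introduces MuData in more detail but covers muon only briefly, as it does with AnnData and scanpy. The [muon tutorials](https://muon-tutorials.readthedocs.io/en/latest/) are a good starting point for multimodal data analysis with muon.
39 |
40 | Basic biology. Although we roughly describe how the data are generated, we do not cover the basics of DNA, RNA, and proteins. If you are completely new to molecular biology, we recommend *Molecular Biology of the Cell* by Bruce Alberts et al.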
41 |
42 | ## License
43 |
44 |
45 |
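46 | This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. We thank Single-cell best practices once again for its contribution to single-cell tutorials; this book builds on Single-cell best practices combined with the author's own analysis experience.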
--------------------------------------------------------------------------------
/docs/others/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/.DS_Store
--------------------------------------------------------------------------------
/docs/others/7-1.md:
--------------------------------------------------------------------------------
1 | # 7-1 Publishing single-cell data to the NCBI databases
2 |
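3 | > Author's note
4 | >
5 | > Chinese-language tutorials on single-cell sequencing are still far from comprehensive. The upload guide on the NCBI website is fairly detailed, so this is a task that is **easy once you know how**. Here we demonstrate how to upload sequencing files to NCBI from start to finish. This tutorial was first published in [The best Chinese single-cell tutorial](https://single-cell-tutorial.readthedocs.io/zh/latest/); reproduction without permission is prohibited.
6 | >
7 | > **Word count | Estimated reading time**: 3500 | 5 min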
8 | >
9 | > ——Starlitnightly
10 |
11 |
12 |
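13 | ## 1. Register an NCBI account
14 |
15 | First open the SRA submission portal at https://submit.ncbi.nlm.nih.gov/subs/sra/ and register an account.
16 |
17 | > **Note**
18 | >
19 | > Do not use mailboxes such as 163 or QQ Mail, since you may not receive the emails.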
20 |
21 | ## 2. Create a new SRA submission
22 |
23 | ### 2.1 Choose BioSample
24 |
25 | We first need to prepare the BioSamples, which have to be reviewed by NCBI. Each BioSample represents one real sample.
26 |
27 | 
28 |
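29 | ### 2.2 Fill in the information
30 |
31 | In general, we can specify a release date for the data. I estimated that the paper should be published within a year, so I set the date to the same day next year.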
32 |
33 | 
34 |
35 | From the drop-down list, select Human.
36 |
37 | 
38 |
39 | 
40 |
41 | Here we choose to fill in the attributes by uploading a file.
42 |
43 | 
44 |
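45 | The file looks roughly like this: green columns are required, yellow ones are optional, and you can also add attributes of your own. Note, however, that different sample_name entries must not have completely identical attributes. Even for samples of the same disease, it is advisable to record which patient each one came from so that no two rows are duplicates.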
46 |
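The uniqueness rule above can be checked before submitting. The following is a minimal sketch, not part of any NCBI tooling: it assumes the attribute sheet has already been read into a list of dictionaries, and both the helper `find_duplicate_attribute_rows` and the column names other than `sample_name` (`organism`, `tissue`, `isolate`) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def find_duplicate_attribute_rows(rows, name_key="sample_name"):
    """Return groups of sample names whose other attributes are all identical."""
    groups = defaultdict(list)
    for row in rows:
        # Everything except the sample name forms the comparison key.
        key = tuple(
            sorted((k, str(v).strip()) for k, v in row.items() if k != name_key)
        )
        groups[key].append(row[name_key])
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical rows: S1 and S2 differ only in their sample name.
samples = [
    {"sample_name": "S1", "organism": "Homo sapiens", "tissue": "decidua", "isolate": "patient_1"},
    {"sample_name": "S2", "organism": "Homo sapiens", "tissue": "decidua", "isolate": "patient_1"},
    {"sample_name": "S3", "organism": "Homo sapiens", "tissue": "decidua", "isolate": "patient_2"},
]
duplicates = find_duplicate_attribute_rows(samples)
print(duplicates)
```

Any group that is reported needs at least one distinguishing attribute, such as the patient of origin, before the sheet is uploaded.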
47 | 
48 |
49 | After submitting, we see this page; just wait for the administrators to review it.
50 |
51 | 
52 |
53 | ## 3. Upload the fastq files
54 |
55 | ### 3.1 Obtaining the BioSamples
56 |
57 | After three days of waiting, the review for our fastq upload was finally complete.
58 |
59 | 
60 |
61 | Clicking Manage data shows 30 new BioSamples that we can upload data for. The Accession shown here is the BioSample ID, which is unique to each sample.
62 |
63 | 
64 |
65 | ### 3.2 Upload the SRA data
66 |
67 | Click Sequence Read Archive on the previous page to get ready to upload the fastq files.
68 |
69 | 
70 |
71 | Because my data are on a server, we choose the command-line upload.
72 |
73 | 
74 |
75 | Click Request preload folder; NCBI provides detailed upload instructions.
76 |
77 | 
78 |
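79 | ### 3.3 Installing the ascp environment
80 |
81 | Install ascp directly with conda to avoid pointless incompatibilities and errors, since installing ascp requires a specific glibc version.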
82 |
83 | ```bash
84 | conda create -n ascp python=3.8
85 | conda activate ascp
86 | conda install -c hcc aspera-cli -y
87 | ascp -h # check the installation
88 | ```
89 |
90 | Once the ascp environment is ready, we start building the SRA submission.
91 |
92 | 
93 |
94 | ### 3.4 Filling in the SRA metadata
95 |
96 | The process is fairly simple: we only need to enter the information for each column. Here is an example.
97 |
98 | | biosample_accession | library_ID | title | library_strategy | library_source | library_selection | library_layout | platform | instrument_model | design_description | filetype | filename | filename2 | filename3 | filename4 | filename5 | filename6 | filename7 | filename8 | assembly | fasta_file |
99 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
100 | | SAMN36786*** | Name | RNA-seq of Homo sapiens: Decidua | RNA-Seq | TRANSCRIPTOMIC SINGLE CELL | RANDOM | paired | ILLUMINA | Illumina NovaSeq 6000 | Single cell 10X | fastq | ***_1_1_R1.fq.gz | ***_1_1_R2.fq.gz | ***_2_1_R1.fq.gz | ***_2_1_R2.fq.gz | ***_3_1_R1.fq.gz | ***_3_1_R2.fq.gz | ***_4_1_R1.fq.gz | ***_4_1_R2.fq.gz | | |
101 |
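102 | Many of the columns in the Excel file you download offer a drop-down selection, so the fields shown in bold in the list below are the ones you must fill in yourself rather than select.
103 |
104 | - **biosample_accession**: the BioSample accession number
105 | - **library_ID**: the library ID; it must be unique, so decide for yourself which ID to store in the SRA database
106 | - **title**: the data title; it is free-form, but the usual pattern is "RNA-seq of species: tissue"
107 | - library_strategy: the library construction strategy; usually RNA-Seq here
108 | - library_source: the library source; usually TRANSCRIPTOMIC SINGLE CELL here
109 | - library_selection: the library selection method; usually RANDOM here
110 | - library_layout: the library layout; I used paired-end sequencing, so I wrote paired
111 | - platform: the sequencing platform; I found Illumina in the files from the sequencing company
112 | - instrument_model: the instrument model; I found Illumina NovaSeq 6000 in the files from the sequencing company
113 | - **design_description**: the design description; ours is single-cell sequencing, so Single cell 10X
114 | - filetype: the type of the uploaded files; usually fastq here
115 | - **filename**: file name (1st file). A single-cell library generally has at most 8 files, and with paired-end sequencing at least 2. Most libraries have just two, depending on what the sequencing company delivers.
116 | - **filename2**: file name (2nd file)
117 | - **filename3**: file name (3rd file)
118 | - **filename4**: file name (4th file)
119 | - **filename5**: file name (5th file)
120 | - **filename6**: file name (6th file)
121 | - **filename7**: file name (7th file)
122 | - **filename8**: file name (8th file)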
123 |
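Two of the constraints listed above, a unique `library_ID` and file names that match the files actually placed in the upload folder, are easy to verify locally before submitting. The sketch below is only an illustration under stated assumptions: the helper `check_sra_rows` is hypothetical, the rows are assumed to have been read from the sheet into dictionaries keyed by the column names, and the demo files are empty placeholders created in a temporary folder.

```python
import os
import tempfile

# Column names as they appear in the SRA metadata sheet.
FILENAME_COLUMNS = ["filename"] + [f"filename{i}" for i in range(2, 9)]


def check_sra_rows(rows, upload_dir):
    """Report duplicate library_IDs and listed files missing from upload_dir."""
    problems = []
    seen_ids = set()
    present = set(os.listdir(upload_dir))
    for row in rows:
        lib_id = row["library_ID"]
        if lib_id in seen_ids:
            problems.append(f"duplicate library_ID: {lib_id}")
        seen_ids.add(lib_id)
        for col in FILENAME_COLUMNS:
            name = row.get(col, "")
            # Empty cells are allowed: not every library has 8 files.
            if name and name not in present:
                problems.append(f"{lib_id}: {col} '{name}' not found in upload folder")
    return problems


# Demo with placeholder files; real use would point at the fastq folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ("sampleA_R1.fq.gz", "sampleA_R2.fq.gz"):
        open(os.path.join(folder, name), "w").close()
    rows = [
        {"library_ID": "lib1", "filename": "sampleA_R1.fq.gz", "filename2": "sampleA_R2.fq.gz"},
        {"library_ID": "lib1", "filename": "sampleB_R1.fq.gz"},
    ]
    problems = check_sra_rows(rows, folder)
print(problems)
```

An empty list means both checks passed; it says nothing about the other columns, which NCBI validates on its side.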
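124 | ### 3.5 Upload the local fastq files
125 |
126 | After filling in this sheet, click Continue to reach the file upload page. For intercontinental uploads, only ascp is covered here. The command below was copied from the official site; **do not copy it verbatim**.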
127 |
128 | ```shell
129 | ascp -i /mnt/home/zehuazeng/ncbi/aspera.openssh -QT -l100m -k1 -d /mnt/home/zehuazeng/media/207D/origData subasp@upload.ncbi.nlm.nih.gov:uploads/starlitnightly_163.com_***
130 | ```
131 |
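132 | Only the paths that follow `-i` and `-d` need to be changed; note that both must be `absolute paths`.
133 |
134 | - **-i**: followed by the path to the openssh key file, which can be downloaded via a hyperlink on the upload page
135 | - **-d**: followed by the folder to upload; it only needs to contain the fq files you want to transfer, with file names matching the filename1-8 entries filled in earlier.
136 |
137 | Because the network connection fluctuates, I wrote a .sh script that automatically retries when it detects a failed upload. We name the file `upload.sh`.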
138 |
139 | ```shell
140 | #!/bin/bash
141 |
142 | while true; do
143 | ascp -i /mnt/home/zehuazeng/ncbi/aspera.openssh -QT -l100m -k1 -d /mnt/home/zehuazeng/media/207O/origData8 subasp@upload.ncbi.nlm.nih.gov:uploads/starlitnightly_163.com_UFgOBYes
144 |
145 | if [ $? -eq 0 ]; then
146 | echo "Command completed successfully."
147 | break
148 | else
149 | echo "Command failed. Retrying in 5 seconds..."
150 | sleep 5
151 | fi
152 | done
153 | ```
154 |
155 | Then run `./upload.sh` in the terminal. Note that the file must first be made executable, e.g. `chmod +x ./upload.sh`.
156 |
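The same retry idea can be sketched in Python with a cap on the number of attempts, so that a permanently failing command (for example, a wrong key path) does not loop forever. This is an illustrative sketch under stated assumptions rather than the script used above: `run_with_retries` is a hypothetical helper, and the demo command is a stand-in that simply exits successfully. In practice you would pass the ascp argument list copied from your own upload page.

```python
import subprocess
import sys
import time


def run_with_retries(command, max_attempts=5, delay_seconds=5.0):
    """Run command until it exits with status 0; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"Attempt {attempt} failed with exit code {result.returncode}.")
        if attempt < max_attempts:
            # Wait before retrying, as the shell script does with sleep.
            time.sleep(delay_seconds)
    raise RuntimeError(f"command still failing after {max_attempts} attempts")


# Stand-in command so the sketch runs anywhere; replace it with the
# ascp argument list, e.g. ["ascp", "-i", "<key file>", ...].
demo_command = [sys.executable, "-c", "raise SystemExit(0)"]
attempts_used = run_with_retries(demo_command, max_attempts=3, delay_seconds=0.1)
print(attempts_used)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.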
157 | 
158 |
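159 | The upload then starts automatically, and the script detects whether each attempt succeeded or failed. Once all files have been uploaded, go back to the SUB page from before.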
160 |
161 | 
162 |
163 | Choose `Select preload folder`. Here origData7 is a folder that has already been uploaded, while origData8 is still uploading.
164 |
165 | 
166 |
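167 | After selecting the folder, click Continue to go to the last page. The files are normally picked out for you automatically, based on the filename1-8 entries you filled in earlier.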
168 |
169 | 
170 |
171 | Just click Submit. You are taken back to the first page, which shows that the submission is being processed.
172 |
173 | 
174 |
175 | ### 3.6 Other issues
176 |
177 | One of my uploads failed with the following error:
178 |
179 | ```
180 | 2023-08-06T22:03:02 sra_subprocess error: Finished: /panfs/traces01.be-md.ncbi.nlm.nih.gov/trace_software/pipeline/sra_prod/transform_tools/sharq_load.py --config /panfs/traces01.be-md.ncbi.nlm.nih.gov/trace_software/pipeline/sra_prod/config.sra.public /panfs/traces01.be-md.ncbi.nlm.nih.gov/trace_software/pipeline/sra_prod/transform_tools/sharq --platform=ILLUMINA --log-level=info --output=/export/home/SSD/production_sra_public/sge1240.212826.trace.run_load.sh/SRR25541***.run_load/SRR25541***.output ***_3_1_R2.fq.gz ***_4_1_R2.fq.gz ***_2_1_R2.fq.gz ***_2_1_R1.fq.gz ***_1_1_R1.fq.gz ***_1_1_R2.fq.gz ***_3_1_R1.fq.gz ***_4_1_R1.fq.gz; pid=218120, rc=243
181 | ```
182 |
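183 | This kind of error is not something I could fix myself, so I emailed `sra@ncbi.nlm.nih.gov`. When I checked my inbox two days later, the staff had already sorted it out on their end.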
184 |
185 | 
186 |
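187 | ## 4. Uploading to GEO
188 |
189 | After successfully uploading the raw data to the SRA database, we also need to upload the processed data to the GEO database. Uploading to GEO is similar to uploading to SRA in some ways but differs in others. Probably because processed files are usually not very large, the upload can be done over FTP.
190 |
191 | ### 4.1 Create a new GEO submission
192 |
193 | Click New Submission to create a new submission.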
194 |
195 | 
196 |
197 | Choose high-throughput sequencing to upload the scRNA-seq data. After opening it, we find that we first need to download a metadata file and fill in the information.
198 |
199 | 
200 |
201 | 
202 |
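203 | We choose the second option because the sequencing files have already been uploaded to the SRA database, which avoids uploading the raw data twice.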
204 |
205 | 
206 |
207 | Just fill it in following the format of the EXAMPLE.
208 |
209 | ### 4.2 Upload the processed files and the metadata
210 |
211 | 
212 |
213 | Then click Create personalized upload space to create your own FTP space.
214 |
215 | 
216 |
217 | After a short wait, Step 2 loads. Click the arrow on the left to expand it, and you will find the connection details for the GEO server.
218 |
219 | - Host address:
220 | - username
221 | - password
222 |
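223 | With these three pieces of information we can connect to the remote GEO server. Note, however, that the remote directory must not be the default one but `uploads/starlitnightly_***`, which I circled in red in the figure.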
224 |
225 | 
226 |
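227 | I am on macOS, so I use Transmit for the FTP transfer. Enter the host address from above as the address, the username as the user name, and the password as the password, and set the remote path to the `uploads/starlitnightly_***` directory mentioned above.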
228 |
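For a scriptable alternative to a graphical FTP client, the connection can be sketched with Python's standard `ftplib`. This is a minimal sketch under stated assumptions, not a procedure confirmed by GEO: the helpers `upload_files` and `upload_to_geo` are hypothetical, and the host, username, password, and remote directory are placeholders for the values shown in Step 2 of your own submission page.

```python
import os
from ftplib import FTP


def upload_files(ftp, remote_dir, local_paths):
    """Change into remote_dir and upload each local file in binary mode."""
    ftp.cwd(remote_dir)
    uploaded = []
    for path in local_paths:
        name = os.path.basename(path)
        with open(path, "rb") as handle:
            # STOR stores the file under the same name on the server.
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


def upload_to_geo(host, username, password, remote_dir, local_paths):
    """Log in with the credentials from the submission page and upload."""
    with FTP(host) as ftp:
        ftp.login(user=username, passwd=password)
        return upload_files(ftp, remote_dir, local_paths)


# Example call (placeholders only; nothing is contacted until you run it):
# upload_to_geo("<host address>", "<username>", "<password>",
#               "uploads/<your folder>", ["meta.xlsx"])
```

Keeping `upload_files` separate from the login step means the transfer logic can be exercised with a stand-in connection object before any real credentials are used.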
229 | 
230 |
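231 | After connecting, we find the directory empty. I need to upload the filtered h5 files and the spliced/unspliced matrices generated by velocyto, so I created two folders, h5_file and loom_file. I had already filled in the file names for each sample in the metadata file.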
232 |
233 | 
234 |
235 | 
236 |
237 | 
238 |
239 | Simply drag the files into the corresponding folders to upload them. Note that the metadata file must be uploaded as well.
240 |
241 | 
242 |
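243 | ### 4.3 Submit the GEO request
244 |
245 | Once the files are uploaded, go back to the GEO submission page and click `Notify GEO`. Fill in the directory and the file description, then click Submit.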
246 |
247 | 
248 |
249 | 
250 |
251 | 
252 |
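253 | ### 4.4 Confirm the GEO submission
254 |
255 | After about one working day you will receive a review email from the GEO staff; any problems will be explained in it. I had none, so I went straight back to GEO to check my submissions and found that a new record had indeed appeared.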
256 |
257 | 
258 |
259 | 
260 |
261 | After opening the record, we can click Update to change some of the related information.
262 |
263 | 
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230802030612620.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230802030612620.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230802030645455.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230802030645455.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230802030748770.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230802030748770.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230802030802855.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230802030802855.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230802030827209.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230802030827209.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230802030858128.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230802030858128.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230802031118413.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230802031118413.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230804130900926.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230804130900926.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230804130953860.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230804130953860.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230804131237199.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230804131237199.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230804131322335.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230804131322335.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230804131410657.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230804131410657.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230804131629730.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230804131629730.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230804131657927.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230804131657927.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230804145936324.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230804145936324.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230814184244061.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230814184244061.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230814184344231.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230814184344231.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230814184440738.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230814184440738.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230814184605700.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230814184605700.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230814184643354.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230814184643354.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230814200428123.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230814200428123.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816012830474.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816012830474.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816012842851.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816012842851.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816012920228.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816012920228.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816013029023.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816013029023.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816013300087.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816013300087.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816013351757.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816013351757.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816013430633.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816013430633.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816015247633.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816015247633.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816015518442.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816015518442.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816015550553.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816015550553.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816130240947.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816130240947.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816130413781.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816130413781.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816131154766.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816131154766.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816131708566.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816131708566.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816131944726.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816131944726.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816132034270.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816132034270.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816132100337.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816132100337.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816132200585.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816132200585.png
--------------------------------------------------------------------------------
/docs/others/ncbi_submitted.assets/image-20230816132431751.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Starlitnightly/single_cell_tutorial/94a874febdf46e3395ae080fc05ef0f287352e59/docs/others/ncbi_submitted.assets/image-20230816132431751.png
--------------------------------------------------------------------------------
/docs/overrides/main.html:
--------------------------------------------------------------------------------
1 | {% extends "base.html" %}
2 |
3 | {% block content %}
4 | {{ super() }}
5 |
6 | {% if git_page_authors %}
7 |
8 |
9 | Authors: {{ git_page_authors | default('enable mkdocs-git-authors-plugin') }}
10 |
11 |
12 | {% endif %}
13 | {% endblock %}
--------------------------------------------------------------------------------
/docs/ref/ref.bib:
--------------------------------------------------------------------------------
1 | @article{Aird2011,
2 | title = {Analyzing and Minimizing {{PCR}} Amplification Bias in {{Illumina}} Sequencing Libraries},
3 | author = {Aird, Daniel and Ross, Michael G. and Chen, Wei-Sheng and Danielsson, Maxwell and Fennell, Timothy and Russ, Carsten and Jaffe, David B. and Nusbaum, Chad and Gnirke, Andreas},
4 | year = {2011},
5 | month = feb,
6 | journal = {Genome Biology},
7 | volume = {12},
8 | number = {2},
9 | pages = {R18},
10 | issn = {1474-760X},
11 | doi = {10.1186/gb-2011-12-2-r18}
12 | }
13 |
14 | @article{Alvarez2020,
15 | title = {Enhancing Droplet-Based Single-Nucleus {{RNA}}-Seq Resolution Using the Semi-Supervised Machine Learning Classifier {{DIEM}}},
16 | author = {Alvarez, Marcus and Rahmani, Elior and Jew, Brandon and Garske, Kristina M. and Miao, Zong and Benhammou, Jihane N. and Ye, Chun Jimmie and Pisegna, Joseph R. and Pietil{\"a}inen, Kirsi H. and Halperin, Eran and Pajukanta, P{\"a}ivi},
17 | year = {2020},
18 | month = jul,
19 | journal = {Scientific Reports},
20 | volume = {10},
21 | number = {1},
22 | publisher = {{Springer Science and Business Media LLC}},
23 | doi = {10.1038/s41598-020-67513-5}
24 | }
25 |
26 | @article{Archer2016,
27 | title = {Modeling Enzyme Processivity Reveals That {{RNA-Seq}} Libraries Are Biased in Characteristic and Correctable Ways},
28 | author = {Archer, Nathan and Walsh, Mark D. and Shahrezaei, Vahid and Hebenstreit, Daniel},
29 | year = {2016},
30 | journal = {Cell Systems},
31 | volume = {3},
32 | number = {5},
33 | pages = {467-479.e12},
34 | issn = {2405-4712},
35 | doi = {10.1016/j.cels.2016.10.012},
36 | keywords = {Bayesian framework,bias,coverage,enzyme,Markov Chain Monte Carlo,mathematical modeling,polymerase,processivity,reverse transcriptase,RNA-seq}
37 | }
38 |
39 | @article{Bais2019,
40 | title = {Scds: Computational Annotation of Doublets in Single-Cell {{RNA}} Sequencing Data},
41 | author = {Bais, Abha S and Kostka, Dennis},
42 | editor = {Elofsson, Arne},
43 | year = {2019},
44 | month = sep,
45 | journal = {Bioinformatics (Oxford, England)},
46 | volume = {36},
47 | number = {4},
48 | pages = {1150--1158},
49 | publisher = {{Oxford University Press (OUP)}},
50 | doi = {10.1093/bioinformatics/btz698}
51 | }
52 |
53 | @article{Bakken2018,
54 | title = {Single-Nucleus and Single-Cell Transcriptomes Compared in Matched Cortical Cell Types},
55 | author = {Bakken, Trygve E. and Hodge, Rebecca D. and Miller, Jeremy A. and Yao, Zizhen and Nguyen, Thuc Nghi and Aevermann, Brian and Barkan, Eliza and Bertagnolli, Darren and Casper, Tamara and Dee, Nick and Garren, Emma and Goldy, Jeff and Graybuck, Lucas T. and Kroll, Matthew and Lasken, Roger S. and Lathia, Kanan and Parry, Sheana and Rimorin, Christine and Scheuermann, Richard H. and Schork, Nicholas J. and Shehata, Soraya I. and Tieu, Michael and Phillips, John W. and Bernard, Amy and Smith, Kimberly A. and Zeng, Hongkui and Lein, Ed S. and Tasic, Bosiljka},
56 | year = {2018},
57 | month = dec,
58 | journal = {PLOS ONE},
59 | volume = {13},
60 | number = {12},
61 | pages = {1--24},
62 | publisher = {{Public Library of Science}},
63 | doi = {10.1371/journal.pone.0209648},
64 | abstract = {Transcriptomic profiling of complex tissues by single-nucleus RNA-sequencing (snRNA-seq) affords some advantages over single-cell RNA-sequencing (scRNA-seq). snRNA-seq provides less biased cellular coverage, does not appear to suffer cell isolation-based transcriptional artifacts, and can be applied to archived frozen specimens. We used well-matched snRNA-seq and scRNA-seq datasets from mouse visual cortex to compare cell type detection. Although more transcripts are detected in individual whole cells ({\textasciitilde}11,000 genes) than nuclei ({\textasciitilde}7,000 genes), we demonstrate that closely related neuronal cell types can be similarly discriminated with both methods if intronic sequences are included in snRNA-seq analysis. We estimate that the nuclear proportion of total cellular mRNA varies from 20\% to over 50\% for large and small pyramidal neurons, respectively. Together, these results illustrate the high information content of nuclear RNA for characterization of cellular diversity in brain tissues.}
65 | }
66 |
67 | @article{Bernstein2020,
68 | title = {Solo: {{Doublet}} Identification in Single-Cell {{RNA}}-{{Seq}} via Semi-Supervised Deep Learning},
69 | author = {Bernstein, Nicholas J. and Fong, Nicole L. and Lam, Irene and Roy, Margaret A. and Hendrickson, David G. and Kelley, David R.},
70 | year = {2020},
71 | month = jul,
72 | journal = {Cell Systems},
73 | volume = {11},
74 | number = {1},
75 | pages = {95--101.e5},
76 | publisher = {{Elsevier BV}},
77 | doi = {10.1016/j.cels.2020.05.010}
78 | }
79 |
80 | @article{Bose2015,
81 | title = {Scalable Microfluidics for Single-Cell {{RNA}} Printing and Sequencing},
82 | author = {Bose, Sayantan and Wan, Zhenmao and Carr, Ambrose and Rizvi, Abbas H. and Vieira, Gregory and Pe'er, Dana and Sims, Peter A.},
83 | year = {2015},
84 | month = jun,
85 | journal = {Genome Biology},
86 | volume = {16},
87 | number = {1},
88 | publisher = {{Springer Science and Business Media LLC}},
89 | doi = {10.1186/s13059-015-0684-3}
90 | }
91 |
92 | @article{Bray2016,
93 | title = {Near-Optimal Probabilistic {{RNA}}-Seq Quantification},
94 | author = {Bray, Nicolas L and Pimentel, Harold and Melsted, P{\'a}ll and Pachter, Lior},
95 | year = {2016},
96 | journal = {Nature biotechnology},
97 | volume = {34},
98 | number = {5},
99 | pages = {525--527},
100 | publisher = {{Nature Publishing Group}}
101 | }
102 |
103 | @article{Bruning_2022,
104 | title = {Comparative Analysis of Common Alignment Tools for Single-Cell {{RNA}} Sequencing},
105 | author = {Br{\"u}ning, Ralf Schulze and Tombor, Lukas and Schulz, Marcel H and Dimmeler, Stefanie and John, David},
106 | year = {2022},
107 | journal = {GigaScience},
108 | volume = {11},
109 | publisher = {{Oxford University Press (OUP)}},
110 | doi = {10.1093/gigascience/giac001}
111 | }
112 |
113 | @article{Bruning2022Comparative,
114 | title = {Comparative Analysis of Common Alignment Tools for Single-Cell {{RNA}} Sequencing},
115 | author = {Br{\"u}ning, Ralf Schulze and Tombor, Lukas and Schulz, Marcel H and Dimmeler, Stefanie and John, David},
116 | year = {2022},
117 | journal = {GigaScience},
118 | volume = {11},
119 | publisher = {{Oxford Academic}}
120 | }
121 |
122 | @article{calib,
123 | title = {Alignment-Free Clustering of {{UMI}} Tagged {{DNA}} Molecules},
124 | author = {Orabi, Baraa and Erhan, Emre and McConeghy, Brian and Volik, Stanislav V and Bihan, Stephane Le and Bell, Robert and Collins, Colin C and Chauve, Cedric and Hach, Faraz},
125 | editor = {Berger, Bonnie},
126 | year = {2018},
127 | month = oct,
128 | journal = {Bioinformatics (Oxford, England)},
129 | volume = {35},
130 | number = {11},
131 | pages = {1829--1836},
132 | publisher = {{Oxford University Press (OUP)}},
133 | doi = {10.1093/bioinformatics/bty888}
134 | }
135 |
136 | @article{chao1992aligning,
137 | title = {Aligning Two Sequences within a Specified Diagonal Band},
138 | author = {Chao, Kun-Mao and Pearson, William R. and Miller, Webb},
139 | year = {1992},
140 | journal = {Bioinformatics (Oxford, England)},
141 | volume = {8},
142 | number = {5},
143 | pages = {481--487},
144 | publisher = {{Oxford University Press (OUP)}},
145 | doi = {10.1093/bioinformatics/8.5.481}
146 | }
147 |
148 | @article{darmanis2015,
149 | title = {A Survey of Human Brain Transcriptome Diversity at the Single Cell Level},
150 | author = {Darmanis, Spyros and Sloan, Steven A. and Zhang, Ye and Enge, Martin and Caneda, Christine and Shuer, Lawrence M. and Gephart, Melanie G. Hayden and Barres, Ben A. and Quake, Stephen R.},
151 | year = {2015},
152 | journal = {Proceedings of the National Academy of Sciences},
153 | volume = {112},
154 | number = {23},
155 | eprint = {https://www.pnas.org/doi/pdf/10.1073/pnas.1507125112},
156 | pages = {7285--7290},
157 | doi = {10.1073/pnas.1507125112}
158 | }
159 |
160 | @article{DePasquale2019,
161 | title = {{{DoubletDecon}}: {{Deconvoluting}} Doublets from Single-Cell {{RNA}}-{{Sequencing}} Data},
162 | author = {DePasquale, Erica A.K. and Schnell, Daniel J. and Camp, Pieter-Jan Van and {Valiente-Aland{\'i}}, {\'I}{\~n}igo and Blaxall, Burns C. and Grimes, H. Leighton and Singh, Harinder and Salomonis, Nathan},
163 | year = {2019},
164 | month = nov,
165 | journal = {Cell Reports},
166 | volume = {29},
167 | number = {6},
168 | pages = {1718--1727.e8},
169 | publisher = {{Elsevier BV}},
170 | doi = {10.1016/j.celrep.2019.09.082}
171 | }
172 |
173 | @article{Ding2020,
174 | title = {Systematic Comparison of Single-Cell and Single-Nucleus {{RNA}}-Sequencing Methods},
175 | author = {Ding, Jiarui and Adiconis, Xian and Simmons, Sean K. and Kowalczyk, Monika S. and Hession, Cynthia C. and Marjanovic, Nemanja D. and Hughes, Travis K. and Wadsworth, Marc H. and Burks, Tyler and Nguyen, Lan T. and Kwon, John Y. H. and Barak, Boaz and Ge, William and Kedaigle, Amanda J. and Carroll, Shaina and Li, Shuqiang and Hacohen, Nir and {Rozenblatt-Rosen}, Orit and Shalek, Alex K. and Villani, Alexandra-Chlo{\'e} and Regev, Aviv and Levin, Joshua Z.},
176 | year = {2020},
177 | month = jun,
178 | journal = {Nature Biotechnology},
179 | volume = {38},
180 | number = {6},
181 | pages = {737--746},
182 | issn = {1546-1696},
183 | doi = {10.1038/s41587-020-0465-8}
184 | }
185 |
186 | @article{dobin2013star,
187 | title = {{{STAR}}: Ultrafast Universal {{RNA-seq}} Aligner},
188 | author = {Dobin, Alexander and Davis, Carrie A and Schlesinger, Felix and Drenkow, Jorg and Zaleski, Chris and Jha, Sonali and Batut, Philippe and Chaisson, Mark and Gingeras, Thomas R},
189 | year = {2013},
190 | journal = {Bioinformatics (Oxford, England)},
191 | volume = {29},
192 | number = {1},
193 | pages = {15--21},
194 | publisher = {{Oxford University Press}}
195 | }
196 |
197 | @article{exp:Macosko2015,
198 | title = {Highly Parallel Genome-Wide Expression Profiling of Individual Cells Using Nanoliter Droplets},
199 | author = {Macosko, Evan Z. and Basu, Anindita and Satija, Rahul and Nemesh, James and Shekhar, Karthik and Goldman, Melissa and Tirosh, Itay and Bialas, Allison R. and Kamitaki, Nolan and Martersteck, Emily M. and Trombetta, John J. and Weitz, David A. and Sanes, Joshua R. and Shalek, Alex K. and Regev, Aviv and McCarroll, Steven A.},
200 | year = {2015},
201 | month = may,
202 | journal = {Cell},
203 | volume = {161},
204 | number = {5},
205 | pages = {1202--1214},
206 | publisher = {{Elsevier}},
207 | issn = {0092-8674},
208 | doi = {10.1016/j.cell.2015.05.002}
209 | }
210 |
211 | @article{exp:Zheng2017,
212 | title = {Massively Parallel Digital Transcriptional Profiling of Single Cells},
213 | author = {Zheng, Grace X. Y. and Terry, Jessica M. and Belgrader, Phillip and Ryvkin, Paul and Bent, Zachary W. and Wilson, Ryan and Ziraldo, Solongo B. and Wheeler, Tobias D. and McDermott, Geoff P. and Zhu, Junjie and Gregory, Mark T. and Shuga, Joe and Montesclaros, Luz and Underwood, Jason G. and Masquelier, Donald A. and Nishimura, Stefanie Y. and {Schnall-Levin}, Michael and Wyatt, Paul W. and Hindson, Christopher M. and Bharadwaj, Rajiv and Wong, Alexander and Ness, Kevin D. and Beppu, Lan W. and Deeg, H. Joachim and McFarland, Christopher and Loeb, Keith R. and Valente, William J. and Ericson, Nolan G. and Stevens, Emily A. and Radich, Jerald P. and Mikkelsen, Tarjei S. and Hindson, Benjamin J. and Bielas, Jason H.},
214 | year = {2017},
215 | month = jan,
216 | journal = {Nature Communications},
217 | volume = {8},
218 | number = {1},
219 | pages = {14049},
220 | issn = {2041-1723},
221 | doi = {10.1038/ncomms14049}
222 | }
223 |
224 | @article{farouni2020model,
225 | title = {Model-Based Analysis of Sample Index Hopping Reveals Its Widespread Artifacts in Multiplexed Single-Cell {{RNA}}-Sequencing},
226 | author = {Farouni, Rick and Djambazian, Haig and Ferri, Lorenzo E and Ragoussis, Jiannis and Najafabadi, Hamed S},
227 | year = {2020},
228 | journal = {Nature communications},
229 | volume = {11},
230 | number = {1},
231 | pages = {1--8},
232 | publisher = {{Nature Publishing Group}}
233 | }
234 |
235 | @article{farrar2007striped,
236 | title = {Striped {{Smith}}\textendash{{Waterman}} Speeds Database Searches Six Times over Other {{SIMD}} Implementations},
237 | author = {Farrar, Michael},
238 | year = {2007},
239 | journal = {Bioinformatics (Oxford, England)},
240 | volume = {23},
241 | number = {2},
242 | pages = {156--161},
243 | publisher = {{Oxford University Press}}
244 | }
245 |
246 | @misc{fluidigm,
247 | title = {Single-Cell Analysis with Microfluidics},
248 | author = {{Fluidigm}},
249 | year = {2022}
250 | }
251 |
252 | @article{Gorin2021,
253 | title = {Length Biases in Single-Cell {{RNA}} Sequencing of Pre-{{mRNA}}},
254 | author = {Gorin, Gennady and Pachter, Lior},
255 | year = {2021},
256 | journal = {bioRxiv : the preprint server for biology},
257 | eprint = {https://www.biorxiv.org/content/early/2021/07/31/2021.07.30.454514.full.pdf},
258 | publisher = {{Cold Spring Harbor Laboratory}},
259 | doi = {10.1101/2021.07.30.454514},
260 | elocation-id = {2021.07.30.454514}
261 | }
262 |
263 | @article{Gupta2018,
264 | title = {Single-Cell Isoform {{RNA}} Sequencing Characterizes Isoforms in Thousands of Cerebellar Cells},
265 | author = {Gupta, Ishaan and Collier, Paul G. and Haase, Bettina and Mahfouz, Ahmed and Joglekar, Anoushka and Floyd, Taylor and Koopmans, Frank and Barres, Ben and Smit, August B. and Sloan, Steven A. and Luo, Wenjie and Fedrigo, Olivier and Ross, M. Elizabeth and Tilgner, Hagen U.},
266 | year = {2018},
267 | month = dec,
268 | journal = {Nature Biotechnology},
269 | volume = {36},
270 | number = {12},
271 | pages = {1197--1202},
272 | issn = {1546-1696},
273 | doi = {10.1038/nbt.4259}
274 | }
275 |
276 | @article{Heiser2021,
277 | title = {Automated Quality Control and Cell Identification of Droplet-Based Single-Cell Data Using Dropkick},
278 | author = {Heiser, Cody N. and Wang, Victoria M. and Chen, Bob and Hughey, Jacob J. and Lau, Ken S.},
279 | year = {2021},
280 | month = apr,
281 | journal = {Genome Research},
282 | volume = {31},
283 | number = {10},
284 | pages = {1742--1752},
285 | publisher = {{Cold Spring Harbor Laboratory}},
286 | doi = {10.1101/gr.271908.120}
287 | }
288 |
289 | @article{Hippen2021,
290 | title = {{{miQC}}: {{An}} Adaptive Probabilistic Framework for Quality Control of Single-Cell {{RNA}}-Sequencing Data},
291 | author = {Hippen, Ariel A and Falco, Matias M and Weber, Lukas M and Erkan, Erdogan Pekcan and Zhang, Kaiyang and Doherty, Jennifer Anne and V{\"a}h{\"a}rautio, Anna and Greene, Casey S and Hicks, Stephanie C},
292 | year = {2021},
293 | journal = {PLoS computational biology},
294 | volume = {17},
295 | number = {8},
296 | pages = {e1009290},
297 | publisher = {{Public Library of Science San Francisco, CA USA}}
298 | }
299 |
300 | @article{Hood1987,
301 | title = {Automated {{DNA}} Sequencing and Analysis of the Human Genome},
302 | author = {Hood, L E and Hunkapiller, M W and Smith, L M},
303 | year = {1987},
304 | month = nov,
305 | journal = {Genomics},
306 | volume = {1},
307 | number = {3},
308 | pages = {201--212},
309 | publisher = {{Elsevier BV}},
310 | langid = {english}
311 | }
312 |
313 | @article{Hrdlickova2017,
314 | title = {{{RNA-Seq}} Methods for Transcriptome Analysis},
315 | author = {Hrdlickova, Radmila and Toloue, Masoud and Tian, Bin},
316 | year = {2017},
317 | journal = {WIREs RNA},
318 | volume = {8},
319 | number = {1},
320 | eprint = {https://wires.onlinelibrary.wiley.com/doi/pdf/10.1002/wrna.1364},
321 | pages = {e1364},
322 | doi = {10.1002/wrna.1364}
323 | }
324 |
325 | @article{Islam2013,
326 | title = {Quantitative Single-Cell {{RNA}}-Seq with Unique Molecular Identifiers},
327 | author = {Islam, Saiful and Zeisel, Amit and Joost, Simon and Manno, Gioele La and Zajac, Pawel and Kasper, Maria and L{\"o}nnerberg, Peter and Linnarsson, Sten},
328 | year = {2013},
329 | month = dec,
330 | journal = {Nature Methods},
331 | volume = {11},
332 | number = {2},
333 | pages = {163--166},
334 | publisher = {{Springer Science and Business Media LLC}},
335 | doi = {10.1038/nmeth.2772}
336 | }
337 |
338 | @article{Islam2014,
339 | title = {Quantitative Single-Cell {{RNA}}-Seq with Unique Molecular Identifiers},
340 | author = {Islam, Saiful and Zeisel, Amit and Joost, Simon and La Manno, Gioele and Zajac, Pawel and Kasper, Maria and L{\"o}nnerberg, Peter and Linnarsson, Sten},
341 | year = {2014},
342 | month = feb,
343 | journal = {Nature Methods},
344 | volume = {11},
345 | number = {2},
346 | pages = {163--166},
347 | issn = {1548-7105},
348 | doi = {10.1038/nmeth.2772}
349 | }
350 |
351 | @article{Jain2016,
352 | title = {The {{Oxford Nanopore MinION}}: Delivery of Nanopore Sequencing to the Genomics Community},
353 | author = {Jain, Miten and Olsen, Hugh E. and Paten, Benedict and Akeson, Mark},
354 | year = {2016},
355 | month = nov,
356 | journal = {Genome Biology},
357 | volume = {17},
358 | number = {1},
359 | pages = {239},
360 | issn = {1474-760X},
361 | doi = {10.1186/s13059-016-1103-0}
362 | }
363 |
364 | @article{JOU1972,
365 | title = {Nucleotide Sequence of the Gene Coding for the Bacteriophage {{MS2}} Coat Protein},
366 | author = {JOU, W. MIN and HAEGEMAN, G. and YSEBAERT, M. and FIERS, W.},
367 | year = {1972},
368 | month = may,
369 | journal = {Nature},
370 | volume = {237},
371 | number = {5350},
372 | pages = {82--88},
373 | issn = {1476-4687},
374 | doi = {10.1038/237082a0}
375 | }
376 |
377 | @inproceedings{Ju2017,
378 | title = {Fleximer: Accurate Quantification of {{RNA}}-{{Seq}} via Variable-Length k-Mers},
379 | booktitle = {Proceedings of the 8th {{ACM}} International Conference on Bioinformatics, Computational Biology, and Health Informatics},
380 | author = {Ju, Chelsea J-T and Li, Ruirui and Wu, Zhengliang and Jiang, Jyun-Yu and Yang, Zhao and Wang, Wei},
381 | year = {2017},
382 | pages = {263--272}
383 | }
384 |
385 | @article{Kaminow2021,
386 | title = {{{STARsolo}}: Accurate, Fast and Versatile Mapping/Quantification of Single-Cell and Single-Nucleus {{RNA}}-Seq Data},
387 | author = {Kaminow, Benjamin and Yunusov, Dinar and Dobin, Alexander},
388 | year = {2021},
389 | journal = {bioRxiv : the preprint server for biology},
390 | publisher = {{Cold Spring Harbor Laboratory}}
391 | }
392 |
393 | @article{Kivioja2012,
394 | title = {Counting Absolute Numbers of Molecules Using Unique Molecular Identifiers},
395 | author = {Kivioja, Teemu and V{\"a}h{\"a}rautio, Anna and Karlsson, Kasper and Bonke, Martin and Enge, Martin and Linnarsson, Sten and Taipale, Jussi},
396 | year = {2012},
397 | month = jan,
398 | journal = {Nature Methods},
399 | volume = {9},
400 | number = {1},
401 | pages = {72--74},
402 | issn = {1548-7105},
403 | doi = {10.1038/nmeth.1778},
404 | abstract = {Unique molecular identifiers (UMIs) associate distinct sequences with every DNA or RNA molecule and can be counted after amplification to quantify molecules in the original sample. Using UMIs, the authors obtain a digital karyotype of an individual with Down's syndrome and quantify mRNA in Drosophila melanogaster cells.}
405 | }
406 |
407 | @article{Klein2015,
408 | title = {Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells},
409 | author = {Klein, Allon M. and Mazutis, Linas and Akartuna, Ilke and Tallapragada, Naren and Veres, Adrian and Li, Victor and Peshkin, Leonid and Weitz, David A. and Kirschner, Marc W.},
410 | year = {2015},
411 | month = may,
412 | journal = {Cell},
413 | volume = {161},
414 | number = {5},
415 | pages = {1187--1201},
416 | issn = {1097-4172},
417 | doi = {10.1016/j.cell.2015.04.044},
418 | langid = {english},
419 | keywords = {*Microfluidic Analytical Techniques,Animals,Embryonic Stem Cells/*cytology/metabolism,Gene Expression Profiling/*methods,High-Throughput Nucleotide Sequencing,Mice,Sequence Analysis,Single-Cell Analysis/*methods,RNA/methods}
420 | }
421 |
422 | @article{Krishnaswami2016,
423 | title = {Using Single Nuclei for {{RNA}}-Seq to Capture the Transcriptome of Postmortem Neurons},
424 | author = {Krishnaswami, Suguna Rani and Grindberg, Rashel V. and Novotny, Mark and Venepally, Pratap and Lacar, Benjamin and Bhutani, Kunal and Linker, Sara B. and Pham, Son and Erwin, Jennifer A. and Miller, Jeremy A. and Hodge, Rebecca and McCarthy, James K. and Kelder, Martijn and McCorrison, Jamison and Aevermann, Brian D. and Fuertes, Francisco Diez and Scheuermann, Richard H. and Lee, Jun and Lein, Ed S. and Schork, Nicholas and McConnell, Michael J. and Gage, Fred H. and Lasken, Roger S.},
425 | year = {2016},
426 | month = mar,
427 | journal = {Nature Protocols},
428 | volume = {11},
429 | number = {3},
430 | pages = {499--524},
431 | issn = {1750-2799},
432 | doi = {10.1038/nprot.2016.015}
433 | }
434 |
435 | @article{Lafzi2018,
436 | title = {Tutorial: Guidelines for the Experimental Design of Single-Cell {{RNA}} Sequencing Studies},
437 | author = {Lafzi, Atefeh and Moutinho, Catia and Picelli, Simone and Heyn, Holger},
438 | year = {2018},
439 | month = dec,
440 | journal = {Nature Protocols},
441 | volume = {13},
442 | number = {12},
443 | pages = {2742--2757},
444 | issn = {1750-2799},
445 | doi = {10.1038/s41596-018-0073-y}
446 | }
447 |
448 | @article{Lebrigand2020,
449 | title = {High Throughput Error Corrected {{Nanopore}} Single Cell Transcriptome Sequencing},
450 | author = {Lebrigand, Kevin and Magnone, Virginie and Barbry, Pascal and Waldmann, Rainer},
451 | year = {2020},
452 | month = aug,
453 | journal = {Nature Communications},
454 | volume = {11},
455 | number = {1},
456 | pages = {4025},
457 | issn = {2041-1723},
458 | doi = {10.1038/s41467-020-17800-6}
459 | }
460 |
461 | @article{li2018minimap2,
462 | title = {Minimap2: Pairwise Alignment for Nucleotide Sequences},
463 | author = {Li, Heng},
464 | editor = {Birol, Inanc},
465 | year = {2018},
466 | month = may,
467 | journal = {Bioinformatics (Oxford, England)},
468 | volume = {34},
469 | number = {18},
470 | pages = {3094--3100},
471 | publisher = {{Oxford University Press (OUP)}},
472 | doi = {10.1093/bioinformatics/bty191}
473 | }
474 |
475 | @article{Lun2019,
476 | title = {{{EmptyDrops}}: Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell {{RNA}} Sequencing Data},
477 | author = {Lun, Aaron TL and Riesenfeld, Samantha and Andrews, Tallulah and Gomes, Tomas and Marioni, John C and others},
478 | year = {2019},
479 | journal = {Genome biology},
480 | volume = {20},
481 | number = {1},
482 | pages = {1--9},
483 | publisher = {{BioMed Central}}
484 | }
485 |
486 | @article{marco2021fast,
487 | title = {Fast Gap-Affine Pairwise Alignment Using the Wavefront Algorithm},
488 | author = {{Marco-Sola}, Santiago and Moure, Juan Carlos and Moreto, Miquel and Espinosa, Antonio},
489 | editor = {Robinson, Peter},
490 | year = {2020},
491 | month = sep,
492 | journal = {Bioinformatics (Oxford, England)},
493 | publisher = {{Oxford University Press (OUP)}},
494 | doi = {10.1093/bioinformatics/btaa777}
495 | }
496 |
497 | @article{marco2022optimal,
498 | title = {Optimal Gap-Affine Alignment in {{O}}(s) Space},
499 | author = {{Marco-Sola}, Santiago and Eizenga, Jordan M and Guarracino, Andrea and Paten, Benedict and Garrison, Erik and Moreto, Miquel},
500 | year = {2022},
501 | month = apr,
502 | journal = {bioRxiv : the preprint server for biology},
503 | publisher = {{Cold Spring Harbor Laboratory}},
504 | doi = {10.1101/2022.04.14.488380}
505 | }
506 |
507 | @article{McGinnis2019,
508 | title = {{{DoubletFinder}}: {{Doublet}} Detection in Single-Cell {{RNA}} Sequencing Data Using Artificial Nearest Neighbors},
509 | author = {McGinnis, Christopher S. and Murrow, Lyndsay M. and Gartner, Zev J.},
510 | year = {2019},
511 | month = apr,
512 | journal = {Cell Systems},
513 | volume = {8},
514 | number = {4},
515 | pages = {329--337.e4},
516 | publisher = {{Elsevier BV}},
517 | doi = {10.1016/j.cels.2019.03.003}
518 | }
519 |
520 | @article{Melsted2021,
521 | title = {Modular, Efficient and Constant-Memory Single-Cell {{RNA-seq}} Preprocessing},
522 | author = {Melsted, P{\'a}ll and Booeshaghi, A. Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi and {da Veiga Beltrame}, Eduardo and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Gehring, Jase and Pachter, Lior},
523 | year = {2021},
524 | month = apr,
525 | journal = {Nature Biotechnology},
526 | volume = {39},
527 | number = {7},
528 | pages = {813--818},
529 | publisher = {{Springer Science and Business Media LLC}},
530 | doi = {10.1038/s41587-021-00870-2}
531 | }
532 |
533 | @article{Mereu2020,
534 | title = {Benchmarking Single-Cell {{RNA}}-Sequencing Protocols for Cell Atlas Projects},
535 | author = {Mereu, Elisabetta and Lafzi, Atefeh and Moutinho, Catia and Ziegenhain, Christoph and McCarthy, Davis J. and {\'A}lvarez-Varela, Adri{\'a}n and Batlle, Eduard and {Sagar} and Gr{\"u}n, Dominic and Lau, Julia K. and Boutet, St{\'e}phane C. and Sanada, Chad and Ooi, Aik and Jones, Robert C. and Kaihara, Kelly and Brampton, Chris and Talaga, Yasha and Sasagawa, Yohei and Tanaka, Kaori and Hayashi, Tetsutaro and Braeuning, Caroline and Fischer, Cornelius and Sauer, Sascha and Trefzer, Timo and Conrad, Christian and Adiconis, Xian and Nguyen, Lan T. and Regev, Aviv and Levin, Joshua Z. and Parekh, Swati and Janjic, Aleksandar and Wange, Lucas E. and Bagnoli, Johannes W. and Enard, Wolfgang and Gut, Marta and Sandberg, Rickard and Nikaido, Itoshi and Gut, Ivo and Stegle, Oliver and Heyn, Holger},
536 | year = {2020},
537 | month = jun,
538 | journal = {Nature Biotechnology},
539 | volume = {38},
540 | number = {6},
541 | pages = {747--755},
542 | issn = {1546-1696},
543 | doi = {10.1038/s41587-020-0469-4}
544 | }
545 |
546 | @article{Muskovic2021,
547 | title = {{{DropletQC}}: Improved Identification of Empty Droplets and Damaged Cells in Single-Cell {{RNA}}-Seq Data},
548 | author = {Muskovic, Walter and Powell, Joseph E},
549 | year = {2021},
550 | journal = {Genome Biology},
551 | volume = {22},
552 | number = {1},
553 | pages = {1--9},
554 | publisher = {{Springer}}
555 | }
556 |
557 | @article{Nicolae2011,
558 | title = {Estimation of Alternative Splicing Isoform Frequencies from {{RNA}}-{{Seq}} Data},
559 | author = {Nicolae, Marius and Mangul, Serghei and M{\u a}ndoiu, Ion I and Zelikovsky, Alex},
560 | year = {2011},
561 | journal = {Algorithms for molecular biology},
562 | volume = {6},
563 | number = {1},
564 | pages = {1--13},
565 | publisher = {{Springer}}
566 | }
567 |
568 | @article{niebler2020raindrop,
569 | title = {{{RainDrop}}: {{Rapid}} Activation Matrix Computation for Droplet-Based Single-Cell {{RNA}}-Seq Reads},
570 | author = {Niebler, Stefan and M{\"u}ller, Andr{\'e} and Hankeln, Thomas and Schmidt, Bertil},
571 | year = {2020},
572 | journal = {BMC bioinformatics},
573 | volume = {21},
574 | number = {1},
575 | pages = {1--14},
576 | publisher = {{Springer}}
577 | }
578 |
579 | @article{Ntranos2016,
580 | title = {Fast and Accurate Single-Cell {{RNA}}-Seq Analysis by Clustering of Transcript-Compatibility Counts},
581 | author = {Ntranos, Vasilis and Kamath, Govinda M and Zhang, Jesse M and Pachter, Lior and Tse, David N},
582 | year = {2016},
583 | journal = {Genome biology},
584 | volume = {17},
585 | number = {1},
586 | pages = {1--14},
587 | publisher = {{Springer}}
588 | }
589 |
590 | @article{Ntranos2019,
591 | title = {A Discriminative Learning Approach to Differential Expression Analysis for Single-Cell {{RNA}}-Seq},
592 | author = {Ntranos, Vasilis and Yi, Lynn and Melsted, P{\'a}ll and Pachter, Lior},
593 | year = {2019},
594 | journal = {Nature methods},
595 | volume = {16},
596 | number = {2},
597 | pages = {163--166},
598 | publisher = {{Nature Publishing Group}}
599 | }
600 |
601 | @article{pa:Haber2017,
602 | title = {A Single-Cell Survey of the Small Intestinal Epithelium},
603 | author = {Haber, Adam L. and Biton, Moshe and Rogel, Noga and Herbst, Rebecca H. and Shekhar, Karthik and Smillie, Christopher and Burgin, Grace and Delorey, Toni M. and Howitt, Michael R. and Katz, Yarden and Tirosh, Itay and Beyaz, Semir and Dionne, Danielle and Zhang, Mei and Raychowdhury, Raktima and Garrett, Wendy S. and {Rozenblatt-Rosen}, Orit and Shi, Hai Ning and Yilmaz, Omer and Xavier, Ramnik J. and Regev, Aviv},
604 | year = {2017},
605 | month = nov,
606 | journal = {Nature},
607 | volume = {551},
608 | number = {7680},
609 | pages = {333--339},
610 | issn = {1476-4687},
611 | doi = {10.1038/nature24489},
612 | abstract = {Intestinal epithelial cells absorb nutrients, respond to microbes, function as a barrier and help to coordinate immune responses. Here we report profiling of 53,193 individual epithelial cells from the small intestine and organoids of mice, which enabled the identification and characterization of previously unknown subtypes of intestinal epithelial cell and their gene signatures. We found unexpected diversity in hormone-secreting enteroendocrine cells and constructed the taxonomy of newly identified subtypes, and distinguished between two subtypes of tuft cell, one of which expresses the epithelial cytokine Tslp and the pan-immune marker CD45, which was not previously associated with non-haematopoietic cells. We also characterized the ways in which cell-intrinsic states and the proportions of different cell types respond to bacterial and helminth infections: Salmonella infection caused an increase in the abundance of Paneth cells and enterocytes, and broad activation of an antimicrobial program; Heligmosomoides polygyrus caused an increase in the abundance of goblet and tuft cells. Our survey highlights previously unidentified markers and programs, associates sensory molecules with cell types, and uncovers principles of gut homeostasis and response to pathogens.}
613 | }
614 |
615 | @article{pa:Huber2015,
616 | title = {Orchestrating High-Throughput Genomic Analysis with {{Bioconductor}}},
617 | author = {Huber, Wolfgang and Carey, Vincent J. and Gentleman, Robert and Anders, Simon and Carlson, Marc and Carvalho, Benilton S. and Bravo, Hector Corrada and Davis, Sean and Gatto, Laurent and Girke, Thomas and Gottardo, Raphael and Hahne, Florian and Hansen, Kasper D. and Irizarry, Rafael A. and Lawrence, Michael and Love, Michael I. and MacDonald, James and Obenchain, Valerie and Ole{\'s}, Andrzej K. and Pag{\`e}s, Herv{\'e} and Reyes, Alejandro and Shannon, Paul and Smyth, Gordon K. and Tenenbaum, Dan and Waldron, Levi and Morgan, Martin},
618 | year = {2015},
619 | month = feb,
620 | journal = {Nature Methods},
621 | volume = {12},
622 | number = {2},
623 | pages = {115--121},
624 | issn = {1548-7105},
625 | doi = {10.1038/nmeth.3252}
626 | }
627 |
628 | @article{pa:Lücken2019,
629 | title = {Current Best Practices in Single-Cell {{RNA}}-Seq Analysis: A Tutorial},
630 | author = {Luecken, Malte D and Theis, Fabian J},
631 | year = {2019},
632 | journal = {Molecular Systems Biology},
633 | volume = {15},
634 | number = {6},
635 | eprint = {https://www.embopress.org/doi/pdf/10.15252/msb.20188746},
636 | pages = {e8746},
637 | doi = {10.15252/msb.20188746},
638 | keywords = {analysis pipeline development,computational biology,data analysis tutorial,single-cell RNA-seq}
639 | }
640 |
641 | @article{Patro2014,
642 | title = {Sailfish Enables Alignment-Free Isoform Quantification from {{RNA}}-Seq Reads Using Lightweight Algorithms},
643 | author = {Patro, Rob and Mount, Stephen M and Kingsford, Carl},
644 | year = {2014},
645 | journal = {Nature biotechnology},
646 | volume = {32},
647 | number = {5},
648 | pages = {462--464},
649 | publisher = {{Nature Publishing Group}}
650 | }
651 |
652 | @article{Patro2017,
653 | title = {Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression},
654 | author = {Patro, Rob and Duggal, Geet and Love, Michael I and Irizarry, Rafael A and Kingsford, Carl},
655 | year = {2017},
656 | journal = {Nature methods},
657 | volume = {14},
658 | number = {4},
659 | pages = {417--419},
660 | publisher = {{Nature Publishing Group}}
661 | }
662 |
663 | @article{Philpott2021,
664 | title = {Nanopore Sequencing of Single-Cell Transcriptomes with {{scCOLOR-seq}}},
665 | author = {Philpott, Martin and Watson, Jonathan and Thakurta, Anjan and Brown, Tom and Oppermann, Udo and Cribbs, Adam P.},
666 | year = {2021},
667 | month = dec,
668 | journal = {Nature Biotechnology},
669 | volume = {39},
670 | number = {12},
671 | pages = {1517--1520},
672 | issn = {1546-1696},
673 | doi = {10.1038/s41587-021-00965-w}
674 | }
675 |
676 | @article{Pool2022,
677 | title = {Enhanced Recovery of Single-Cell {{RNA}}-Sequencing Reads for Missing Gene Expression Data},
678 | author = {Pool, Allan-Hermann and Poldsam, Helen and Chen, Sisi and Thomson, Matt and Oka, Yuki},
679 | year = {2022},
680 | journal = {bioRxiv : the preprint server for biology},
681 | eprint = {https://www.biorxiv.org/content/early/2022/04/27/2022.04.26.489449.full.pdf},
682 | publisher = {{Cold Spring Harbor Laboratory}},
683 | doi = {10.1101/2022.04.26.489449},
684 | elocation-id = {2022.04.26.489449}
685 | }
686 |
687 | @article{raw:Cao2019,
688 | title = {The Single-Cell Transcriptional Landscape of Mammalian Organogenesis},
689 | author = {Cao, Junyue and Spielmann, Malte and Qiu, Xiaojie and Huang, Xingfan and Ibrahim, Daniel M. and Hill, Andrew J. and Zhang, Fan and Mundlos, Stefan and Christiansen, Lena and Steemers, Frank J. and Trapnell, Cole and Shendure, Jay},
690 | year = {2019},
691 | month = feb,
692 | journal = {Nature},
693 | volume = {566},
694 | number = {7745},
695 | pages = {496--502},
696 | publisher = {{Springer Science and Business Media LLC}},
697 | doi = {10.1038/s41586-019-0969-x}
698 | }
699 |
700 | @article{raw:He2022,
701 | title = {Alevin-Fry Unlocks Rapid, Accurate and Memory-Frugal Quantification of Single-Cell {{RNA-seq}} Data},
702 | author = {He, Dongze and Zakeri, Mohsen and Sarkar, Hirak and Soneson, Charlotte and Srivastava, Avi and Patro, Rob},
703 | year = {2022},
704 | journal = {Nature Methods},
705 | volume = {19},
706 | number = {3},
707 | pages = {316--322},
708 | publisher = {{Nature Publishing Group}}
709 | }
710 |
711 | @article{raw:Macosko2015,
712 | title = {Highly Parallel Genome-Wide Expression Profiling of Individual Cells Using Nanoliter Droplets},
713 | author = {Macosko, Evan Z. and Basu, Anindita and Satija, Rahul and Nemesh, James and Shekhar, Karthik and Goldman, Melissa and Tirosh, Itay and Bialas, Allison R. and Kamitaki, Nolan and Martersteck, Emily M. and Trombetta, John J. and Weitz, David A. and Sanes, Joshua R. and Shalek, Alex K. and Regev, Aviv and McCarroll, Steven A.},
714 | year = {2015},
715 | month = may,
716 | journal = {Cell},
717 | volume = {161},
718 | number = {5},
719 | pages = {1202--1214},
720 | publisher = {{Elsevier}},
721 | issn = {0092-8674},
722 | doi = {10.1016/j.cell.2015.05.002}
723 | }
724 |
725 | @article{raw:Young2020,
726 | title = {{{SoupX}} Removes Ambient {{RNA}} Contamination from Droplet-Based Single-Cell {{RNA}} Sequencing Data},
727 | author = {Young, Matthew D and Behjati, Sam},
728 | year = {2020},
729 | month = dec,
730 | journal = {GigaScience},
731 | volume = {9},
732 | number = {12},
733 | publisher = {{Oxford University Press (OUP)}},
734 | doi = {10.1093/gigascience/giaa151}
735 | }
736 |
737 | @article{raw:Zappia2021,
738 | title = {Over 1000 Tools Reveal Trends in the Single-Cell {{RNA-seq}} Analysis Landscape},
739 | author = {Zappia, Luke and Theis, Fabian J.},
740 | year = {2021},
741 | month = oct,
742 | journal = {Genome Biology},
743 | volume = {22},
744 | number = {1},
745 | pages = {301},
746 | issn = {1474-760X},
747 | doi = {10.1186/s13059-021-02519-4}
748 | }
749 |
750 | @article{raw:Zheng2017,
751 | title = {Massively Parallel Digital Transcriptional Profiling of Single Cells},
752 | author = {Zheng, Grace X. Y. and Terry, Jessica M. and Belgrader, Phillip and Ryvkin, Paul and Bent, Zachary W. and Wilson, Ryan and Ziraldo, Solongo B. and Wheeler, Tobias D. and McDermott, Geoff P. and Zhu, Junjie and Gregory, Mark T. and Shuga, Joe and Montesclaros, Luz and Underwood, Jason G. and Masquelier, Donald A. and Nishimura, Stefanie Y. and {Schnall-Levin}, Michael and Wyatt, Paul W. and Hindson, Christopher M. and Bharadwaj, Rajiv and Wong, Alexander and Ness, Kevin D. and Beppu, Lan W. and Deeg, H. Joachim and McFarland, Christopher and Loeb, Keith R. and Valente, William J. and Ericson, Nolan G. and Stevens, Emily A. and Radich, Jerald P. and Mikkelsen, Tarjei S. and Hindson, Benjamin J. and Bielas, Jason H.},
753 | year = {2017},
754 | month = jan,
755 | journal = {Nature Communications},
756 | volume = {8},
757 | number = {1},
758 | pages = {14049},
759 | issn = {2041-1723},
760 | doi = {10.1038/ncomms14049}
761 | }
762 |
763 | @article{rognes2000six,
764 | title = {Six-Fold Speed-up of {{Smith}}\textendash{{Waterman}} Sequence Database Searches Using Parallel Processing on Common Microprocessors},
765 | author = {Rognes, Torbj{\o}rn and Seeberg, Erling},
766 | year = {2000},
767 | journal = {Bioinformatics (Oxford, England)},
768 | volume = {16},
769 | number = {8},
770 | pages = {699--706},
771 | publisher = {{Oxford University Press}}
772 | }
773 |
774 | @article{sierra,
775 | title = {Sierra: Discovery of Differential Transcript Usage from {{polyA-captured}} Single-Cell {{RNA}}-Seq Data},
776 | author = {Patrick, Ralph and Humphreys, David T. and Janbandhu, Vaibhao and Oshlack, Alicia and Ho, Joshua W.K. and Harvey, Richard P. and Lo, Kitty K.},
777 | year = {2020},
778 | month = jul,
779 | journal = {Genome Biology},
780 | volume = {21},
781 | number = {1},
782 | publisher = {{Springer Science and Business Media LLC}},
783 | doi = {10.1186/s13059-020-02071-7}
784 | }
785 |
786 | @article{Singh2019,
787 | title = {High-Throughput Targeted Long-Read Single Cell Sequencing Reveals the Clonal and Transcriptional Landscape of Lymphocytes},
788 | author = {Singh, Mandeep and {Al-Eryani}, Ghamdan and Carswell, Shaun and Ferguson, James M. and Blackburn, James and Barton, Kirston and Roden, Daniel and Luciani, Fabio and Giang Phan, Tri and Junankar, Simon and Jackson, Katherine and Goodnow, Christopher C. and Smith, Martin A. and Swarbrick, Alexander},
789 | year = {2019},
790 | month = jul,
791 | journal = {Nature Communications},
792 | volume = {10},
793 | number = {1},
794 | pages = {3120},
795 | issn = {2041-1723},
796 | doi = {10.1038/s41467-019-11049-4}
797 | }
798 |
799 | @article{Smith2017,
800 | title = {{{UMI-tools}}: Modeling Sequencing Errors in {{Unique Molecular Identifiers}} to Improve Quantification Accuracy},
801 | author = {Smith, Tom and Heger, Andreas and Sudbery, Ian},
802 | year = {2017},
803 | journal = {Genome research},
804 | volume = {27},
805 | number = {3},
806 | pages = {491--499},
807 | publisher = {{Cold Spring Harbor Lab}}
808 | }
809 |
810 | @article{Soneson2021Preprocessing,
811 | title = {Preprocessing Choices Affect {{RNA}} Velocity Results for Droplet {{scRNA-seq}} Data},
812 | author = {Soneson, Charlotte and Srivastava, Avi and Patro, Rob and Stadler, Michael B},
813 | year = {2021},
814 | journal = {PLoS computational biology},
815 | volume = {17},
816 | number = {1},
817 | pages = {e1008585},
818 | publisher = {{Public Library of Science San Francisco, CA USA}}
819 | }
820 |
821 | @article{srivastava2016rapmap,
822 | title = {{{RapMap}}: A Rapid, Sensitive and Accurate Tool for Mapping {{RNA}}-Seq Reads to Transcriptomes},
823 | author = {Srivastava, Avi and Sarkar, Hirak and Gupta, Nitish and Patro, Rob},
824 | year = {2016},
825 | journal = {Bioinformatics (Oxford, England)},
826 | volume = {32},
827 | number = {12},
828 | pages = {i192--i200},
829 | publisher = {{Oxford University Press}}
830 | }
831 |
832 | @article{Srivastava2019,
833 | title = {Alevin Efficiently Estimates Accurate Gene Abundances from {{dscRNA-seq}} Data},
834 | author = {Srivastava, Avi and Malik, Laraib and Smith, Tom and Sudbery, Ian and Patro, Rob},
835 | year = {2019},
836 | journal = {Genome biology},
837 | volume = {20},
838 | number = {1},
839 | pages = {1--16},
840 | publisher = {{Springer}}
841 | }
842 |
843 | @article{Srivastava2020-lf,
844 | title = {A {{Bayesian}} Framework for Inter-Cellular Information Sharing Improves {{dscRNA-seq}} Quantification},
845 | author = {Srivastava, Avi and Malik, Laraib and Sarkar, Hirak and Patro, Rob},
846 | year = {2020},
847 | month = jul,
848 | journal = {Bioinformatics (Oxford, England)},
849 | volume = {36},
850 | number = {Supplement\_1},
851 | pages = {i292--i299},
852 | publisher = {{Oxford University Press (OUP)}},
853 | doi = {10.1093/bioinformatics/btaa450}
854 | }
855 |
856 | @article{Srivastava2020Alignment,
857 | title = {Alignment and Mapping Methodology Influence Transcript Abundance Estimation},
858 | author = {Srivastava, Avi and Malik, Laraib and Sarkar, Hirak and Zakeri, Mohsen and Almodaresi, Fatemeh and Soneson, Charlotte and Love, Michael I and Kingsford, Carl and Patro, Rob},
859 | year = {2020},
860 | journal = {Genome Biology},
861 | volume = {21},
862 | number = {1},
863 | pages = {1--29},
864 | publisher = {{BioMed Central}}
865 | }
866 |
867 | @article{Suzuki2018,
868 | title = {Introducing Difference Recurrence Relations for Faster Semi-Global Alignment of Long Sequences},
869 | author = {Suzuki, Hajime and Kasahara, Masahiro},
870 | year = {2018},
871 | month = feb,
872 | journal = {BMC Bioinformatics},
873 | volume = {19},
874 | number = {S1},
875 | publisher = {{Springer Science and Business Media LLC}},
876 | doi = {10.1186/s12859-018-2014-8}
877 | }
878 |
879 | @article{Svensson2017,
880 | title = {Power Analysis of Single-Cell {{RNA}}-Sequencing Experiments},
881 | author = {Svensson, Valentine and Natarajan, Kedar Nath and Ly, Lam-Ha and Miragaia, Ricardo J. and Labalette, Charlotte and Macaulay, Iain C. and Cvejic, Ana and Teichmann, Sarah A.},
882 | year = {2017},
883 | month = apr,
884 | journal = {Nature Methods},
885 | volume = {14},
886 | number = {4},
887 | pages = {381--387},
888 | issn = {1548-7105},
889 | doi = {10.1038/nmeth.4220}
890 | }
891 |
892 | @article{Tasic2018,
893 | title = {Shared and Distinct Transcriptomic Cell Types across Neocortical Areas},
894 | author = {Tasic, Bosiljka and Yao, Zizhen and Graybuck, Lucas T. and Smith, Kimberly A. and Nguyen, Thuc Nghi and Bertagnolli, Darren and Goldy, Jeff and Garren, Emma and Economo, Michael N. and Viswanathan, Sarada and Penn, Osnat and Bakken, Trygve and Menon, Vilas and Miller, Jeremy and Fong, Olivia and Hirokawa, Karla E. and Lathia, Kanan and Rimorin, Christine and Tieu, Michael and Larsen, Rachael and Casper, Tamara and Barkan, Eliza and Kroll, Matthew and Parry, Sheana and Shapovalova, Nadiya V. and Hirschstein, Daniel and Pendergraft, Julie and Sullivan, Heather A. and Kim, Tae Kyung and Szafer, Aaron and Dee, Nick and Groblewski, Peter and Wickersham, Ian and Cetin, Ali and Harris, Julie A. and Levi, Boaz P. and Sunkin, Susan M. and Madisen, Linda and Daigle, Tanya L. and Looger, Loren and Bernard, Amy and Phillips, John and Lein, Ed and Hawrylycz, Michael and Svoboda, Karel and Jones, Allan R. and Koch, Christof and Zeng, Hongkui},
895 | year = {2018},
896 | month = nov,
897 | journal = {Nature},
898 | volume = {563},
899 | number = {7729},
900 | pages = {72--78},
901 | issn = {1476-4687},
902 | doi = {10.1038/s41586-018-0654-5},
903 | abstract = {The neocortex contains a multitude of cell types that are segregated into layers and functionally distinct areas. To investigate the diversity of cell types across the mouse neocortex, here we analysed 23,822 cells from two areas at distant poles of the mouse neocortex: the primary visual cortex and the anterior lateral motor cortex. We define 133 transcriptomic cell types by deep, single-cell RNA sequencing. Nearly all types of GABA ({$\gamma$}-aminobutyric acid)-containing neurons are shared across both areas, whereas most types of glutamatergic neurons were found in one of the two areas. By combining single-cell RNA sequencing and retrograde labelling, we match transcriptomic types of glutamatergic neurons to their long-range projection specificity. Our study establishes a combined transcriptomic and projectional taxonomy of cortical cell types from functionally distinct areas of the adult mouse cortex.}
904 | }
905 |
906 | @misc{technote_10x_intronic_reads,
907 | title = {Technical Note - Interpreting Intronic and Antisense Reads in 10x Genomics Single Cell Gene Expression Data},
908 | author = {{10x Genomics}},
909 | year = {2021},
910 | month = aug
911 | }
912 |
913 | @article{Turro2011,
914 | title = {Haplotype and Isoform Specific Expression Estimation Using Multi-Mapping {{RNA}}-Seq Reads},
915 | author = {Turro, Ernest and Su, Shu-Yi and Gon{\c c}alves, {\^A}ngela and Coin, Lachlan JM and Richardson, Sylvia and Lewin, Alex},
916 | year = {2011},
917 | journal = {Genome biology},
918 | volume = {12},
919 | number = {2},
920 | pages = {1--15},
921 | publisher = {{Springer}}
922 | }
923 |
924 | @article{umic,
925 | title = {{{UMIc}}: {{A}} Preprocessing Method for {{UMI}} Deduplication and Reads Correction},
926 | author = {Tsagiopoulou, Maria and Maniou, Maria Christina and Pechlivanis, Nikolaos and Togkousidis, Anastasis and Kotrov{\'a}, Michaela and Hutzenlaub, Tobias and Kappas, Ilias and Chatzidimitriou, Anastasia and Psomopoulos, Fotis},
927 | year = {2021},
928 | month = may,
929 | journal = {Frontiers in Genetics},
930 | volume = {12},
931 | publisher = {{Frontiers Media SA}},
932 | doi = {10.3389/fgene.2021.660366}
933 | }
934 |
935 | @article{Wang2021,
936 | title = {Direct Comparative Analyses of {{10X}} Genomics Chromium and Smart-Seq2},
937 | author = {Wang, Xiliang and He, Yao and Zhang, Qiming and Ren, Xianwen and Zhang, Zemin},
938 | year = {2021},
939 | journal = {Genomics, Proteomics \& Bioinformatics},
940 | volume = {19},
941 | number = {2},
942 | pages = {253--266},
943 | issn = {1672-0229},
944 | doi = {10.1016/j.gpb.2020.02.005},
945 | keywords = {10X,Bulk RNA-seq,Comparison,Single-cell RNA sequencing,Smart-seq2}
946 | }
947 |
948 | @article{Wolock2019,
949 | title = {Scrublet: {{Computational}} Identification of Cell Doublets in Single-Cell Transcriptomic Data},
950 | author = {Wolock, Samuel L. and Lopez, Romain and Klein, Allon M.},
951 | year = {2019},
952 | month = apr,
953 | journal = {Cell Systems},
954 | volume = {8},
955 | number = {4},
956 | pages = {281--291.e9},
957 | publisher = {{Elsevier BV}},
958 | doi = {10.1016/j.cels.2018.11.005}
959 | }
960 |
961 | @article{wozniak1997using,
962 | title = {Using Video-Oriented Instructions to Speed up Sequence Comparison},
963 | author = {Wozniak, Andrzej},
964 | year = {1997},
965 | journal = {Bioinformatics (Oxford, England)},
966 | volume = {13},
967 | number = {2},
968 | pages = {145--150},
969 | publisher = {{Oxford University Press}}
970 | }
971 |
972 | @article{You_2021,
973 | title = {Benchmarking {{UMI-based}} Single-Cell {{RNA}}-Seq Preprocessing Workflows},
974 | author = {You, Yue and Tian, Luyi and Su, Shian and Dong, Xueyi and Jabbari, Jafar S. and Hickey, Peter F. and Ritchie, Matthew E.},
975 | year = {2021},
976 | month = dec,
977 | journal = {Genome Biology},
978 | volume = {22},
979 | number = {1},
980 | publisher = {{Springer Science and Business Media LLC}},
981 | doi = {10.1186/s13059-021-02552-3}
982 | }
983 |
984 | @article{zhang2000,
985 | title = {A Greedy Algorithm for Aligning {{DNA}} Sequences},
986 | author = {Zhang, Zheng and Schwartz, Scott and Wagner, Lukas and Miller, Webb},
987 | year = {2000},
988 | month = feb,
989 | journal = {Journal of Computational Biology},
990 | volume = {7},
991 | number = {1-2},
992 | pages = {203--214},
993 | publisher = {{Mary Ann Liebert Inc}},
994 | doi = {10.1089/10665270050081478}
995 | }
996 |
997 | @article{Zhang2019,
998 | title = {Comparative Analysis of Droplet-Based Ultra-High-Throughput Single-Cell {{RNA}}-{{Seq}} Systems},
999 | author = {Zhang, Xiannian and Li, Tianqi and Liu, Feng and Chen, Yaqi and Yao, Jiacheng and Li, Zeyao and Huang, Yanyi and Wang, Jianbin},
1000 | year = {2019},
1001 | month = jan,
1002 | journal = {Molecular Cell},
1003 | volume = {73},
1004 | number = {1},
1005 | pages = {130--142.e5},
1006 | publisher = {{Elsevier}},
1007 | issn = {1097-2765},
1008 | doi = {10.1016/j.molcel.2018.10.020}
1009 | }
1010 |
1011 | @article{Ziegenhain2017,
1012 | title = {Comparative Analysis of {{Single-Cell}} {{RNA}} Sequencing Methods},
1013 | author = {Ziegenhain, Christoph and Vieth, Beate and Parekh, Swati and Reinius, Bj{\"o}rn and {Guillaumet-Adkins}, Amy and Smets, Martha and Leonhardt, Heinrich and Heyn, Holger and Hellmann, Ines and Enard, Wolfgang},
1014 | year = {2017},
1015 | month = feb,
1016 | journal = {Molecular cell},
1017 | volume = {65},
1018 | number = {4},
1019 | pages = {631--643.e4},
1020 | address = {{United States}},
1021 | langid = {english},
1022 | keywords = {cost-effectiveness,method comparison,power analysis,simulation,single-cell RNA-seq,transcriptomics}
1023 | }
1024 |
1025 | @article{ziegenhain2022molecular,
1026 | title = {Molecular Spikes: A Gold Standard for Single-Cell {{RNA}} Counting},
1027 | author = {Ziegenhain, Christoph and Hendriks, Gert-Jan and {Hagemann-Jensen}, Michael and Sandberg, Rickard},
1028 | year = {2022},
1029 | journal = {Nature Methods},
1030 | volume = {19},
1031 | number = {5},
1032 | pages = {560--566},
1033 | publisher = {{Nature Publishing Group}}
1034 | }
1035 |
1036 | @article{zumis,
1037 | title = {{{zUMIs}} - {{A}} Fast and Flexible Pipeline to Process {{RNA}} Sequencing Data with {{UMIs}}},
1038 | author = {Parekh, Swati and Ziegenhain, Christoph and Vieth, Beate and Enard, Wolfgang and Hellmann, Ines},
1039 | year = {2018},
1040 | month = may,
1041 | journal = {GigaScience},
1042 | volume = {7},
1043 | number = {6},
1044 | publisher = {{Oxford University Press (OUP)}},
1045 | doi = {10.1093/gigascience/giy059}
1046 | }
1047 |
--------------------------------------------------------------------------------
/mkdocs.yml:
--------------------------------------------------------------------------------
1 | site_name: single_cell_tutorial Readthedocs
2 | repo_url: https://github.com/Starlitnightly/single_cell_tutorial
3 | site_author: "Zehua Zeng"
4 | copyright: "Copyright © 2019-2023, 112 Lab, USTB"
5 |
6 | nav:
7 |   - Preface: index.md
8 |   - Introduction:
9 |     - 1-1. Existing technologies: Introduction/1-1.md
10 |     - 1-2. Single-cell RNA sequencing: Introduction/1-2.md
11 |     - 1-3. Raw data processing: error.md
12 |     - 1-4. Analysis frameworks and tools: Introduction/1-4.ipynb
13 |     - 1-5. Environment setup: error.md
14 |   - Preprocessing and visualization:
15 |     - 2-1. Quality control: preprocess/2-1.ipynb
16 |     - 2-2. Normalization: preprocess/2-2.ipynb
17 |     - 2-3. Feature selection: preprocess/2-3.ipynb
18 |     - 2-4. Dimensionality reduction: preprocess/2-4.ipynb
19 |     - 2-5. Batch effect correction: preprocess/2-5.ipynb
20 |   - Cell type identification:
21 |     - 3-1. Clustering: Identifying cellular structure/3-1.ipynb
22 |     - 3-2. Manual annotation: Identifying cellular structure/3-2.ipynb
23 |     - 3-3. Automated annotation (1): Identifying cellular structure/3-3.ipynb
24 |     - 3-4. Automated annotation (2): Identifying cellular structure/3-4.ipynb
25 |     - 3-5. Annotation transfer: Identifying cellular structure/3-5.ipynb
26 |   - Cell state analysis:
27 |     - 4-1. Differential expression analysis: stages/4-1.ipynb
28 |     - 4-2. Cell composition analysis: stages/4-2.ipynb
29 |     - 4-3. Cell perturbation analysis: stages/4-3.ipynb
30 |     - 4-4. Gene set and pathway enrichment analysis: error.md
31 |     - 4-5. Cell silencing analysis: error.md
32 |   - Trajectory inference:
33 |     - 5-1. Pseudotime computation: trajectory/5-1.ipynb
34 |     - 5-2. RNA velocity model: trajectory/5-2.ipynb
35 |     - 5-3. Cell fate inference: error.md
36 |   - Regulatory modeling:
37 |     - 6-1. Gene regulatory networks (GRN): error.md
38 |     - 6-2. Cell-cell communication: error.md
39 |   - Others:
40 |     - 7-1. Public data release on NCBI: others/7-1.md
41 |
42 |
43 |
44 |
45 | plugins:
46 |   - mkdocs-jupyter
47 |   - bibtex:
48 |       bib_file: "docs/ref/ref.bib"
49 |
50 | theme:
51 |   name: material
52 |   custom_dir: docs/overrides
53 |
54 |   palette:
55 |     - media: "(prefers-color-scheme: light)"
56 |       scheme: default
57 |       toggle:
58 |         icon: material/toggle-switch-off-outline
59 |         name: Switch to dark mode
60 |     - media: "(prefers-color-scheme: dark)"
61 |       scheme: slate
62 |       toggle:
63 |         icon: material/toggle-switch
64 |         name: Switch to light mode
65 |
66 |   features:
67 |     - navigation.instant
68 |     - navigation.tracking
69 |     - navigation.indexes
70 |
71 | markdown_extensions:
72 |   - admonition
73 |   - pymdownx.details
74 |   - attr_list
75 |   - md_in_html
76 |   - pymdownx.arithmatex:
77 |       generic: true
78 |   - pymdownx.highlight:
79 |       linenums: true
80 |       linenums_style: pymdownx-inline
81 |   - pymdownx.superfences:
82 |       custom_fences:
83 |         - name: mermaid
84 |           class: mermaid
85 |           format: !!python/name:pymdownx.superfences.fence_code_format
86 |   - pymdownx.inlinehilite
87 |   - footnotes
88 |
89 | extra_javascript:
90 |   - javascripts/config.js
91 |   - https://polyfill.io/v3/polyfill.min.js?features=es6
92 |   - https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
93 |
94 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | omicverse
2 | pandas==1.5.3
3 | scanpy
4 | gseapy==0.10.8
5 | matplotlib
6 | seaborn
7 | scikit-learn
8 | sphinx_autosummary_accessors
9 | sphinx_autodoc_typehints
10 | recommonmark
11 | sphinx_markdown_tables
12 | sphinx_copybutton
13 | nbsphinx
14 | IPython
15 | ipywidgets
16 | lifelines
17 | boltons
18 | ctxcore
19 | multiprocess
20 | ktplotspy
21 | leidenalg
22 | datashader
23 | graphtools
24 | igraph
25 | phate
26 | jinja2==3.1.2
27 | griffe==0.26.0
28 | mkdocs==1.4.2
29 | mkdocs-jupyter==0.23.0
30 | mkdocs-material==9.1.2
31 | mkdocs-glightbox==0.3.2
32 | mkdocstrings==0.20.0
33 | mkdocs-gen-files==0.4.0
34 | mkdocstrings-python==0.9.0
35 | python-dotplot
36 | metatime
37 | tensorboard
38 | mkdocs-bibtex
39 |
--------------------------------------------------------------------------------