├── LICENSE ├── README.md ├── assets ├── change_in_pseudotime.png ├── divergence_map.png ├── phase_simulations.png └── velocity_confidence.png ├── requirements.txt ├── setup.py └── velodyn ├── __init__.py ├── velocity_ci.py ├── velocity_divergence.py ├── velocity_dpst.py └── velocity_dynsys.py /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | 3 | Version 2.0, January 2004 4 | 5 | http://www.apache.org/licenses/ 6 | 7 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 8 | 9 | 1. Definitions. 10 | 11 | "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. 16 | 17 | "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. 18 | 19 | "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. 20 | 21 | "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. 22 | 23 | "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). 24 | 25 | "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. 26 | 27 | "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." 
28 | 29 | "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 30 | 31 | 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 32 | 33 | 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 34 | 35 | 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: 36 | 37 | You must give any other recipients of the Work or Derivative Works a copy of this License; and 38 | You must cause any modified files to carry prominent notices stating that You changed the files; and 39 | You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and 40 | If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. 
41 | 42 | You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 43 | 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 44 | 45 | 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 46 | 47 | 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 48 | 49 | 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 50 | 51 | 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. 52 | 53 | END OF TERMS AND CONDITIONS -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # VeloDyn -- Quantitative analysis of RNA velocity 2 | 3 | RNA velocity infers a rate of change for each transcript in an RNA-sequencing experiment based on the ratio of intronic to exonic reads. 
4 | This inferred velocity vector serves as a prediction for the *future* transcriptional state of a cell, while the current read counts serve as a measurement of the instantaneous state.
5 | Qualitative analysis of RNA velocity has been used to establish the order of gene expression states in a sequence, but quantitative analysis has generally been lacking.
6 | 
7 | `velodyn` adopts formalisms from dynamical systems to provide a quantitative framework for RNA velocity analysis.
8 | The tools provided by `velodyn` along with their associated usage are described below.
9 | All `velodyn` tools are designed to integrate with the `scanpy` ecosystem and `anndata` structures.
10 | 
11 | We have released `velodyn` in association with a recent paper.
12 | Please cite our paper if you find `velodyn` useful for your work.
13 | 
14 | 
15 | [**Differentiation reveals latent features of aging and an energy barrier in murine myogenesis**](https://pubmed.ncbi.nlm.nih.gov/33910007/)
16 | Jacob C Kimmel, Nelda Yi, Margaret Roy, David G Hendrickson, David R Kelley
17 | *Cell Reports* 2021, 35 (4); doi: https://doi.org/10.1016/j.celrep.2021.109046
18 | 
19 | **BibTeX**
20 | 
21 | ```
22 | @article{kimmel_latent_2021,
23 |     title = {Differentiation reveals latent features of aging and an energy barrier in murine myogenesis},
24 |     volume = {35},
25 |     issn = {2211-1247},
26 |     url = {https://www.cell.com/cell-reports/abstract/S2211-1247(21)00362-4},
27 |     doi = {10.1016/j.celrep.2021.109046},
28 |     language = {English},
29 |     number = {4},
30 |     urldate = {2021-05-19},
31 |     journal = {Cell Reports},
32 |     author = {Kimmel, Jacob C. and Yi, Nelda and Roy, Margaret and Hendrickson, David G. and Kelley, David R.},
33 |     month = apr,
34 |     year = {2021},
35 |     pmid = {33910007},
36 |     note = {Publisher: Elsevier},
37 |     keywords = {aging, dynamical systems, fibro/adipogenic progenitor, muscle stem cell, myogenesis, RNA-seq, single cell, stem cell}
38 | }
39 | ```
40 | 
41 | If you have any questions or comments, please feel free to email me.
42 | 
43 | Jacob C. Kimmel, PhD
44 | [jacobkimmel+velodyn@gmail.com](mailto:jacobkimmel+velodyn@gmail.com)
45 | Calico Life Sciences, LLC
46 | 
47 | 
48 | ## Installation
49 | 
50 | ```bash
51 | git clone https://github.com/calico/velodyn
52 | cd velodyn
53 | pip install .
54 | ```
55 | 
56 | or
57 | 
58 | ```bash
59 | pip install velodyn
60 | ```
61 | 
62 | ## Tutorial
63 | 
64 | We have provided a `velodyn` tutorial using the Colab computing environment from Google.
65 | This notebook allows you to execute a `velodyn` workflow, end-to-end, entirely within your web browser.
66 | 
67 | [velodyn tutorial](https://colab.research.google.com/drive/1JMjw_nJYHmOAEn7ZHL8q2MQbyxmphbni)
68 | 
69 | ## Gene expression state stability measurements
70 | 
71 | `velodyn` can provide a quantitative measure of gene expression state stability based on the divergence of the RNA velocity field.
72 | The divergence measures the net flow of cells into or out of a region of state space and is frequently used to characterize vector fields in physical systems.
73 | Divergence measures can reveal stable attractor states and unstable repulsor states in gene expression space.
74 | For example, we computed the divergence of gene expression states during myogenic differentiation and identified two attractor states, separated by a repulsor state.
75 | This repulsor state is unstable, suggesting it represents a decision point at which cells commit to one of the attractor states.
76 | 
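As a toy illustration of the underlying idea (this is not the `velodyn` API, and all names below are hypothetical), the divergence of a two-dimensional velocity field is the sum of the partial derivatives of each velocity component along its own axis, which can be estimated with finite differences:

```python
import numpy as np

# toy velocity field on a 30 x 30 grid, flowing toward the origin;
# a sink like this acts as an attractor and has negative divergence
xs = np.linspace(-1., 1., 30)
X, Y = np.meshgrid(xs, xs)
Vx, Vy = -X, -Y

# divergence = dVx/dx + dVy/dy, estimated with finite differences
div = np.gradient(Vx, xs, axis=1) + np.gradient(Vy, xs, axis=0)
print(div.mean())  # ~ -2.0 everywhere: a uniform sink (attractor)
```

`velodyn` applies the same computation to a grid of smoothed RNA velocity vectors, as shown below.
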
77 | ![Divergence maps of myogenic differentiation. Two attractor states along a one-dimensional manifold are separated by a repulsor state in the center.](assets/divergence_map.png)
78 | 
79 | 
80 | ### Usage
81 | 
82 | ```python
83 | from velodyn.velocity_divergence import compute_div, plot_div
84 | 
85 | D = compute_div(
86 |     adata=adata,
87 |     use_rep='pca',
88 |     n_grid_points=30,
89 | )
90 | print(D.shape)  # (30, 30)
91 | 
92 | fig, ax = plot_div(D)
93 | ```
94 | 
95 | ## State transition rate comparisons with phase simulations
96 | 
97 | Across experimental conditions, the rates of change in gene expression space may differ significantly.
98 | However, it is difficult to determine where RNA velocity fields differ across conditions, and what impact any differences may have on the transit time between states.
99 | In dynamical systems, phase point analysis is used to quantify the integrated behavior of a vector field.
100 | For a review of phase point simulation methods, we highly recommend *Nonlinear Dynamics & Chaos* by Steven Strogatz.
101 | 
102 | In brief, a phase point simulation instantiates a particle ("phase point") at some position in a vector field.
103 | The position of the particle is updated ("evolved") over a number of timesteps using numerical methods.
104 | 
105 | For `velodyn`, we implement our update step using a stochastic weighted nearest neighbors model.
106 | We have a collection of observed cells and their associated velocity vectors as the source of our vector field.
107 | For each point at each timestep, we estimate the parameters of a Gaussian distribution of possible update steps based on the mean and variance of observed velocity vectors in neighboring cells.
108 | We then draw a sample from this distribution to update the position of the phase point.
109 | The stochastic nature of this evolution mirrors the stochastic nature of gene expression.
110 | 
111 | By applying phase point simulations to RNA velocity fields, `velodyn` allows for comparisons of state transition rates across experimental conditions.
112 | For example, we used phase point simulations to analyze the rate of myogenic differentiation in young and aged muscle stem cells.
113 | These analyses revealed that aged cells progress more slowly toward the differentiated state than their young counterparts.
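To make the update rule concrete, the sketch below evolves a single phase point with a stochastic nearest-neighbor step. It is a simplified stand-in for `velodyn`'s implementation, and the arrays `X` (cell embedding coordinates) and `V` (cell velocity vectors) are hypothetical placeholders:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))             # cell positions (hypothetical)
V = rng.normal(scale=0.1, size=(500, 2))  # cell velocities (hypothetical)

nn = NearestNeighbors(n_neighbors=20).fit(X)

def step(point, step_scale=0.5):
    """One stochastic kNN update: draw a velocity from a Gaussian
    parameterized by the neighboring cells' mean and std. dev."""
    _, idx = nn.kneighbors(point[None, :])
    neighbor_v = V[idx[0]]
    mu, sigma = neighbor_v.mean(axis=0), neighbor_v.std(axis=0)
    return point + step_scale * rng.normal(mu, sigma)

# evolve one phase point for 100 timesteps
trajectory = [X[0]]
for _ in range(100):
    trajectory.append(step(trajectory[-1]))
```

Because each step is sampled rather than deterministic, repeated simulations from the same starting point yield a distribution of trajectories and transit times.
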
114 | 
115 | ![Phase point simulations show the direction and rate of motion in an RNA velocity field.](assets/phase_simulations.png)
116 | 
117 | ### Usage
118 | 
119 | ```python
120 | from velodyn.velocity_dynsys import PhaseSimulation
121 | 
122 | simulator = PhaseSimulation(
123 |     adata=adata,
124 | )
125 | # set the velocity basis to use
126 | simulator.set_velocity_field(basis='pca')
127 | # set starting locations for phase points
128 | # using a categorical variable in `adata.obs`
129 | simulator.set_starting_point(
130 |     method='metadata',
131 |     groupby='starting_points',
132 |     group='forward',
133 | )
134 | # run simulations using the stochastic kNN velocity estimator
135 | trajectories = simulator.simulate_phase_points(
136 |     n_points=n_points_to_simulate,
137 |     n_timesteps=n_timesteps_to_simulate,
138 |     velocity_method='knn',
139 |     velocity_method_attrs={'vknn_method': 'stochastic'},
140 |     step_scale=float(step_scale),
141 |     multiprocess=True,  # use multiple cores
142 | )
143 | 
144 | print(trajectories.shape)
145 | # [
146 | #     n_points_to_simulate,
147 | #     n_timesteps,
148 | #     n_embedding_dims,
149 | #     (position, velocity_mean, velocity_std),
150 | # ]
151 | ```
152 | 
153 | ## Change in pseudotime predictions
154 | 
155 | Dynamic cell state transitions are often parameterized by a pseudotime curve, as introduced by Cole Trapnell in `monocle`.
156 | Given RNA velocity vectors and pseudotime coordinates, `velodyn` can predict a "change in pseudotime" for each individual cell.
157 | The procedure for predicting a change in pseudotime is fairly simple.
158 | `velodyn` trains a machine learning model to predict pseudotime coordinates from gene expression embedding coordinates (e.g. coordinates in principal component space).
159 | The future position of each cell in this embedding is computed as the current position shifted by the RNA velocity vector, and a new pseudotime coordinate is predicted for this future position using the trained model.
160 | The "change in pseudotime" is then returned as the difference between the pseudotime coordinate for the predicted future point and the pseudotime coordinate for the observed point.
161 | 
162 | ![Change in pseudotime is predicted using a machine learning model for each cell.](assets/change_in_pseudotime.png)
163 | 
164 | ### Usage
165 | 
166 | ```python
167 | from velodyn.velocity_dpst import dPseudotime
168 | 
169 | DPST = dPseudotime(
170 |     adata=adata,
171 |     use_rep='pca',
172 |     pseudotime_var='dpt_pseudotime',
173 | )
174 | change_in_pseudotime = DPST.predict_dpst()
175 | ```
176 | 
177 | ## Velocity confidence intervals
178 | 
179 | RNA velocity estimates for each cell are incredibly useful, but there is no notion of variance inherent to the inference procedure.
180 | If we wish to make comparisons between cells moving in different directions in gene expression space, we require confidence intervals on each cell's RNA velocity vector.
181 | `velodyn` introduces a molecular parametric bootstrapping procedure to compute these confidence intervals.
182 | Briefly, we parameterize a multinomial distribution across genes using the mRNA profile for each cell.
183 | We then parameterize a second multinomial distribution for each gene in each cell based on the observed counts of spliced, unspliced, and ambiguous reads.
184 | We sample reads to the observed depth across genes, use the gene-level multinomial to distribute these reads as spliced, unspliced, or ambiguous observations, and repeat this procedure many times for each cell.
185 | We then compute RNA velocity vectors for each bootstrap sample and use these vectors to compute RNA velocity confidence intervals.
186 | 
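For intuition, here is a minimal sketch of the two-stage sampling scheme for a single cell with toy counts (the `VelocityCI` class in `velodyn.velocity_ci` implements this across all cells and fits a velocity model to each sample):

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.array([8, 0, 5])   # spliced counts per gene
u = np.array([2, 3, 0])   # unspliced counts per gene
a = np.array([0, 1, 0])   # ambiguous counts per gene
x = s + u + a             # total counts per gene

# stage 1: resample per-gene totals at the observed sequencing depth
x_hat = rng.multinomial(x.sum(), x / x.sum())

# stage 2: redistribute each gene's resampled total across
# spliced / unspliced / ambiguous using the observed proportions
sua_hat = np.zeros((len(x), 3), dtype=int)
for g in range(len(x)):
    if x[g] > 0:
        p = np.array([s[g], u[g], a[g]]) / x[g]
        sua_hat[g] = rng.multinomial(x_hat[g], p)

assert sua_hat.sum() == x.sum()  # library depth is preserved
```
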
187 | ![RNA velocity confidence intervals for each cell.](assets/velocity_confidence.png)
188 | 
189 | ### Usage
190 | 
191 | ```python
192 | from velodyn.velocity_ci import VelocityCI
193 | 
194 | # initialize velocity CI
195 | vci = VelocityCI(
196 |     adata=adata,
197 | )
198 | # sample velocity vectors
199 | # returns [n_iter, Cells, Genes]
200 | velocity_bootstraps = vci.bootstrap_velocity(
201 |     n_iter=n_iter,
202 |     save_counts=out_path,
203 |     embed=adata_embed,  # AnnData with the genes of interest and a relevant embedding
204 | )
205 | ```
206 | 
--------------------------------------------------------------------------------
/assets/change_in_pseudotime.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calico/velodyn/b98a1d15a031feff48479dc4e2963c4f62ba07d6/assets/change_in_pseudotime.png
--------------------------------------------------------------------------------
/assets/divergence_map.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calico/velodyn/b98a1d15a031feff48479dc4e2963c4f62ba07d6/assets/divergence_map.png
--------------------------------------------------------------------------------
/assets/phase_simulations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calico/velodyn/b98a1d15a031feff48479dc4e2963c4f62ba07d6/assets/phase_simulations.png
--------------------------------------------------------------------------------
/assets/velocity_confidence.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calico/velodyn/b98a1d15a031feff48479dc4e2963c4f62ba07d6/assets/velocity_confidence.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | anndata>=0.6.22.post1
2 | h5py>=2.10.0
3 | loompy>=2.0.16
4 | matplotlib>=3.0.2
5 | numpy>=1.17.4
6 | pandas>=0.23.4
7 | scanpy>=1.4
8 | scikit-learn>=0.21.3
9 | scipy>=1.2.0
10 | scvelo>=0.1.16.dev41+74978dd
11 | seaborn>=0.9.0
12 | pathos>=0.2.5
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import sys
2 | if sys.version_info < (3, 6,):
3 |     sys.exit('velodyn requires Python >= 3.6')
4 | from pathlib import Path
5 | 
6 | from setuptools import setup, find_packages
7 | 
8 | try:
9 |     from velodyn import __author__, __email__
10 | except ImportError:  # Deps not yet installed
11 |     __author__ = __email__ = ''
12 | 
13 | 
14 | long_description = '''
15 | RNA velocity infers a rate of change for each transcript in an RNA-sequencing experiment based on the ratio of intronic to exonic reads. This inferred velocity vector serves as a prediction for the future transcriptional state of a cell, while the current read counts serve as a measurement of the instantaneous state. Qualitative analysis of RNA velocity has been used to establish the order of gene expression states in a sequence, but quantitative analysis has generally been lacking.\n
16 | \n
17 | velodyn adopts formalisms from dynamical systems to provide a quantitative framework for RNA velocity analysis.
The tools provided by velodyn along with their associated usage are described below. All velodyn tools are designed to integrate with the scanpy ecosystem and anndata structures.\n
18 | \n
19 | We have released velodyn in association with a recent pre-print. Please cite our pre-print if you find velodyn useful for your work.\n
20 | \n
21 | Differentiation reveals the plasticity of age-related change in murine muscle progenitors\n
22 | Jacob C Kimmel, David G Hendrickson, David R Kelley\n
23 | bioRxiv 2020.03.05.979112; doi: https://doi.org/10.1101/2020.03.05.979112
24 | '''
25 | 
26 | setup(
27 |     name='velodyn',
28 |     version='0.1.0',
29 |     description='Dynamical systems approaches for RNA velocity analysis',
30 |     long_description=long_description,
31 |     url='http://github.com/calico/velodyn',
32 |     author=__author__,
33 |     author_email=__email__,
34 |     license='Apache',
35 |     python_requires='>=3.6',
36 |     install_requires=[
37 |         l.strip() for l in
38 |         Path('requirements.txt').read_text('utf-8').splitlines()
39 |     ],
40 |     packages=find_packages(),
41 |     classifiers=[
42 |         'Intended Audience :: Science/Research',
43 |         'Topic :: Scientific/Engineering :: Bio-Informatics',
44 |     ],
45 | )
46 | 
--------------------------------------------------------------------------------
/velodyn/__init__.py:
--------------------------------------------------------------------------------
1 | __author__ = 'Jacob C. Kimmel'
2 | __email__ = 'jacobkimmel@gmail.com'
3 | __version__ = '0.1'
4 | 
5 | # populate the namespace so top level imports work
6 | # e.g.
7 | # >> from velodyn.velocity_divergence import compute_div
8 | from . import velocity_ci, velocity_divergence, velocity_dpst, velocity_dynsys
--------------------------------------------------------------------------------
/velodyn/velocity_ci.py:
--------------------------------------------------------------------------------
1 | r"""Generate confidence intervals for RNA velocity models by bootstrapping
2 | across reads.
3 | 
4 | Our bootstrapping procedure is as follows:
5 | 
6 | 1. Given a spliced count matrix ([Cells, Genes]) S and an unspliced matrix U,
7 | create a total counts matrix X = S + U.
8 | 2.1 For each cell X_i \in X, fit a multinomial distribution. Sample D (depth) reads
9 | from each multinomial to create a sampled count distribution across genes \hat X_i.
10 | 2.2 For each gene g in \hat X_i, fit a binomial distribution Binom(n=\hat X_ig, p=\frac{S_ig}{X_ig})
11 | which represents the distribution of spliced vs. unspliced counts.
12 | 2.3 Sample an estimate of the spliced counts for X_ig, \hat S_ig ~ Binom(n=\hat X_ig, p=S_ig/X_ig).
13 | Compute the conjugate unspliced read count \hat U_ig = \hat X_ig - \hat S_ig.
14 | 3. Given the complete bootstrapped samples \hat S, \hat U, estimate a bootstrapped
15 | velocity vector for consideration.
16 | 
17 | Bootstrap samples of cell counts therefore have the same number of counts as the original
18 | cell, preventing any issues due to differing library depths:
19 | 
20 | \sum_i \sum_j X_{ij} \equiv \sum_i \sum_j \hat X_{ij}
21 | 
22 | """
23 | import numpy as np
24 | import anndata
25 | import scvelo as scv
26 | import time
27 | import os.path as osp
28 | import argparse
29 | import multiprocessing
30 | 
31 | 
32 | class VelocityCI(object):
33 |     """Compute confidence intervals for RNA velocity vectors
34 | 
35 |     Attributes
36 |     ----------
37 |     adata : anndata.AnnData
38 |         [Cells, Genes] experiment with spliced and unspliced read
39 |         matrices in `.layers` as "spliced", "unspliced", "ambiguous".
40 |         `.X` should contain raw count values, rather than transformed
41 |         counts.
42 |     S : np.ndarray
43 |         [Cells, Genes] spliced read counts.
44 |     U : np.ndarray
45 |         [Cells, Genes] unspliced read counts.
46 |     A : np.ndarray
47 |         [Cells, Genes] ambiguous read counts.
48 | 
49 |     Methods
50 |     -------
51 |     _sample_abundance_profile(x)
52 |         sample a total read count vector from a multinomial fit
53 |         to the observed count vector `x`.
54 |     _sample_spliced_unspliced(s, u, a, x_hat)
55 |         sample spliced, unspliced, and ambiguous read counts from
56 |         a multinomial given a sample of total read counts `x_hat`
57 |         and observed `s`pliced, `u`nspliced, `a`mbiguous counts.
58 |     _sample_matrices()
59 |         samples a matrix of spliced, unspliced and ambiguous read
60 |         counts for all cells and genes in `.adata`.
61 |     _fit_velocity(SUA_hat,)
62 |         fits a velocity model to sampled spliced, unspliced counts
63 |         in an output from `_sample_matrices()`
64 |     bootstrap_velocity(n_iter, embed)
65 |         generate bootstrap samples of RNA velocity estimates using
66 |         `_sample_matrices` and `_fit_velocity` sequentially.
67 | 
68 |     Notes
69 |     -----
70 |     Parallelization requires use of shared ctypes to avoid copying our
71 |     large data arrays for each child process. See `_sample_matrices` for
72 |     a discussion of the relevant considerations and solutions.
73 |     Due to this issue, we have modified `__getstate__` such that pickling
74 |     this object will not preserve all of the relevant data.
75 |     """
76 | 
77 |     def __init__(
78 |         self,
79 |         adata: anndata.AnnData,
80 |     ) -> None:
81 |         """Compute confidence intervals for RNA velocity vectors
82 | 
83 |         Parameters
84 |         ----------
85 |         adata : anndata.AnnData
86 |             [Cells, Genes] experiment with spliced and unspliced read
87 |             matrices in `.layers` as "spliced", "unspliced", "ambiguous".
88 |             `.X` should contain raw count values, rather than transformed
89 |             counts.
90 | 
91 |         Returns
92 |         -------
93 |         None.
94 |         """
95 |         # check that all necessary layers are present
96 |         if 'spliced' not in adata.layers.keys():
97 |             msg = 'spliced matrix must be available in `adata.layers`.'
98 |             raise ValueError(msg)
99 |         if 'unspliced' not in adata.layers.keys():
100 |             msg = 'unspliced matrix must be available in `adata.layers`.'
101 |             raise ValueError(msg)
102 |         if 'ambiguous' not in adata.layers.keys():
103 |             msg = 'ambiguous matrix must be available in `adata.layers`.'
104 |             raise ValueError(msg)
105 | 
106 |         # copy relevant layers in memory to avoid altering the original
107 |         # input
108 |         self.adata = adata
109 |         self.S = adata.layers['spliced'].copy()
110 |         self.U = adata.layers['unspliced'].copy()
111 |         self.A = adata.layers['ambiguous'].copy()
112 | 
113 |         # convert arrays to dense format if they are sparse
114 |         if type(self.S) != np.ndarray:
115 |             try:
116 |                 self.S = self.S.toarray()
117 |             except (ValueError, AttributeError):
118 |                 msg = 'self.S was not np.ndarray, failed .toarray()'
119 |                 print(msg)
120 | 
121 |         if type(self.U) != np.ndarray:
122 |             try:
123 |                 self.U = self.U.toarray()
124 |             except (ValueError, AttributeError):
125 |                 msg = 'self.U was not np.ndarray, failed .toarray()'
126 |                 print(msg)
127 | 
128 |         if type(self.A) != np.ndarray:
129 |             try:
130 |                 self.A = self.A.toarray()
131 |             except (ValueError, AttributeError):
132 |                 msg = 'self.A was not np.ndarray, failed .toarray()'
133 |                 print(msg)
134 | 
135 |         # here, `X` is the total number of counts per feature regardless
136 |         # of the region where the reads map
137 |         self.X = self.S + self.U + self.A
138 |         self.data_shape = self.X.shape
139 |         assert type(self.X) == np.ndarray
140 | 
141 |         # set normalization scale for velocity fitting
142 |         self.counts_per_cell_after = 1e4
143 | 
144 |         return
145 | 
146 |     def __getstate__(self,) -> dict:
147 |         """
148 |         Override the default `__getstate__` behavior
149 |         so we do not pickle huge arrays.
150 | 
151 |         Returns
152 |         -------
153 |         d : dict
154 |             object state dictionary, with large arrays removed
155 |             to allow pickling and passage to child processes.
156 | 
157 |         Notes
158 |         -----
159 |         When we perform multiprocessing, we pickle the `VelocityCI`
160 |         class to pass to workers. Here, we remove all large memory
161 |         objects from the `__getstate__` method which is used during
162 |         the pickle process to gather all the relevant components of
163 |         an object in memory. We provide access to a shared buffer
164 |         with these objects to each worker to avoid copying them.
165 |         """
166 |         d = dict(self.__dict__)
167 |         for attr in ['X', 'S', 'U', 'A']:
168 |             d.pop(attr, None)
169 |             d.pop(attr + '_batch', None)  # batch arrays may not exist yet
170 |         large_arr = ['adata', 'SUA_hat', 'embed', 'velocity_estimates']
171 |         for k in large_arr:
172 |             if k in d.keys():
173 |                 del d[k]
174 |         return d
175 | 
176 |     def _sample_abundance_profile(
177 |         self,
178 |         x: np.ndarray,
179 |     ) -> np.ndarray:
180 |         """Given an observed mRNA abundance profile, fit a multinomial
181 |         distribution and randomly sample a corresponding profile.
182 | 
183 |         Parameters
184 |         ----------
185 |         x : np.ndarray
186 |             [Genes,] observed mRNA counts vector.
187 | 
188 |         Returns
189 |         -------
190 |         x_hat : np.ndarray
191 |             [Genes,] a randomly sampled abundance profile,
192 |             given the multinomial distribution specified by `x`.
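
        Examples
        --------
        A minimal sketch; `vci` is assumed to be an initialized
        `VelocityCI` object, and the draw is random, so only the
        total count is guaranteed to match.

        >>> x = np.array([5., 3., 2.])
        >>> x_hat = vci._sample_abundance_profile(x)  # doctest: +SKIP
        >>> x_hat.sum() == x.sum()  # doctest: +SKIP
        True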
193 | """ 194 | # we need to instantiate a local random state to ensure 195 | # each multiprocess thread generates true random numbers 196 | local_rnd = np.random.RandomState() 197 | # cast everything to `np.float64` before operations due to a 198 | # `numpy` bug 199 | # https://github.com/numpy/numpy/issues/8317 200 | x = x.astype(np.float64) 201 | # compute relative abundance profile as feature proportions 202 | pvals = x / np.sum(x) 203 | # sample a count distribution from the multinomial 204 | x_hat = local_rnd.multinomial( 205 | n=int(np.sum(x)), 206 | pvals=pvals, 207 | ) 208 | return x_hat 209 | 210 | def _sample_spliced_unspliced( 211 | self, 212 | s: np.ndarray, 213 | u: np.ndarray, 214 | a: np.ndarray, 215 | x_hat: np.ndarray, 216 | ) -> np.ndarray: 217 | """Sample the proportion of spliced/unspliced reads for a 218 | randomly sampled mRNA profile given observed spliced and 219 | unspliced read counts. 220 | 221 | Parameters 222 | ---------- 223 | s : np.ndarray 224 | [Genes,] observed spliced read counts for each gene. 225 | u : np.ndarray 226 | [Genes,] observed unspliced read counts for each gene. 227 | a : np.ndarray 228 | [Genes,] ambiguous read counts for each gene. 229 | x_hat : np.ndarray 230 | [Genes,] sampled total gene counts profile. 231 | 232 | Returns 233 | ------- 234 | sua_hat : np.ndarray 235 | [Genes, (Spliced, Unspliced, Ambiguous)] read counts 236 | randomly sampled from a multinomial. 237 | """ 238 | # we need to instantiate a local random state to ensure 239 | # each multiprocess thread generates true random numbers 240 | local_rnd = np.random.RandomState() 241 | # Genes, (Spliced, Unspliced, Ambiguous) 242 | sua_hat = np.zeros((len(x_hat), 3)) 243 | # compute total reads per feature 244 | x = s + u + a 245 | x = x.astype(np.float64) 246 | 247 | # for each gene, sample the proportion of counts that originate 248 | # from spliced, unspliced, or ambiguous regions using a multinomial 249 | # distribution parameterized with the observed proportions 250 | for g in range(len(x_hat)): 251 | 252 | if x[g] == 0: 253 | sua_hat[g, :] = 0 254 | continue 255 | 256 | pvals = np.array([s[g], u[g], a[g]], dtype=np.float64) / x[g] 257 | sua_hat[g, :] = local_rnd.multinomial( 258 | n=x_hat[g], 259 | pvals=pvals, 260 | ) 261 | 262 | return sua_hat 263 | 264 | def _sample_cell(self, 265 | i: int, 266 | ) -> np.ndarray: 267 | """Draw samples for a single cell. 268 | 269 | Parameters 270 | ---------- 271 | i : int 272 | cell index in `.X, .S, .U, .A` matrices. 273 | 274 | Returns 275 | ------- 276 | sua_hat : np.ndarray 277 | [Genes, (Spliced, Unspliced, Ambig.)] for a single 278 | cell at index `i` in `.X`, ... 279 | 280 | Notes 281 | ----- 282 | This implementation allows for simple parallelization with 283 | a map across the cell indices. 
284 | """ 285 | # gather the count arrays from a shared `RawArray` 286 | # buffer and reshape them from flat [N*M,] to array 287 | # [N, M] format 288 | X = np.frombuffer( 289 | var_args['X_batch'], 290 | dtype=np.float64, 291 | ).reshape(var_args['data_shape_batch']) 292 | S = np.frombuffer( 293 | var_args['S_batch'], 294 | ).reshape(var_args['data_shape_batch']) 295 | U = np.frombuffer( 296 | var_args['U_batch'], 297 | dtype=np.float64, 298 | ).reshape(var_args['data_shape_batch']) 299 | A = np.frombuffer( 300 | var_args['A_batch'], 301 | dtype=np.float64, 302 | ).reshape(var_args['data_shape_batch']) 303 | 304 | # get the read counts of each type for 305 | # a single cell 306 | 307 | x = X[i, :] # total read counts 308 | s = S[i, :] # spliced read counts 309 | u = U[i, :] # unspliced read counts 310 | a = A[i, :] # ambiguous read counts 311 | 312 | # sample the relative abudance across genes 313 | x_hat = self._sample_abundance_profile( 314 | x=x, 315 | ) 316 | # for each gene, sample the proportion of reads 317 | # originating from each type of region 318 | sua_hat = self._sample_spliced_unspliced( 319 | s=s, 320 | u=u, 321 | a=a, 322 | x_hat=x_hat, 323 | ) 324 | return sua_hat 325 | 326 | def _sample_matrices( 327 | self, 328 | batch_size: int = 256, 329 | ) -> np.ndarray: 330 | """Sample a spliced and unspliced counts matrix 331 | for a bootstrapped velocity vector estimation. 332 | 333 | Parameters 334 | ---------- 335 | batch_size : int 336 | number of cells to sample in parallel. 337 | smaller batches use less RAM. 338 | 339 | Returns 340 | ------- 341 | SUA_hat : np.ndarray 342 | [Cells, Genes, (Spliced, Unspliced, Ambiguous)] 343 | randomly sampled array of read counts assigned 344 | to a splicing status. 345 | 346 | Notes 347 | ----- 348 | `_sample_matrices` uses `multiprocessing` to parallelize 349 | bootstrap simulations. We run into a somewhat tricky issue 350 | do to the size of our source data arrays (`.X, .S, .U, .A`). 351 | The usual approach to launching multiple processes is to use 352 | a `multiprocessing.Pool` to launch child processes, then copy 353 | the relevant data to each process by passing it as arguments 354 | or through pickling of object attributes. 355 | 356 | Here, the size of our arrays means that copying the large matrices 357 | to memory for each child process is (1) memory prohibitive and 358 | (2) really, really slow, defeating the whole purpose of parallelization. 359 | 360 | Here, we've implemented a batch processing solution to preserve RAM. 361 | We also use shared ctype arrays to avoid copying memory across workers. 362 | Use of ctype arrays increases the performance by ~5-fold. From this, we 363 | infer that copying even just the minibatch count arrays across all the 364 | child processes is prohibitively expensive. 365 | 366 | We can create shared ctype arrays using `multiprocessing.sharedctypes` 367 | that allow child processes to reference a single copy of each 368 | relevant array in memory. 369 | Because these data are read-only, we can get away with using 370 | `multiprocessing.RawArray` since we don't need process synchronization 371 | locks or any other sophisticated synchronization. 372 | 373 | Using `RawArray` with child processes in a pool is a little strange. 374 | We can't pass the `RawArray` pointer through a pickle, so we have to 375 | declare the pointers as global variables that get inherited by each 376 | child process through use of an `initializer` function in the pool. 
377 |         We also have to ensure that our parent object `__getstate__` function
378 |         doesn't contain any of these large arrays, so that they aren't
379 |         accidentally pickled in with the class methods. To fix that, we modify
380 |         `__getstate__` above to remove large attributes from the object dict.
381 |         """
382 |         # [Cells, Genes, (Spliced, Unspliced, Ambiguous)]
383 |         SUA_hat = np.zeros(
384 |             self.X.shape + (3,)
385 |         )
386 |         # compute the total number of batches to use
387 |         n_batches = int(np.ceil(self.X.shape[0]/batch_size))
388 | 
389 |         batch_idx = 0
390 |         for batch in range(n_batches):
391 |             end_idx = min(batch_idx+batch_size, self.X.shape[0])
392 | 
393 |             # set batch specific count arrays as attributes
394 |             for attr in ['X', 'S', 'U', 'A']:
395 |                 attr_all = getattr(self, attr)
396 |                 attr_batch = attr_all[batch_idx:end_idx, :]
397 |                 setattr(self, attr+'_batch', attr_batch)
398 | 
399 |             # generate shared arrays for child processes
400 |             shared_arrays = {'data_shape_batch': self.X_batch.shape}
401 |             for attr in ['X_batch', 'S_batch', 'U_batch', 'A_batch']:
402 |                 data = getattr(self, attr)
403 |                 # create the shared array
404 |                 # RawArray will only take a flat, 1D array
405 |                 # so we create it with as many elements as
406 |                 # our desired data
407 |                 shared = multiprocessing.RawArray(
408 |                     'd',  # doubles
409 |                     int(np.prod(data.shape)),
410 |                 )
411 |                 # load our new shared array into a numpy frame
412 |                 # and copy data into it after reshaping
413 |                 shared_np = np.frombuffer(
414 |                     shared,
415 |                     dtype=np.float64,
416 |                 )
417 |                 shared_np = shared_np.reshape(data.shape)
418 |                 # copy data into the new shared buffer
419 |                 # this is reflected in `shared`, even though we're
420 |                 # copying to the numpy frame here
421 |                 np.copyto(shared_np, data)
422 | 
423 |                 shared_arrays[attr] = shared
424 | 
425 |             # create a global dictionary to hold arguments
426 |             # we pass to each worker using an initializer.
427 |             # this is necessary because we can't pass `RawArray`
428 |             # in a pickled object (e.g. as an attribute of `self`)
429 |             global var_args
430 |             var_args = {}
431 | 
432 |             # this method is called after each worker is initialized
433 |             # and sets all of the shared arrays as part of the global
434 |             # variable `var_args`
435 |             def init_worker(shared_arrays):
436 |                 for k in shared_arrays:
437 |                     var_args[k] = shared_arrays[k]
438 | 
439 |             start = time.time()
440 |             print(f'Drawing bootstrapped samples, batch {batch:04}...')
441 |             with multiprocessing.Pool(
442 |                     initializer=init_worker,
443 |                     initargs=(shared_arrays,)) as P:
444 |                 results = P.map(
445 |                     self._sample_cell,
446 |                     range(self.X_batch.shape[0]),
447 |                 )
448 | 
449 |             # [Cells, Genes, (Spliced, Unspliced, Ambiguous)]
450 |             batch_SUA_hat = np.stack(results, 0)
451 |             SUA_hat[batch_idx:end_idx, :, :] = batch_SUA_hat
452 |             batch_idx += batch_size
453 | 
454 |             end = time.time()
455 |             print('Duration: ', end-start)
456 | 
457 |         return SUA_hat
458 | 
459 |     def _fit_velocity(
460 |         self,
461 |         SUA_hat: np.ndarray,
462 |         velocity_mode: str = 'deterministic',
463 |     ) -> np.ndarray:
464 |         """Fit a deterministic RNA velocity model to the
465 |         bootstrapped count matrices.
466 | 
467 |         Parameters
468 |         ----------
469 |         SUA_hat : np.ndarray
470 |             [Cells, Genes, (Spliced, Unspliced, Ambiguous)]
471 |             randomly sampled array of read counts assigned
472 |             to a splicing status.
473 |         velocity_mode : str
474 |             mode argument for `scvelo.tl.velocity`.
475 |             one of ("deterministic", "stochastic", "dynamical").
476 | 
477 |         Returns
478 |         -------
479 |         velocity : np.ndarray
480 |             [Cells, Genes] RNA velocity estimates.
481 |         """
482 |         dtype = np.float64
483 |         # create an AnnData object from a bootstrap sample
484 |         # of counts
485 |         boot = anndata.AnnData(
486 |             X=SUA_hat[:, :, 0].astype(dtype).copy(),
487 |             obs=self.adata.obs.copy(),
488 |             var=self.adata.var.copy(),
489 |         )
490 |         for i, k in enumerate(['spliced', 'unspliced', 'ambiguous']):
491 |             boot.layers[k] = SUA_hat[:, :, i].astype(dtype)
492 | 
493 |         if self.velocity_prefilter_genes is not None:
494 |             # filter genes to match a pre-existing velocity computation
495 |             # this is useful for e.g. embedding in a common PC space
496 |             # with the observed velocity
497 |             boot = boot[:, self.velocity_prefilter_genes].copy()
498 | 
499 |         # normalize
500 |         scv.pp.normalize_per_cell(
501 |             boot,
502 |             counts_per_cell_after=self.counts_per_cell_after,
503 |         )
504 | 
505 |         # filter genes as in the embedding
506 |         if hasattr(self, 'embed'):
507 |             # if an embedded AnnData is provided
508 |             # subset to genes used for the original embedding
509 |             cell_bidx = np.array([
510 |                 x in self.embed.obs_names for x in boot.obs_names
511 |             ])
512 | 
513 |             boot = boot[:, self.embed.var_names].copy()
514 |             boot = boot[cell_bidx, :].copy()
515 |             print(
516 |                 'Subset bootstrap samples to embedding dims: ',
517 |                 boot.shape,
518 |             )
519 |         else:
520 |             msg = 'must provide an embedding object containing\n'
521 |             msg += 'cells and genes to use for velocity estimation.'
522 |             raise ValueError(msg)
523 | 
524 |         # log1p only the `.X` layer, leaving `.layers` untouched.
525 |         scv.pp.log1p(boot)
526 | 
527 |         # fit the velocity model deterministically, following the original
528 |         # RNA velocity publication
529 |         scv.pp.pca(boot, use_highly_variable=False)
530 |         scv.pp.moments(boot, n_pcs=30, n_neighbors=100)
531 |         scv.tl.velocity(boot, mode=velocity_mode)
532 | 
533 |         return boot.layers['velocity']
534 | 
535 |     def bootstrap_velocity(
536 |         self,
537 |         n_iter: int = 100,
538 |         embed: anndata.AnnData = None,
539 |         velocity_prefilter_genes: list = None,
540 |         verbose: bool = False,
541 |         save_counts: str = None,
542 |         **kwargs,
543 |     ) -> np.ndarray:
544 |         """
545 |         Generate bootstrap estimates of the RNA velocity for
546 |         each cell and gene.
547 | 
548 |         Parameters
549 |         ----------
550 |         n_iter : int
551 |             number of bootstrap iterations to perform.
552 |         embed : anndata.AnnData, optional
553 |             [Cells, Genes] experiment describing the genes of interest
554 |             and containing a relevant embedding for projection of
555 |             velocity vectors.
556 |         velocity_prefilter_genes : list
557 |             genes selected by `scv.pp.filter_genes` in the embedding object
558 |             before normalization. often selected with `min_shared_counts`.
559 |             it is important to carry over this prefiltering step to ensure
560 |             that normalization is comparable to the original embedding.
561 |         verbose : bool
562 |             use verbose stdout printing.
563 |         save_counts : str, optional
564 |             save sampled count matrices to the specified path as
565 |             `sampled_counts_{_iter:04}.npy` with shape
566 |             [Sample, Cells, Genes, (Spliced, Unspliced, Ambig.)].
567 |         **kwargs passed to `_sample_matrices()`.
568 | 
569 |         Returns
570 |         -------
571 |         velocity : np.ndarray
572 |             [Sample, Cells, Genes] bootstrap estimates of RNA
573 |             velocity for each cell and gene.
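
        Examples
        --------
        A sketch of a typical call, assuming `adata` carries the
        required layers and `adata_embed` is a prepared embedding
        object as described in the README:

        >>> vci = VelocityCI(adata=adata)  # doctest: +SKIP
        >>> V = vci.bootstrap_velocity(
        ...     n_iter=100,
        ...     embed=adata_embed,
        ... )  # doctest: +SKIP
        >>> V.shape[0]  # doctest: +SKIP
        100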
574 | """ 575 | # use genes in an embedding object if provided, otherwise 576 | # get the n_top_genes most variable genes 577 | if embed is not None: 578 | self.embed = embed 579 | embed_genes = self.embed.shape[1] 580 | else: 581 | embed_genes = self.n_top_genes 582 | 583 | if velocity_prefilter_genes is not None: 584 | self.velocity_prefilter_genes = velocity_prefilter_genes 585 | else: 586 | self.velocity_prefilter_genes = None 587 | 588 | # store velocity estimates for each gene 589 | # [Iterations, Cells, Genes] 590 | velocity = np.zeros((n_iter, self.embed.shape[0], embed_genes)) 591 | 592 | for _iter in range(n_iter): 593 | if verbose: 594 | print('Beginning sampling for iteration %03d' % _iter) 595 | 596 | # sample a counts matrix 597 | SUA_hat = self._sample_matrices(**kwargs) 598 | 599 | if save_counts is not None: 600 | # save the raw counts sample to disk 601 | np.save( 602 | osp.join( 603 | save_counts, 604 | f'sampled_counts_{_iter:04}.npy', 605 | ), 606 | SUA_hat, 607 | ) 608 | 609 | if verbose: 610 | print('Sampling complete.') 611 | print('Fitting velocity model...') 612 | # fit a velocity model to the sampled counts matrix 613 | # yielding an estimate of velocity for each gene 614 | iter_velo = self._fit_velocity( 615 | SUA_hat=SUA_hat, 616 | ) 617 | velocity[_iter, :, :] = iter_velo 618 | if verbose: 619 | print('Velocity fit, iteration %03d complete.' % _iter) 620 | 621 | self.velocity_estimates = velocity 622 | return velocity 623 | 624 | def bootstrap_vectors( 625 | self, 626 | embed: anndata.AnnData = None, 627 | ) -> np.ndarray: 628 | """ 629 | Generate embedded velocity vectors for each bootstrapped sample 630 | of spliced/unspliced counts. 631 | 632 | Returns 633 | ------- 634 | velocity_embeddings : np.ndarray 635 | [n_iter, Cells, EmbeddingDims] RNA velocity vectors 636 | for each bootstrap sampled set of counts in the 637 | provided PCA embedding space. 638 | """ 639 | if embed is not None: 640 | self.embed = embed 641 | 642 | if not hasattr(self, 'embed'): 643 | msg = 'must provide an `embed` argument.' 644 | raise AttributeError(msg) 645 | 646 | # copy the embedding object to use for low-rank embedding 647 | project = self.embed.copy() 648 | # remove any extant `velocity_settings` to use defaults. 649 | # in the current `scvelo`, using non-default settings will throw a silly 650 | # error in `scv.tl.velocity_embedding`. 651 | if 'velocity_settings' in project.uns.keys(): 652 | project.uns.pop('velocity_settings') 653 | 654 | # for each velocity profile estimate, compute the corresponding 655 | # PCA embedding of those vectors using "direct_projection", 656 | # aka as standard matrix multiplication. 657 | # 658 | # the `scvelo` nearest neighbor projection method introduces 659 | # several assumptions that we do not wish to inherit here. 
660 | velocity_embeddings = [] 661 | for _iter in range(self.velocity_estimates.shape[0]): 662 | V = self.velocity_estimates[_iter, :, :] 663 | project.layers['velocity'] = V 664 | 665 | scv.tl.velocity_embedding( 666 | project, 667 | basis='pca', 668 | direct_pca_projection=True, 669 | autoscale=False, # do not adjust vectors for aesthetics 670 | ) 671 | velocity_embeddings.append( 672 | project.obsm['velocity_pca'], 673 | ) 674 | velocity_embeddings = np.stack( 675 | velocity_embeddings, 676 | axis=0, 677 | ) 678 | self.velocity_embeddings = velocity_embeddings 679 | return velocity_embeddings 680 | 681 | def compute_ci(self,) -> np.ndarray: 682 | """ 683 | Compute confidence intervals for the velocity vector 684 | on each cell from bootstrap samples of embedded velocity vectors. 685 | 686 | Returns 687 | ------- 688 | velocity_intervals : np.ndarray 689 | [Cells, EmbeddingDims, (Mean, Std, LowerCI, UpperCI)] 690 | estimates of the mean and confidence interval around the 691 | RNA velocity vector computed for each cell. 692 | """ 693 | if not hasattr(self, 'velocity_embeddings'): 694 | msg = 'must run `bootstrap_vectors` first to generate vector samples.' 695 | raise AttributeError(msg) 696 | 697 | # [Cells, Dims, (Mean, SD, Lower, Upper)] 698 | self.velocity_intervals = np.zeros( 699 | self.velocity_embeddings.shape[1:] + (4,) 700 | ) 701 | # for each cell, compute the mean, std, and CI for 702 | # each dimension in the embedding 703 | # this provides a hypersphere of confidence for cell state transitions 704 | # in the embedding space 705 | for j in range(self.velocity_embeddings.shape[1]): 706 | cell = self.velocity_embeddings[:, j, :] # Iter, Dims 707 | mean = np.mean(cell, axis=0) # Dims 708 | std = np.std(cell, axis=0) # Dims 709 | # compute the 95% CI assuming normality 710 | l_ci = mean - 1.96*std 711 | u_ci = mean + 1.96*std 712 | self.velocity_intervals[j, :, 0] = mean 713 | self.velocity_intervals[j, :, 1] = std 714 | self.velocity_intervals[j, :, 2] = l_ci 715 | self.velocity_intervals[j, :, 3] = u_ci 716 | 717 | return self.velocity_intervals 718 | 719 | 720 | ################################################## 721 | # main 722 | ################################################## 723 | 724 | 725 | def add_parser_arguments(parser): 726 | """Add arguments to an `argparse.ArgumentParser`.""" 727 | parser.add_argument( 728 | '--data', 729 | type=str, 730 | help='path to AnnData object with "spliced", "unspliced", "ambiguous" in `.layers`', 731 | ) 732 | parser.add_argument( 733 | '--out_path', 734 | type=str, 735 | help='output path for velocity bootstrap samples.' 736 | ) 737 | parser.add_argument( 738 | '--n_iter', 739 | type=int, 740 | default=100, 741 | help='number of bootstrap iterations to perform.' 
742 |     )
743 |     return parser
744 | 
745 | 
746 | def make_parser():
747 |     """Generate an `argparse.ArgumentParser`."""
748 |     parser = argparse.ArgumentParser(
749 |         description='Compute confidence intervals for RNA velocity by molecular bootstrapping'
750 |     )
751 |     parser = add_parser_arguments(parser)
752 |     return parser
753 | 
754 | 
755 | def main():
756 |     parser = make_parser()
757 |     args = parser.parse_args()
758 | 
759 |     # load anndata
760 |     print('Loading data...')
761 |     adata = anndata.read_h5ad(args.data)
762 |     print(f'{adata.shape[0]} cells and {adata.shape[1]} genes loaded.')
763 | 
764 |     # check for layers
765 |     for k in ['spliced', 'unspliced', 'ambiguous']:
766 |         if k not in adata.layers.keys():
767 |             msg = f'{k} not found in `adata.layers`'
768 |             raise ValueError(msg)
769 | 
770 |     # initialize velocity bootstrap object
771 |     print('\nBootstrap sampling velocity...\n')
772 |     vci = VelocityCI(
773 |         adata=adata,
774 |     )
775 | 
776 |     # sample velocity vectors
777 |     velocity_bootstraps = vci.bootstrap_velocity(
778 |         n_iter=args.n_iter,
779 |         save_counts=args.out_path,
780 |     )
781 | 
782 |     # save bootstrap samples to disk
783 |     np.save(
784 |         osp.join(args.out_path, 'velocity_bootstrap_samples.npy'),
785 |         velocity_bootstraps,
786 |     )
787 |     print('Done.')
788 |     return
789 | 
790 | 
791 | if __name__ == '__main__':
792 |     main()
793 | 
--------------------------------------------------------------------------------
/velodyn/velocity_divergence.py:
--------------------------------------------------------------------------------
1 | """Compute divergence maps from RNA velocity fields"""
2 | import numpy as np
3 | import anndata
4 | 
5 | from sklearn.neighbors import NearestNeighbors
6 | from scipy.stats import norm as normal
7 | 
8 | import matplotlib
9 | import matplotlib.pyplot as plt
10 | import seaborn as sns
11 | 
12 | 
13 | # modified from
14 | # https://github.com/theislab/scvelo/blob/master/scvelo/plotting/velocity_embedding_grid.py
15 | def compute_velocity_on_grid(
16 |     X_emb: np.ndarray,
17 |     V_emb: np.ndarray,
18 |     density: float = None,
19 |     smooth: float = None,
20 |     n_neighbors: int = None,
21 |     min_mass: float = None,
22 |     n_grid_points: int = 50,
23 |     adjust_for_stream: bool = False,
24 |     grid_min_max: tuple = None,
25 | ) -> (np.ndarray, np.ndarray):
26 |     """
27 |     Compute a grid of velocity vectors in gene expression space
28 |     where each vector in the grid is a Gaussian weighted average of
29 |     neighboring observed cell vectors.
30 | 
31 |     Parameters
32 |     ----------
33 |     X_emb : np.ndarray
34 |         [Cells, (embedding0, embedding1)] cell coordinates in the
35 |         embedding.
36 |     V_emb : np.ndarray
37 |         [Cells, (embedding0, embedding1)] cell velocities in the
38 |         embedding.
39 |     density : float
40 |         [0, 1.] proportion of n_grid_points to use.
41 |     smooth : float
42 |         smoothing parameter for the Gaussian kernel.
43 |     n_neighbors : int
44 |         number of neighbors to consider.
45 |     min_mass : float
46 |         minimum probability mass to return a value for a grid cell.
47 |     n_grid_points : int
48 |         number of grid points along each dimension.
49 |     adjust_for_stream : bool
50 |         adjust grid velocities to be compatible with stream plots.
51 |     grid_min_max : tuple
52 |         ((min, max), (min, max)) values for coarse-graining grid
53 |         coordinates. set manually to ensure coarse-grained coordinates
54 |         are consistent across samples passed to `X_emb`.
55 | 
56 |     Returns
57 |     -------
58 |     X_grid : np.ndarray
59 |         [n_grid_points**2, 2] locations of each vector
60 |         in embedding space.
61 |     V_grid : np.ndarray
62 |         [n_grid_points**2, 2] RNA velocity vectors in
63 |         the local neighborhood at a series of grid points.
64 |     """
65 |     # remove invalid cells
66 |     idx_valid = np.isfinite(X_emb.sum(1) + V_emb.sum(1))
67 |     X_emb = X_emb[idx_valid]
68 |     V_emb = V_emb[idx_valid]
69 | 
70 |     # prepare grid
71 |     n_obs, n_dim = X_emb.shape
72 |     density = 1 if density is None else density
73 |     smooth = .5 if smooth is None else smooth
74 | 
75 |     # Generates a linearly spaced grid from the minimum to maximum
76 |     # embedding coordinate along each dimension
77 |     # the number of grid locations is specified with `n_grid_points`
78 |     grs = []
79 |     for dim_i in range(n_dim):
80 |         if grid_min_max is None:
81 |             m, M = np.min(X_emb[:, dim_i]), np.max(X_emb[:, dim_i])
82 |             m = m - .01 * np.abs(M - m)
83 |             M = M + .01 * np.abs(M - m)
84 |         else:
85 |             m, M = grid_min_max[dim_i]
86 |         # `np.linspace` requires an integer number of points
87 |         gr = np.linspace(m, M, int(n_grid_points * density))
88 |         grs.append(gr)
89 | 
90 |     meshes_tuple = np.meshgrid(*grs)
91 |     X_grid = np.vstack([i.flat for i in meshes_tuple]).T
92 | 
93 |     # estimate grid velocities
94 |     # find nearest neighbors to each grid point using `n_neighbors`
95 |     # determine their relative distances
96 |     if n_neighbors is None:
97 |         n_neighbors = int(n_obs/50)
98 |     nn = NearestNeighbors(n_neighbors=n_neighbors, n_jobs=-1)
99 |     nn.fit(X_emb)
100 |     # [GridPoints, Neighbors] distances and array indices of nearest neighbors
101 |     dists, neighs = nn.kneighbors(X_grid)
102 | 
103 |     # weight the contribution of each point with a Gaussian kernel
104 |     # centered on the point of interest
105 | 
106 |     # here, `smooth` is a scaling factor that determines the sigma
107 |     # of the Gaussian, which is the product of the grid spacing (the
108 |     # distance between adjacent grid points) and the scaling parameter
109 |     # defaults to a sigma == 0.5*GridSpacing
110 |     scale = np.mean([(g[1] - g[0]) for g in grs]) * smooth
111 | 
112 |     # here, we evaluate a weight for each point as the PDF of a Gaussian with
113 |     # the specified scale centered at the point, since we feed in distances
114 |     # rather than coordinates
115 |     weight = normal.pdf(x=dists, scale=scale)  # weight is [GridPoints, Neighbors]
116 | 
117 |     # p_mass stores how much probability mass is near a point
118 |     # if all neighbors are very far away, this will be small
119 |     p_mass = weight.sum(1)  # p_mass is [GridPoints,]
120 | 
121 |     V_grid = (V_emb[neighs] * weight[:, :, None]).sum(1) / \
122 |         np.maximum(1, p_mass)[:, None]
123 | 
124 |     if adjust_for_stream:
125 |         X_grid = np.stack([np.unique(X_grid[:, 0]), np.unique(X_grid[:, 1])])
126 |         ns = int(np.sqrt(len(V_grid[:, 0])))
127 |         V_grid = V_grid.T.reshape(2, ns, ns)
128 | 
129 |         mass = np.sqrt((V_grid ** 2).sum(0))
130 |         V_grid[0][mass.reshape(V_grid[0].shape) < 1e-5] = np.nan
131 |     else:
132 |         if min_mass is None:
133 |             min_mass = np.clip(np.percentile(p_mass, 95) / 100, 1e-2, 1)
134 |         # zero out vectors with little support
135 |         V_grid[p_mass < min_mass] = 0.
136 | 
137 |     return X_grid, V_grid
138 | 
139 | 
140 | def divergence(f):
141 |     r"""
142 |     Computes the divergence of the vector field.
143 | 
144 |     Parameters
145 |     ----------
146 |     f : list of ndarrays
147 |         [D,] each array contains values for one dimension of
148 |         the vector field.
149 | 150 | Returns 151 | ------- 152 | D : np.ndarray 153 | divergence values in the same shape as items in `f`. 154 | 155 | Notes 156 | ----- 157 | The divergence of a vector field :math:`V(x, y)` is given by the sum of 158 | partial derivatives of the d-component with respect to d, where d is either 159 | x or y. 160 | 161 | .. math:: 162 | 163 | \nabla \cdot V = \sum_{d \in \{x, y\}} \partial V_d(x, y) / \partial d 164 | 165 | \nabla \cdot V = \partial V_x(x, y)/\partial x + \partial V_y(x, y)/\partial y 166 | """ 167 | num_dims = len(f) 168 | # for each dimension of the vector field `i`, compute the gradient with 169 | # respect to that dimension and add the results 170 | D = np.ufunc.reduce( 171 | np.add, 172 | [np.gradient(f[num_dims - i - 1], axis=i) for i in range(num_dims)] 173 | ) 174 | return D 175 | 176 | 177 | def compute_div( 178 | adata: anndata.AnnData, 179 | use_rep: str = 'pca', 180 | n_grid_points: int = 30, 181 | return_grid: bool = False, 182 | **kwargs, 183 | ) -> np.ndarray: 184 | """ 185 | Compute divergence in gene expression space for a single 186 | cell experiment. 187 | 188 | Parameters 189 | ---------- 190 | adata : anndata.AnnData 191 | [Cells, Genes] single cell experiment containing velocity 192 | vectors for each cell. 193 | use_rep : str 194 | representation to use for divergence field calculation. 195 | `adata.obsm[f'X_{use_rep}']` and `adata.obsm[f'velocity_{use_rep}']` 196 | must be present. 197 | n_grid_points : int 198 | number of grid points along each dimension. 199 | **kwargs passed to `compute_velocity_on_grid`. 200 | 201 | Returns 202 | ------- 203 | D : np.ndarray 204 | [n_grid_points, n_grid_points] divergence values. 205 | X_grid : np.ndarray, optional 206 | [n_grid_points**2, EmbedDims] grid locations in the embedding. 207 | returned if `return_grid=True`. 208 | V_grid : np.ndarray, optional 209 | [n_grid_points**2, EmbedDims] velocity values at grid locations. 210 | returned if `return_grid=True`. 211 | 212 | See Also 213 | -------- 214 | compute_velocity_on_grid 215 | divergence 216 | """ 217 | # compute a grid of positions and their Gaussian 218 | # weighted velocities across the embedding space 219 | X_grid, V_grid = compute_velocity_on_grid( 220 | adata.obsm[f'X_{use_rep}'][:, :2], 221 | adata.obsm[f'velocity_{use_rep}'][:, :2], 222 | n_grid_points=n_grid_points, 223 | **kwargs, 224 | ) 225 | # reshape the grid points into an [X, Y, 2] matrix 226 | V_spatial = V_grid.reshape( 227 | n_grid_points, 228 | n_grid_points, 229 | 2, 230 | ) 231 | # compute the divergence 232 | D_spatial = divergence([V_spatial[:, :, i] 233 | for i in range(V_spatial.shape[2])]) 234 | if return_grid: 235 | return D_spatial, X_grid, V_grid 236 | 237 | return D_spatial 238 | 239 | 240 | def plot_div( 241 | D_spatial, 242 | pal='PRGn', 243 | center: float = 0., 244 | cbar_label='Divergence', 245 | xticklabels: bool = False, 246 | yticklabels: bool = False, 247 | figsize: tuple = (6, 4), 248 | **kwargs, 249 | ) -> (matplotlib.figure.Figure, matplotlib.axes.Axes): 250 | """Plot a heatmap of the divergence values in an RNA velocity field. 251 | 252 | Parameters 253 | ---------- 254 | D_spatial : np.ndarray 255 | [n_grid_points, n_grid_points] divergence values. 256 | pal : Union[str, matplotlib.colors.Colormap] 257 | color map for divergence colors. can be a matplotlib 258 | named colormap. 259 | center : float 260 | value for centering a divergent colormap. 261 | cbar_label : str 262 | label for the colorbar. 263 | xticklabels : bool 264 | use x-axis tick labels.
265 | yticklabels : bool 266 | use y-axis tick labels. 267 | figsize : tuple 268 | (W, H) of the matplotlib figure. 269 | 270 | Returns 271 | ------- 272 | fig : matplotlib.figure.Figure 273 | ax : matplotlib.axes.Axes 274 | """ 275 | fig, ax = plt.subplots(1, 1, figsize=figsize) 276 | sns.heatmap( 277 | D_spatial, 278 | cmap=pal, 279 | ax=ax, 280 | center=center, 281 | cbar_kws={'label': cbar_label}, 282 | xticklabels=xticklabels, 283 | yticklabels=yticklabels, 284 | **kwargs, 285 | ) 286 | ax.invert_yaxis() 287 | ax.set_xlabel('PC1') 288 | ax.set_ylabel('PC2') 289 | return fig, ax 290 | -------------------------------------------------------------------------------- /velodyn/velocity_dpst.py: -------------------------------------------------------------------------------- 1 | """Compute a change in pseudotime for each cell""" 2 | import numpy as np 3 | import anndata 4 | from sklearn.neighbors import KNeighborsRegressor 5 | from sklearn.model_selection import cross_val_score 6 | 7 | 8 | class dPseudotime(object): 9 | """Compute a change in pseudotime value for each cell 10 | in a single cell experiment. 11 | 12 | Attributes 13 | ---------- 14 | adata : anndata.AnnData 15 | [Cells, Genes] single cell experiment. 16 | use_rep : str 17 | representation to use for predicting pseudotime coordinates. 18 | `adata.obsm[f'X_{use_rep}']`, `adata.obsm[f'velocity_{use_rep}']` 19 | must be present. 20 | pseudotime_var : str 21 | scalar variable in `adata.obs` encoding pseudotime coordinates. 22 | model : sklearn.neighbors.KNeighborsRegressor 23 | k-nearest neighbors regression model for pseudotime prediction. 24 | X : np.ndarray 25 | [Cells, Embedding] observed coordinates in embedding space. 26 | V : np.ndarray 27 | [Cells, Embedding] velocity vectors in embedding space. 28 | y : np.ndarray 29 | [Cells,] pseudotime coordinates. 30 | X_pred : np.ndarray 31 | [Cells, Embedding] predicted future coordinates. 32 | pst_pred : np.ndarray 33 | [Cells,] pseudotime coordinates inferred for positions `X_pred`. 34 | dpst : np.ndarray 35 | [Cells,] change in pseudotime coordinate. 36 | 37 | Methods 38 | ------- 39 | _fit_model 40 | predict_dpst 41 | """ 42 | 43 | def __init__( 44 | self, 45 | adata: anndata.AnnData, 46 | use_rep: str = 'pca', 47 | pseudotime_var: str = 'dpt_pseudotime', 48 | ) -> None: 49 | """Compute a change in pseudotime value for each cell 50 | in a single cell experiment. 51 | 52 | Parameters 53 | ---------- 54 | adata : anndata.AnnData 55 | [Cells, Genes] single cell experiment. 56 | use_rep : str 57 | representation to use for predicting pseudotime coordinates. 58 | `adata.obsm[f'X_{use_rep}']`, `adata.obsm[f'velocity_{use_rep}']` 59 | must be present. 60 | pseudotime_var : str 61 | scalar variable in `adata.obs` encoding pseudotime coordinates. 62 | 63 | Returns 64 | ------- 65 | None. 
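# --- usage sketch for the divergence utilities above (illustrative) ---
# assumes an AnnData `adata` with `X_pca` and `velocity_pca` in `.obsm`,
# e.g. as written by scvelo's `tl.velocity_embedding(adata, basis='pca')`
D, X_grid, V_grid = compute_div(
    adata, use_rep='pca', n_grid_points=30, return_grid=True,
)
fig, ax = plot_div(D, cbar_label='Divergence')
fig.savefig('divergence_map.png', dpi=300)
# --- end sketch ---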
66 | """ 67 | self.adata = adata 68 | self.use_rep = use_rep 69 | self.pseudotime_var = pseudotime_var 70 | 71 | # check that necessary matrices are present 72 | if f'X_{use_rep}' in self.adata.obsm.keys(): 73 | self.X = self.adata.obsm[f'X_{use_rep}'] 74 | else: 75 | msg = f'X_{use_rep} is not in `adata.obsm' 76 | raise ValueError(msg) 77 | 78 | if f'velocity_{use_rep}' in self.adata.obsm.keys(): 79 | self.V = self.adata.obsm[f'velocity_{use_rep}'] 80 | else: 81 | msg = f'velocity_{use_rep} is not in `adata.obsm' 82 | raise ValueError(msg) 83 | 84 | if pseudotime_var in self.adata.obs.columns: 85 | self.y = self.adata.obs[pseudotime_var] 86 | else: 87 | msg = f'{pseudotime_var} is not in `adata.obs' 88 | raise ValueError(msg) 89 | 90 | return 91 | 92 | def _fit_model( 93 | self, 94 | n_neighbors: int = 50, 95 | weights: str = 'distance', 96 | ) -> None: 97 | """Fit a regression model to predict pseudotime coordinates 98 | from the specified embedding. 99 | 100 | Parameters 101 | ---------- 102 | n_neighbors : int 103 | number of neighbors to use for regression model. 104 | weights : str 105 | method to weight neighbor contributions. 106 | passed to `sklearn.neighbors.KNeighborsRegressor`. 107 | 108 | Returns 109 | ------- 110 | None. assigns `self.model`, `self.cv_scores`. 111 | """ 112 | # initialize a simple kNN regressor with multiprocessing 113 | self.model = KNeighborsRegressor( 114 | n_neighbors=n_neighbors, 115 | weights=weights, 116 | n_jobs=-1, 117 | ) 118 | 119 | # perform cross-validation scoring 120 | self.cv_scores = cross_val_score( 121 | self.model, 122 | self.X, 123 | self.y, 124 | cv=5, 125 | ) 126 | print('Cross-validation scores for prediction model:') 127 | print(self.cv_scores) 128 | print('Mean : ', np.mean(self.cv_scores)) 129 | print() 130 | 131 | # fit the final model on all the data 132 | self.model.fit(self.X, self.y) 133 | return 134 | 135 | def predict_dpst( 136 | self, 137 | step_size: float = 0.01, 138 | **kwargs, 139 | ) -> np.ndarray: 140 | """Predict a change in pseudotime coordinate for each cell 141 | in the experiment. 142 | 143 | Parameters 144 | ---------- 145 | step_size : float 146 | step size to use for future cell state predictions. 147 | the RNA velocity vector is scaled by this coefficient 148 | before addition to the current position. 149 | we recommend step sizes smaller than `1`. 150 | **kwargs are passed to `self._fit_model`. 151 | 152 | Returns 153 | ------- 154 | dpst : np.ndarray 155 | [Cells,] change in pseudotime value predicted for each 156 | cell. 157 | Also sets `self.pst_pred`, `self.dpst` atttributes. 
158 | 159 | See Also 160 | -------- 161 | self._fit_model 162 | """ 163 | self._fit_model(**kwargs) 164 | 165 | # the predicted new pseudotime coordinate is the current 166 | # coordinate + the velocity vector, scaled by a step size 167 | self.X_pred = self.X + step_size * self.V 168 | # we predict the new coordinate's pseudotime position 169 | self.pst_pred = self.model.predict(self.X_pred) 170 | # the \Delta pseudotime coordinate is the difference between 171 | # predicted and observed coordinates 172 | self.dpst = self.pst_pred - self.y 173 | return self.dpst 174 | -------------------------------------------------------------------------------- /velodyn/velocity_dynsys.py: -------------------------------------------------------------------------------- 1 | """ 2 | Dynamical systems simulations in RNA velocity space 3 | """ 4 | import numpy as np 5 | from scipy import stats 6 | import anndata 7 | import tqdm 8 | import typing 9 | from typing import Collection 10 | import warnings 11 | # multiprocessing tools. pathos uses `dill` rather than `pickle`, 12 | # which provides more robust serialization. 13 | from pathos.multiprocessing import ProcessPool 14 | from sklearn.neighbors import NearestNeighbors 15 | # plotting 16 | import matplotlib 17 | import matplotlib.pyplot as plt 18 | import seaborn as sns 19 | 20 | 21 | class PhaseSimulation(object): 22 | """Perform phase point simulations in velocity fields. 23 | 24 | Attributes 25 | ---------- 26 | adata : anndata.AnnData 27 | [Cells, Genes] object with precomputed attributes 28 | for RNA velocity in `.layers`. 29 | keys: {velocity, spliced, unspliced}. 30 | vadata : anndata.AnnData 31 | view of `.adata` used for velocity field estimation. 32 | pfield : np.ndarray 33 | [Cells, Features] positions of cells in the velocity field. 34 | vfield : np.ndarray 35 | [Cells, Features] velocities of cells in the velocity field. 36 | starting_points : np.ndarray 37 | [Cells, Features] starting points for phase points in the 38 | velocity field. 39 | v_model : Callable 40 | a model of RNA velocity that predicts velocity given a positional 41 | coordinate in the desired basis. 42 | trajectories : np.ndarray 43 | [PhasePoints, Time, Dimensions, (Position, V_mu, V_sig)] 44 | trajectories of phase points in the velocity field. 45 | boundary_fence : dict 46 | {"min", "max"} specifies fence conditions if the boundary constraint 47 | is set to obey a predefined fence. minimum and maximum values for 48 | each dimension are stored as lists. 49 | timesteps : int 50 | [T,] number of timesteps for phase point evolution. 51 | step_scale : float 52 | scaling factor for phase point steps in the chosen basis. 53 | noise_scale : float 54 | scaling factor for noise introduced during phase point evolution. 55 | defaults to a noiseless simulation. 56 | velocity_k : int 57 | number of nearest neighbors to consider when employing a 'knn' 58 | velocity model. 59 | vknn_method : str 60 | method by which the kNN model computes velocity estimates for 61 | phase points. 62 | "deterministic" -- use the mean of kNN RNA velocity vectors. 63 | "stochastic" -- fit a multivar. Gaussian to kNN vectors and sample. 64 | "knn_random_sample" -- randomly sample an observed vector from kNN. 65 | Methods 66 | ------- 67 | boundary_constraint(position, velocity) 68 | impose a boundary constraint by modifying the predicted position 69 | of an evolving phase point. defaults to an identity function.
70 | v_fxn : callable 71 | returns velocity as a function of position in the 72 | embedding space. takes a [D,] np.ndarray as input, returns 73 | a [D,] np.ndarray. 74 | """ 75 | 76 | def __init__( 77 | self, 78 | adata: anndata.AnnData, 79 | **kwargs, 80 | ) -> None: 81 | """Perform phase point simulations in velocity fields. 82 | 83 | Parameters 84 | ---------- 85 | adata : anndata.AnnData 86 | [Cells, Genes] object with precomputed attributes 87 | for RNA velocity in `.layers`. 88 | keys: {velocity, spliced, unspliced}. 89 | 90 | Returns 91 | ------- 92 | None. 93 | """ 94 | self.adata = adata 95 | if self.adata.raw is not None: 96 | print('`adata.raw` is not `None`.') 97 | print('This can cause indexing issues with some anndata versions.') 98 | print('Consider setting `adata.raw = None`.\n') 99 | 100 | # set the number of nearest neighbors to use when inferring 101 | # phase point velocities 102 | self.velocity_k = 100 103 | if 'velocity_k' in kwargs.keys(): 104 | self.velocity_k = kwargs['velocity_k'] 105 | 106 | # set an identity function as our initial boundary constraint 107 | # until we choose a different one 108 | self.boundary_constraint = self._identity_placeholder 109 | return 110 | 111 | def set_velocity_field( 112 | self, 113 | groupby: str = None, 114 | group: typing.Any = None, 115 | basis: str = 'counts', 116 | ) -> None: 117 | """Set a subset of cells to use when defining the 118 | velocity field. 119 | 120 | Parameters 121 | ---------- 122 | groupby : str 123 | column in `.adata.obs` to use for group selection. 124 | group : Any 125 | value in `groupby` to use for selecting cells. 126 | basis : str 127 | basis for setting the velocity field. must be one 128 | of {'counts', 'pca', 'umap', 'tsne'}. 129 | if not 'counts', must have 'velocity_%s'%basis attribute. 130 | 131 | Returns 132 | ------- 133 | None. Sets `.vadata`. 134 | 135 | Notes 136 | ----- 137 | Generates a view of `.adata` with only the selected 138 | cells, `.vadata`. 139 | Sets the `.vfield` and `.pfield` attribute with selected 140 | cells in the desired basis. 141 | """ 142 | # ensure that arguments are valid 143 | if groupby is not None and group is None: 144 | raise ValueError('Must supply a `group` for cell selection.') 145 | if group is not None and groupby is None: 146 | raise ValueError('Must supply a `groupby` for cell selection.') 147 | 148 | # check that the specified basis is supported 149 | bases = ['counts', 'pca', 'umap', 'tsne'] 150 | if basis not in bases: 151 | raise ValueError('%s is not a valid basis.'
% basis) 152 | 153 | # select the specified group if one is provided 154 | # otherwise use all cells as a single "dummy" group 155 | if groupby is not None and group is not None: 156 | bidx = self.adata.obs[groupby] == group 157 | else: 158 | bidx = np.ones(self.adata.shape[0]).astype(bool) 159 | 160 | # get the relevant cells from the grouping 161 | self.vadata = self.adata[bidx, :].copy() 162 | 163 | # set the velocity field and position field using 164 | # cell observations 165 | if basis == 'counts': 166 | self.vfield = self.vadata.layers['velocity'] 167 | self.pfield = self.vadata.X 168 | else: 169 | self.vfield = self.vadata.obsm['velocity_%s' % basis] 170 | self.pfield = self.vadata.obsm['X_%s' % basis] 171 | 172 | # convert to dense arrays if sparse 173 | # TODO: make downstream ops sparse compatible 174 | if not isinstance(self.vfield, np.ndarray): 175 | self.vfield = self.vfield.toarray() 176 | if not isinstance(self.pfield, np.ndarray): 177 | self.pfield = self.pfield.toarray() 178 | 179 | return 180 | 181 | def _set_starting_point_metadata( 182 | self, 183 | groupby: str = None, 184 | group: typing.Any = None, 185 | ) -> None: 186 | """Set starting points for phase point simulations based 187 | on sample annotations. 188 | 189 | Parameters 190 | ---------- 191 | groupby : str 192 | column in `.adata.obs` to use for group selection. 193 | group : Any 194 | value in `groupby` to use for selecting cells. 195 | 196 | Returns 197 | ------- 198 | None. Sets `.starting_points`. 199 | """ 200 | # check that arguments are valid 201 | if groupby is None or group is None: 202 | raise ValueError('must supply both groupby and group.') 203 | 204 | # set starting points as the designated positions in the 205 | # position field 206 | bidx = self.vadata.obs[groupby] == group 207 | print(f'Found {sum(bidx)} points matching starting criteria.') 208 | self.starting_points = self.pfield[bidx, :] 209 | return 210 | 211 | def _set_starting_point_embedding( 212 | self, 213 | basis: str = None, 214 | borders: tuple = None, 215 | ) -> None: 216 | """Set starting points for phase point simulations based 217 | on embedding locations. 218 | 219 | Parameters 220 | ---------- 221 | basis : str 222 | embedding basis to use for selection. 223 | expects `'X_'+basis` in `.obsm.keys()`. 224 | e.g. 'pca', 'umap', 'tsne'. 225 | borders : tuple 226 | [N,] minimum and maximum values in each dimension of the 227 | embedding to use for starting point selection. 228 | e.g. ((-1, 1), (-3, 1)) for a 2D embedding. 229 | 230 | Returns 231 | ------- 232 | None. sets `.starting_points`. 233 | """ 234 | # check that the basis is present 235 | bindices = [] 236 | if 'X_'+basis not in self.vadata.obsm.keys(): 237 | raise ValueError( 238 | 'X_%s is not an embedding in `.vadata.obsm`.' % basis) 239 | embed = self.vadata.obsm['X_'+basis] 240 | 241 | # get all cells within the borders specified along each dimension 242 | for i, min_max in enumerate(borders): 243 | bidx = np.logical_and( 244 | embed[:, i] > min_max[0], 245 | embed[:, i] < min_max[1], 246 | ) 247 | bindices.append(bidx) 248 | 249 | # use cells that meet all border criteria as starting points 250 | bidx = np.logical_and.reduce(bindices) 251 | self.starting_points = self.pfield[bidx, :] 252 | return 253 | 254 | def _set_starting_point_expression( 255 | self, 256 | genes: Collection[str] = None, 257 | min_expr_levels: Collection[float] = None, 258 | use_raw: bool = True, 259 | ) -> None: 260 | """Set starting points for phase point simulations based 261 | on gene expression levels.
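# --- usage sketch: velocity field + starting points (illustrative) ---
# the column names, group label, and borders here are hypothetical;
# `set_starting_point` below dispatches to the `_set_starting_point_*`
# helpers defined in this class
sim = PhaseSimulation(adata)
sim.set_velocity_field(groupby='cell_type', group='progenitor', basis='pca')
sim.set_starting_point(
    method='embedding', basis='pca', borders=((-10., 0.), (-5., 5.)),
)
# --- end sketch ---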
262 | 263 | Parameters 264 | ---------- 265 | genes : Collection[str] 266 | [N,] gene names to use for starting point selection. 267 | min_expr_levels : Collection[float] 268 | [N,] minimum expression level for each gene. 269 | use_raw : bool 270 | use the `.adata.raw.X` attribute for gene expression levels 271 | instead of `.adata.X`. 272 | 273 | Returns 274 | ------- 275 | None. sets `.starting_points`. 276 | """ 277 | # check argument validity 278 | if genes is None or min_expr_levels is None: 279 | raise ValueError('must supply both genes and min_expr_levels') 280 | 281 | if len(genes) != len(min_expr_levels): 282 | ll = (len(genes), len(min_expr_levels)) 283 | raise ValueError( 284 | '%d genes and %d min_expr_levels, must be equal.' % ll) 285 | 286 | # tolerate singleton arguments begrudgingly 287 | if isinstance(genes, str): 288 | warnings.warn( 289 | 'casting `genes` to list in `_set_starting_point_expression`.' 290 | ) 291 | genes = [genes] 292 | if isinstance(min_expr_levels, float): 293 | min_expr_levels = [min_expr_levels] 294 | warnings.warn( 295 | 'casting `min_expr_levels` to list in `_set_starting_point_expression`.' 296 | ) 297 | 298 | if use_raw: 299 | ad = self.vadata.raw 300 | else: 301 | ad = self.vadata 302 | 303 | # get cells that express the relevant genes at the minimum 304 | # levels specified 305 | bindices = [] 306 | for i, g in enumerate(genes): 307 | expr = ad[:, g].X 308 | if not isinstance(expr, np.ndarray): 309 | expr = expr.toarray() 310 | bidx = (expr > min_expr_levels[i]).flatten() 311 | bindices.append(bidx) 312 | 313 | # take only cells meeting all criteria as starting points 314 | bidx = np.logical_and.reduce(bindices) 315 | self.starting_points = self.pfield[bidx, :] 316 | return 317 | 318 | def set_starting_point( 319 | self, 320 | method: str, 321 | **kwargs, 322 | ) -> None: 323 | """Set starting points for phase point simulations. 324 | Uses metadata, embedding locations, or gene expression values. 325 | 326 | Parameters 327 | ---------- 328 | method : str 329 | {'metadata', 'embedding', 'expression'}. 330 | **kwargs : dict 331 | passed to the relevant `._set_starting_point_{method}` function. 332 | 333 | Returns 334 | ------- 335 | None. sets `.starting_points`. 336 | 337 | Notes 338 | ----- 339 | Calls the relevant method for setting starting points based 340 | on the `method` argument and passes remaining keyword arguments. 341 | """ 342 | # check argument validity 343 | acceptable_methods = ['metadata', 'embedding', 'expression'] 344 | if method not in acceptable_methods: 345 | raise ValueError('%s is not an acceptable method.' % method) 346 | 347 | if not hasattr(self, 'pfield'): 348 | raise ValueError( 349 | 'must set a `pfield` with `set_velocity_field` first.') 350 | 351 | f = getattr(self, '_set_starting_point_'+method) 352 | f(**kwargs) 353 | return 354 | 355 | def _identity_placeholder( 356 | self, 357 | x: typing.Any, 358 | ) -> typing.Any: 359 | """An identity function that returns an argument 360 | without modification. Useful as a placeholder.""" 361 | return x 362 | 363 | def _boundary_constraint_fence( 364 | self, 365 | x: np.ndarray, 366 | ) -> np.ndarray: 367 | """Imposes a boundary constraint on phase point position 368 | `x` by forcing each dimension to sit within a pre-defined 369 | fence. 370 | 371 | Parameters 372 | ---------- 373 | x : np.ndarray 374 | [D,] position of a phase point. 375 | 376 | Returns 377 | ------- 378 | x_constrained : np.ndarray 379 | [D,] position of the phase point with dimensions clamped 380 | to a pre-defined fence.
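# --- illustrative fence behavior (synthetic numbers) ---
# `np.clip` is applied elementwise in the body below, so each dimension
# is clamped to its own [min, max] interval, e.g.:
# np.clip([1.5, -7.0], [-5., -5.], [5., 5.]) -> [1.5, -5.0]
# --- end sketch ---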
381 | """ 382 | # clip dimensions to fit within the boundary 383 | x_constrained = np.clip( 384 | x, 385 | self.boundary_fence['min'], 386 | self.boundary_fence['max'], 387 | ) 388 | return x_constrained 389 | 390 | def _boundary_constraint_nn_dist( 391 | self, 392 | x: np.ndarray, 393 | ) -> np.ndarray: 394 | """Imposes a boundary constraint on phase point position 395 | `x` by forcing `x` to the nearest point that is less than 396 | a predefined distance from its nearest neighbors. 397 | 398 | Parameters 399 | ---------- 400 | x : np.ndarray 401 | [D,] position of a phase point. 402 | 403 | Returns 404 | ------- 405 | x_constrained : np.ndarray 406 | [D,] position of the phase point with dimensions clamped 407 | to a pre-defined fence. 408 | 409 | Notes 410 | ----- 411 | Phase points are contrained to a maximum distance from their 412 | nearest neighbor. This distance can be adaptively determined 413 | by taking the median nearest neighbor distance from the data 414 | set and using some multiple of this distance as the boundary 415 | constraint. 416 | 417 | When a phase point passes beyond this distance, a distance 418 | vector is computed between the point and the neighbor, and 419 | the point location is shrunken along the vector to satisfy 420 | the boundary constraint. 421 | 422 | See Also 423 | -------- 424 | `.set_boundaries`. 425 | """ 426 | if len(x.shape) == 1: 427 | # pad to a [1, N] matrix for sklearn 428 | x = np.expand_dims(x, 0) 429 | # compute the distance to the nearest neighbor 430 | distances, indices = self.boundary_nn.kneighbors(x) 431 | if distances[0, 0] < self.max_nn_distance: 432 | x_constrained = x 433 | else: 434 | nn_point = self.pfield[indices[0, 0]:indices[0, 0]+1, :] 435 | d_vec = x - nn_point 436 | # how much larger is the difference vector than what we allow? 437 | scale_factor = self.max_nn_distance / distances[0, 0] 438 | # scale the difference vector and compute x_constrained 439 | # as this scaled vector moving away from the NN 440 | d_vec *= scale_factor 441 | x_constrained = nn_point + d_vec 442 | return x_constrained 443 | 444 | def set_boundaries( 445 | self, 446 | method: str = 'fence', 447 | borders: tuple = None, 448 | max_nn_distance: float = None, 449 | boundary_knn: int = 5, 450 | ) -> None: 451 | """Impose boundaries for phase point simulations. 452 | During evolution, phase points will not move beyond 453 | these boundaries. This can prevent numerical instability 454 | issues where a phase point travels "off the map". 455 | 456 | Parameters 457 | ---------- 458 | method : str 459 | one of {'fence', 'nn'}. 460 | fence - restrict phase points to a "fence" of the basis described 461 | with minimum and maximum values for each dimension. 462 | nn - restrict phase points to a maximum distance away from their 463 | nearest neighbor. this maximum distance is determined either 464 | empirically or by taking the median nearest neighbor distance 465 | from the data set. when points travel beyond this distance, they 466 | are shrunken back toward the neighbor along the distance vector. 467 | borders : tuple 468 | ((min_i, max_i), ...) for each dimension of the basis. 469 | only used if `method` is "fence". 470 | max_nn_distance : float 471 | maximum distance a phase point may travel from the 472 | nearest neighbor. if `None`, set to the median nearest neighbor 473 | distance in the data set. 474 | only used if `method` is "nn". 475 | boundary_knn : int 476 | number of nearest neighbors to use for 'nn' boundary fencing. 
477 | moves cells toward the centroid of this nearest neighbor group. 478 | 479 | Returns 480 | ------- 481 | None. Sets `.boundary_constraint` attribute. 482 | 483 | See Also 484 | -------- 485 | _boundary_constraint_fence 486 | _boundary_constraint_nn_dist 487 | """ 488 | # check argument validity 489 | if method not in ('fence', 'nn'): 490 | raise NotImplementedError( 491 | '%s is not an implemented method.' % method) 492 | 493 | if method.lower() == 'fence': 494 | if borders is None: 495 | raise ValueError('must specify borders if method is fence.') 496 | # unpack border criteria into an attribute 497 | self.boundary_fence = {} 498 | self.boundary_fence['min'] = [x[0] for x in borders] 499 | self.boundary_fence['max'] = [x[1] for x in borders] 500 | # set the boundary constraint function to consider 501 | # the border fence during phase point updates 502 | self.boundary_constraint = self._boundary_constraint_fence 503 | elif method.lower() == 'nn': 504 | if not hasattr(self, 'pfield'): 505 | raise ValueError( 506 | 'must `set_velocity_field` before NN boundaries.') 507 | # the "nearest neighbor" to each point after fitting the NN 508 | # model is the point itself, so we fit k = 2 here and take 509 | # the "second" nearest neighbor for each point when predicting 510 | # on the points themselves. Note that since phase points aren't 511 | # in the training set, we subsequently use only the first neighbor. 512 | self.boundary_nn = NearestNeighbors( 513 | n_neighbors=2, metric='euclidean') 514 | self.boundary_nn.fit(self.pfield) 515 | if max_nn_distance is None: 516 | # compute nearest neighbor distances in the data set 517 | distances, indices = self.boundary_nn.kneighbors( 518 | self.pfield, n_neighbors=boundary_knn + 1) 519 | median_distance = np.median(distances[:, 1:boundary_knn+1]) 520 | self.max_nn_distance = median_distance 521 | else: 522 | self.max_nn_distance = max_nn_distance 523 | self.boundary_constraint = self._boundary_constraint_nn_dist 524 | return 525 | 526 | def _velocity_knn( 527 | self, 528 | x: np.ndarray, 529 | ) -> np.ndarray: 530 | """Calculate the velocity of a given position based 531 | on the average velocity of the k-NN to that position. 532 | 533 | Parameters 534 | ---------- 535 | x : np.ndarray 536 | [D,] position vector in embedding space. 537 | 538 | Returns 539 | ------- 540 | nn_v : np.ndarray 541 | [D, (Mean, Std)] velocity vector in embedding space.
542 | 543 | See Also 544 | -------- 545 | .velocity_k 546 | """ 547 | # find nearest neighbors 548 | nn_dist, nn_idx = self.v_nn.kneighbors( 549 | x.reshape(1, -1), 550 | return_distance=True, 551 | ) 552 | 553 | nn_idx = nn_idx.flatten() 554 | 555 | # calculate the velocity vector 556 | if self.vknn_method == 'deterministic': 557 | nn_v_mu = self.vfield[nn_idx, :].mean(0) 558 | elif self.vknn_method == 'stochastic': 559 | # fit a multivariate Gaussian to the observed 560 | # RNA velocity vectors of the nearest neighbors 561 | 562 | # compute weights for each neighboring cell 563 | weights = stats.norm.pdf(x=nn_dist, scale=self.mean_nn_distance) 564 | 565 | weights_mat = np.tile( 566 | weights.reshape(-1, 1), 567 | (1, self.vfield.shape[1]), 568 | ) 569 | mu = np.sum(weights_mat*self.vfield[nn_idx, :], 0)/np.sum(weights) 570 | # get weighted covariance 571 | # \Sigma = \frac{1}{\sum_{i=1}^{N} w_i - 1} 572 | # \sum_{i=1}^N w_i \left(x_i - \mu^*\right)^T \left(x_i - \mu^*\right) 573 | 574 | cov = np.cov( 575 | self.vfield[nn_idx, :], 576 | aweights=weights.flatten(), 577 | rowvar=False, 578 | ) 579 | 580 | # init a multivariate normal with the weighted 581 | # mean and covariance 582 | norm = stats.multivariate_normal( 583 | mean=mu, 584 | cov=cov, 585 | ) 586 | # sample from the fitted Gaussian 587 | nn_v_mu = norm.rvs() 588 | elif self.vknn_method == 'knn_random_sample': 589 | # randomly sample a velocity vector 590 | # from one of the nearest neighbors 591 | ridx = int(np.random.choice(nn_idx)) 592 | nn_v_mu = self.vfield[ridx, :] 593 | else: 594 | msg = f'{self.vknn_method} is not a valid method for ._velocity_knn' 595 | raise AttributeError(msg) 596 | 597 | nn_v_sd = self.vfield[nn_idx, :].std(0) 598 | nn_v = np.stack([nn_v_mu, nn_v_sd], -1) 599 | return nn_v 600 | 601 | def _evolve( 602 | self, 603 | x0_idx: int, 604 | ) -> np.ndarray: 605 | """ 606 | Place a phase point at `x0` and evolve for `t` timesteps. 607 | 608 | Parameters 609 | ---------- 610 | x0_idx : int 611 | index into `self.starting_points`. 612 | 613 | Returns 614 | ------- 615 | trajectory : np.ndarray 616 | [T, D, (Position, V_mu, V_sig)] trajectory of the 617 | phase point.
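# --- the update rule implemented below, restated for clarity ---
# x_{t+1} = boundary_constraint( x_t + (v_mu(x_t) + noise) * step_scale ),
# noise = noise_scale * v_sig(x_t) * eps, with eps ~ N(0, I).
# worked example: x_t = [1.0, 0.0], v_mu = [0.2, -0.1], step_scale = 1,
# and noise_scale = 0 gives x_{t+1} = [1.2, -0.1].
# --- end sketch ---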
618 | """ 619 | x0 = self.starting_points[x0_idx, :] 620 | if type(x0) != np.ndarray: 621 | x0 = x0.toarray() 622 | x0 = x0.flatten() 623 | # [T, Dims, (Position, Velocity)] 624 | trajectory = np.zeros( 625 | (self.timesteps, x0.shape[0], 3), dtype=np.float32) 626 | 627 | # for each timestep, update the position of the phase point 628 | # based on the velocity of nearest neighbors and obey any 629 | # boundary constraints 630 | x = x0 631 | for t in range(self.timesteps): 632 | trajectory[t, :, 0] = x # match x position to dv/dx 633 | v = self.v_fxn(x=x.reshape(-1),) 634 | # add white noise if desired to better emulate a stochastic process 635 | noise = v[:, 1] * np.random.randn(v.shape[0]) * self.noise_scale 636 | x_new = x + (v[:, 0] + noise)*self.step_scale 637 | # constrain to a set of pre-defined boundaries 638 | # defaults to an identity if not set explicitly 639 | x_new = self.boundary_constraint(x_new) 640 | trajectory[t, :, 1] = v[:, 0] 641 | trajectory[t, :, 2] = v[:, 1] 642 | x = x_new 643 | return trajectory 644 | 645 | def _evolve2disk(self, **kwargs) -> str: 646 | """Performs phase point evolution, but saves results to disk rather 647 | than returning the array.""" 648 | raise NotImplementedError('evolve2disk is not yet implemented.') 649 | 650 | def __getstate__(self) -> dict: 651 | """Redefine __getstate__ to allow serialization of class methods. 652 | `anndata.AnnData` doesnt support serialization. 653 | """ 654 | self_dict = self.__dict__.copy() 655 | # we remove large objects from `__getstate__` to allow 656 | # pickling for `multiprocessing.Pool` workers without 657 | # high memory overhead 658 | del self_dict['adata'] 659 | del self_dict['vadata'] 660 | return self_dict 661 | 662 | def simulate_phase_points( 663 | self, 664 | n_points: int = 1000, 665 | n_timesteps: int = 1000, 666 | velocity_method: str = 'knn', 667 | velocity_method_attrs: dict = { 668 | 'vknn_method': 'deterministic', 669 | }, 670 | step_scale: float = 1., 671 | noise_scale: float = 0., 672 | multiprocess: bool = False, 673 | ) -> np.ndarray: 674 | """Simulate phase points moving through the velocity field. 675 | 676 | Parameters 677 | ---------- 678 | n_points : int 679 | number of points to simulate. 680 | n_timesteps : int 681 | number of timesteps for evolution. 682 | velocity_method : str 683 | method for estimating velocity during phase point evolution. 684 | one of {'knn', 'v_model'}. 685 | if 'v_model', must set the `.v_model` attribute with a Callable 686 | that takes in a position and outputs a velocity. useful if you 687 | want to train a model to map positions to velocities. 688 | velocity_method_attrs: dict 689 | attributes for use in a particular velocity method. 690 | keys are attribute names added to `self` with corresponding 691 | values. 692 | step_scale : float 693 | scaling factor for steps in the embedding space. 694 | noise_scale : float 695 | scaling factor for noise introduced during simulation. 696 | defaults to a noiseless simulation. 697 | multiprocess : bool 698 | use multiprocessing. 699 | 700 | Returns 701 | ------- 702 | trajectories : np.ndarray 703 | [PhasePoints, Time, Dimensions, (Position, V_mu, V_sig)] 704 | trajectories of phase points in the velocity field. 705 | also sets `.trajectories` attribute. 706 | 707 | Notes 708 | ----- 709 | TODO: multithread these operations 710 | """ 711 | # check argument validity 712 | if velocity_method not in ['knn', 'v_model']: 713 | raise ValueError( 714 | '%s is not a valid velocity method.' 
% velocity_method) 715 | 716 | if not hasattr(self, 'vfield') or not hasattr(self, 'pfield'): 717 | raise ValueError( 718 | 'must first set velocity field with `set_velocity_field`.') 719 | 720 | if not hasattr(self, 'starting_points'): 721 | raise ValueError( 722 | 'must first set starting points with `set_starting_point`.') 723 | 724 | if velocity_method == 'knn': 725 | self.v_fxn = self._velocity_knn 726 | if 'vknn_method' not in velocity_method_attrs: 727 | msg = 'velocity_method knn requires a "vknn_method" attribute.' 728 | raise ValueError(msg) 729 | # fit a nearest neighbors model to the data 730 | self.v_nn = NearestNeighbors(n_neighbors=self.velocity_k) 731 | self.v_nn.fit(self.pfield) 732 | 733 | # get the mean distance between nearest neighbors 734 | d, _ = self.v_nn.kneighbors(self.pfield) 735 | self.mean_nn_distance = d[:, 1].mean() 736 | 737 | elif velocity_method == 'v_model': 738 | if not hasattr(self, 'v_model'): 739 | raise ValueError('must set a `v_model` attribute first.') 740 | self.v_fxn = self.v_model 741 | else: 742 | msg = f'{velocity_method} is not a valid velocity method.' 743 | raise ValueError(msg) 744 | 745 | if velocity_method_attrs is not None: 746 | # add the velocity method attrs to self 747 | for k in velocity_method_attrs.keys(): 748 | setattr(self, k, velocity_method_attrs[k]) 749 | 750 | self.timesteps = n_timesteps 751 | self.step_scale = step_scale 752 | self.noise_scale = noise_scale 753 | 754 | if multiprocess: 755 | # get a set of starting locations 756 | ridx = np.random.choice(np.arange(self.starting_points.shape[0]), 757 | size=n_points, 758 | replace=True) 759 | # open a process pool 760 | p = ProcessPool() 761 | # distribute tasks to workers 762 | res = p.map(self._evolve, ridx.tolist()) 763 | p.close() 764 | # aggregate trajectory results 765 | trajectories = np.stack(res, 0) 766 | else: 767 | trajectories = np.zeros( 768 | ( 769 | n_points, 770 | n_timesteps, 771 | self.pfield.shape[1], 772 | 3, 773 | ), 774 | dtype=np.float32, 775 | ) 776 | for i in tqdm.tqdm( 777 | range(n_points), 778 | desc='simulating trajectories' 779 | ): 780 | 781 | # select a random starting point 782 | ridx = np.random.choice( 783 | np.arange(self.starting_points.shape[0]), 784 | size=1, 785 | replace=False, 786 | ) 787 | 788 | # simulate the trajectory! 789 | phase_traj = self._evolve(x0_idx=ridx,) 790 | trajectories[i, :, :, :] = phase_traj 791 | 792 | self.trajectories = trajectories 793 | return trajectories 794 | 795 | 796 | ########################################## 797 | # plotting methods 798 | ########################################## 799 | 800 | def plot_phase_simulations( 801 | adata: anndata.AnnData, 802 | trajectories: np.ndarray, 803 | basis: str = 'pca', 804 | figsize: tuple = (6, 4), 805 | point_color='lightgray', 806 | trajectory_cmap='Purples', 807 | n_colors: int = 40, 808 | **kwargs, 809 | ) -> (matplotlib.figure.Figure, matplotlib.axes.Axes): 810 | """Plot phase simulation trajectories. 811 | 812 | Parameters 813 | ---------- 814 | adata : anndata.AnnData 815 | [Cells, Genes] experiment object. 816 | trajectories : np.ndarray 817 | [PhasePoints, Time, Dimensions, (Position, V_mu, V_sig)] 818 | trajectories of phase points in the velocity field. 819 | basis : str 820 | coordinate basis in `adata.obsm` to use. 821 | retrieves `adata.obsm[f'X_{basis}']`. 822 | figsize : tuple 823 | (W, H) for matplotlib figure. 824 | point_color : str 825 | color to use for observed cell coordinate points.
826 | trajectory_cmap : str 827 | colormap to use for plotting trajectories. 828 | single color maps (e.g. "Purples", "Blues") work well. 829 | n_colors : int 830 | number of steps in the color gradient and number of unique 831 | points to plot for each trajectory. 832 | 833 | Returns 834 | ------- 835 | fig : matplotlib.figure.Figure 836 | ax : matplotlib.axes.Axes 837 | """ 838 | 839 | E = adata.obsm[f'X_{basis}'] 840 | 841 | fig, ax = plt.subplots(1, 1, figsize=figsize) 842 | ax.scatter( 843 | E[:, 0], 844 | E[:, 1], 845 | color=point_color, 846 | alpha=0.5, 847 | ) 848 | 849 | n_steps = trajectories.shape[1] 850 | 851 | gradient = sns.color_palette(trajectory_cmap, n_colors) 852 | for i, t in enumerate( 853 | np.arange(0, n_steps, n_steps//n_colors)[:-1][:n_colors] 854 | ): 855 | T = trajectories[:, t, :, 0] 856 | ax.scatter( 857 | T[:, 0], 858 | T[:, 1], 859 | color=gradient[i], 860 | **kwargs, 861 | ) 862 | ax.set_xlabel(f'{basis} 1') 863 | ax.set_ylabel(f'{basis} 2') 864 | ax.set_title(f'Phase Points - {basis} Basis') 865 | return fig, ax 866 | --------------------------------------------------------------------------------
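# --- end-to-end usage sketch for PhaseSimulation (illustrative) ---
# group names, basis, and parameter values below are hypothetical
sim = PhaseSimulation(adata)
sim.set_velocity_field(groupby='cell_type', group='progenitor', basis='pca')
sim.set_starting_point(method='metadata', groupby='leiden', group='0')
sim.set_boundaries(method='nn')  # clamp phase points near observed cells
trajectories = sim.simulate_phase_points(
    n_points=200,
    n_timesteps=500,
    velocity_method='knn',
    velocity_method_attrs={'vknn_method': 'stochastic'},
    step_scale=0.5,
    noise_scale=0.1,
)
fig, ax = plot_phase_simulations(adata, trajectories, basis='pca', s=2)
fig.savefig('phase_simulations.png', dpi=300)
# --- end sketch ---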