├── .gitignore
├── LICENSE
├── README.md
├── cytopus
    ├── __init__.py
    ├── data
    │   ├── Cytopus_1.2.txt
    │   ├── Cytopus_1.22.txt
    │   ├── Cytopus_1.23.txt
    │   ├── Cytopus_1.31nc.txt
    │   └── adata_spectra.h5ad
    ├── knowledge_base
    │   ├── __init__.py
    │   └── kb_queries.py
    └── tl
    │   ├── __init__.py
    │   ├── create.py
    │   ├── hierarchy.py
    │   └── label.py
├── img
    ├── celltype_hierarchy_1.2.png
    └── cytopus_v1.1_stable_graph.png
├── notebooks
    ├── Cytopus_utils_tutorial.ipynb
    ├── Hierarchical_annotation_tutorial.ipynb
    ├── KnowledgeBase_construct.ipynb
    ├── KnowledgeBase_queries_colaboratory.ipynb
    └── Utils_tutorial.ipynb
└── setup.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | #Python build files
 2 | /dist/
 3 | __pycache__/
 4 | /*-env/
 5 | /env-*/
 6 | /build/
 7 | /*.egg-info/
 8 | *.egg-info/
 9 | *.egg
10 | 
11 | #Individual files
12 | pyproject_deprecated.toml


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2022 Thomas Walle
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Cytopus :octopus: [![DOI](https://zenodo.org/badge/389175717.svg)](https://zenodo.org/badge/latestdoi/389175717)
  2 | 
  3 | 
  4 | ## Single cell omics biology annotations
  5 | 
  6 | ![Image of Cytopus](https://github.com/wallet-maker/cytopus/blob/main/img/cytopus_v1.1_stable_graph.png)
  7 | 
  8 | 
  9 | ## Overview:
 10 | 
 11 | Package to query our single cell genomics KnowledgeBase.
 12 | 
 13 | If you use Cytopus :octopus: or its gene sets please cite the original Cytopus [publications](https://doi.org/10.1038/s41587-023-01940-3) &
 14 | [![DOI](https://zenodo.org/badge/389175717.svg)](https://zenodo.org/badge/latestdoi/389175717). 
 15 | 
 16 | For details see the [license](https://github.com/wallet-maker/cytopus/blob/Cytopus_1.3/LICENSE)
 17 | 
 18 | The KnowledgeBase is provided in graph format based on the networkx package. Central to the KnowledgeBase is a cell type hierarchy and **cellular processes** which correspond to the cell types in this hierarchy. Cell types are supported by gene sets indicative of their **cellular identities**. Moreover, the KnowledgeBase contains metadata about the gene sets, such as authorship and the gene set topic.
 19 | 
 20 | The KnowledgeBase can be queried to retrieve gene sets for specific cell types and organize them in a dictionary format for downstream use with the [Spectra](https://github.com/dpeerlab/spectra) package.
 21 | 
 22 | 
 23 | ## Installation
 24 | 
 25 | Install from PyPI:
 26 | 
 27 | ```
 28 | pip install cytopus
 29 | ```
 30 | 
 31 | Install from source:
 32 | 
 33 | ```
 34 | pip install git+https://github.com/wallet-maker/cytopus.git
 35 | ```
 36 | 
 37 | Some plotting functions require pygraphviz or pyvis. Install either or both:
 38 | 
 39 | pygraphviz using conda:
 40 | ```
 41 | conda install --channel conda-forge pygraphviz
 42 | ```
 43 | 
 44 | pyvis using pip:
 45 | ```
 46 | pip install pyvis
 47 | ```
 48 | 
 49 | ## Tutorial
 50 | 
 51 | ### Quickstart - Querying the Knowledge Base:
 52 | 
 53 | Retrieve default KnowledgeBase (human only):
 54 | 
 55 | ```
 56 | import cytopus as cp
 57 | G = cp.KnowledgeBase()
 58 | ```
 59 | Retrieve custom KnowledgeBase (documentation to build KnowledgeBase object [here](https://github.com/wallet-maker/cytopus/blob/Cytopus_1.3/notebooks/KnowledgeBase_construct.ipynb)):
 60 | ```
 61 | file_path = '~/dir1/dir2/knowledgebase_file.txt'
 62 | G = cp.KnowledgeBase(file_path)
 63 | ```
 64 | Access data in KnowledgeBase:
 65 | ```
 66 | #list of all cell types in KnowledgeBase
 67 | G.celltypes
 68 | #dictionary of all cellular processes in KnowledgeBase: {'process_1':['gene_a','gene_e','gene_y',...],'process_2':['gene_b','gene_u',...],...}
 69 | G.processes
 70 | #dictionary of all cellular identities in KnowledgeBase: {'identity_1':['gene_j','gene_k','gene_z',...],'identity_2':['gene_y','gene_p',...],...}
 71 | G.identities
 72 | #dictionary with gene set properties (for cellular processes or identities)
 73 | G.graph.nodes['gene_set_name']
 74 | ```
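
Since `G.processes` and `G.identities` are plain dictionaries mapping a name to a list of genes, ordinary Python is enough to explore them. The following is a minimal sketch that uses a toy dictionary in place of `G.processes`; the gene set and gene names are placeholders, and the two helper functions are hypothetical (they are not part of Cytopus).

```python
# Toy stand-in for G.processes; all names are placeholders, not real KnowledgeBase entries.
processes = {
    "process_1": ["gene_a", "gene_e", "gene_y"],
    "process_2": ["gene_b", "gene_u"],
    "process_3": ["gene_a", "gene_u", "gene_z", "gene_q"],
}


def gene_sets_containing(gene, gene_sets):
    """Return the sorted names of all gene sets that include `gene`."""
    return sorted(name for name, genes in gene_sets.items() if gene in genes)


def gene_set_sizes(gene_sets):
    """Map each gene set name to its number of genes."""
    return {name: len(genes) for name, genes in gene_sets.items()}


hits = gene_sets_containing("gene_a", processes)
sizes = gene_set_sizes(processes)
print(hits)
print(sizes)
```

With a real KnowledgeBase, the same helpers should work when `processes` is replaced by `G.processes` or `G.identities`.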
 75 | 
 76 | Plot the cell type hierarchy stored in the KnowledgeBase as a directed graph with edges pointing into the direction of the parents:
 77 | ```
 78 | G.plot_celltypes()
 79 | ```
 80 | 
 81 | 
 82 | ![Image of Cell type hierarchy](https://github.com/wallet-maker/cytopus/blob/main/img/celltype_hierarchy_1.2.png)
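
To make the edge direction concrete, here is a minimal sketch of a hierarchy stored as child-to-parent edges and a breadth-first search for the children of a cell type. It uses only the standard library rather than networkx. The parent/child relationships in the toy edge list are illustrative assumptions, and `descendants` is a hypothetical helper, not a Cytopus function.

```python
from collections import deque

# Child -> parent edges, mirroring a hierarchy whose edges point toward the parents.
# These specific relationships are assumptions made for illustration only.
edges = [
    ("leukocyte", "all-cells"),
    ("epi", "all-cells"),
    ("T", "leukocyte"),
    ("B", "leukocyte"),
    ("M", "leukocyte"),
]


def descendants(node, edges, depth_limit=None):
    """Breadth-first list of all cell types below `node` (excluding `node` itself).

    depth_limit=1 returns direct children only, 2 also returns grandchildren, etc.
    """
    children_of = {}
    for child, parent in edges:
        children_of.setdefault(parent, []).append(child)

    found = []
    seen = {node}
    queue = deque([(node, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth_limit is not None and depth >= depth_limit:
            continue  # do not expand nodes at the depth limit
        for child in children_of.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, depth + 1))
    return found


all_below = descendants("all-cells", edges)
direct_children = descendants("all-cells", edges, depth_limit=1)
print(all_below)
print(direct_children)
```

Walking the edges in the opposite direction (from child to parent) yields the parents of a cell type instead.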
 83 | 
 84 | 
 85 | 
 86 | Prepare a nested dictionary assigning cell types to their cellular processes and cellular processes to their corresponding genes. This dictionary can be used as an input for Spectra.
 87 | 
 88 | First, select the cell types which you want to retrieve gene sets for. 
 89 | These cell types can be selected from the cell type hierarchy (see .plot_celltypes() method above)
 90 | ```
 91 | celltype_of_interest = ['M','T','B','epi']
 92 | ```
 93 | 
 94 | Second, select the cell types whose gene sets you want to merge and set as global gene sets for the Spectra package. These gene sets should be valid for all cell types in the data.
 95 | ```
 96 | ##e.g. if you are working with different human cells
 97 | global_celltypes = ['all-cells']
 98 | ##e.g. if you are working with human leukocytes
 99 | global_celltypes = ['all-cells','leukocyte']
100 | ##e.g. if you are working with B cells
101 | global_celltypes = ['all-cells','leukocyte','B']
102 | ```
103 | 
104 | Third, retrieve a dictionary of the format {celltype_a:{process_a:[gene_a,gene_b,...],...},...}.
105 | Decide whether you want to merge gene sets for all children or all parents (unusual) of the selected cell types.
106 | ```
107 | G.get_celltype_processes(celltype_of_interest,global_celltypes = global_celltypes,get_children=True,get_parents =False)
108 | ```
109 | 
110 | Fourth, access the resulting dictionary, which is stored in the KnowledgeBase object:
111 | ```
112 | G.celltype_process_dict
113 | ```
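
The shape of the resulting dictionary can be illustrated with a small, self-contained sketch. It mimics, in simplified form, how gene sets of child cell types are merged under the selected cell type and how the gene sets of the global cell types end up under a `'global'` key. All cell type, gene set and gene names below are hypothetical, and `merge_for_spectra` is an illustrative helper, not the Cytopus implementation.

```python
# Toy gene sets per cell type; every name here is a hypothetical placeholder.
process_dict = {
    "all-cells": {"all_process_x": ["gene_a", "gene_b"]},
    "T": {"T_process_y": ["gene_c"]},
    "CD8-T": {"CD8-T_process_z": ["gene_d", "gene_e"]},
}

# Each selected cell type mapped to itself plus its (assumed) children.
children = {"T": ["T", "CD8-T"]}


def merge_for_spectra(process_dict, children, global_celltypes):
    """Build {celltype: {gene_set: [genes]}, 'global': {gene_set: [genes]}}."""
    merged = {}
    for celltype, members in children.items():
        merged[celltype] = {}
        for member in members:
            # gene sets of children are merged into the selected cell type
            merged[celltype].update(process_dict.get(member, {}))
    merged["global"] = {}
    for celltype in global_celltypes:
        merged["global"].update(process_dict.get(celltype, {}))
    return merged


result = merge_for_spectra(process_dict, children, ["all-cells"])
for key, gene_sets in result.items():
    print(key, sorted(gene_sets))
```

In this sketch `'T'` ends up with both its own gene set and the one from its child, while the `'all-cells'` gene set is only available under `'global'`.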
114 | 
115 | ### Detailed tutorial for Querying the Knowledge Base:
116 | Learn how to explore the Knowledge Base and retrieve a dictionary which can be used for [Spectra](https://github.com/dpeerlab/spectra):
117 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wallet-maker/cytopus/blob/main/notebooks/KnowledgeBase_queries_colaboratory.ipynb)
118 | 
119 | ### Detailed tutorial for Generating a cytopus Knowledge Base object:
120 | Learn how to create a Knowledge Base object from gene sets annotations and cell type hierarchies stored in .csv files:
121 | [here](https://github.com/wallet-maker/cytopus/blob/main/notebooks/KnowledgeBase_construct.ipynb)
122 | 
123 | ### Utils tutorial - Labeling Factor Analysis Outputs (Spectra):
124 | Learn how to label marker genes from factor analysis, determine factor cell type specificity and export the Knowledge Base content as .gmt files for other applications:
125 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wallet-maker/cytopus/blob/main/notebooks/Cytopus_utils_tutorial.ipynb)
126 | 
127 | ### Hierarchy tutorial
128 | Hierarchically annotate and query cells using AnnData and Cytopus:
129 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wallet-maker/cytopus/blob/main/notebooks/Hierarchical_annotation_tutorial.ipynb)
130 | 
131 | ## Submit gene sets
132 | You can submit gene sets to be added to the KnowledgeBase here:
133 | 
134 | https://docs.google.com/forms/d/e/1FAIpQLSfWU7oTZH8jI7T8vFK0Nqq2rfz6_83aJIVamH5cogZQMlciFQ/viewform?usp=sf_link
135 | 
136 | All submissions will be reviewed and, if needed, revised before they are added to the database. This ensures consistency of the annotations and avoids gene set duplication. Authorship will be acknowledged in the KnowledgeBase for all submitted gene sets which pass review and are added to the KnowledgeBase. You can also create entirely new KnowledgeBase objects with this package.
137 | 
138 | ## Citation and Usage 
139 | 
140 | For gene sets from external sources, you must also abide by the licenses of the original gene sets. To make this easier, we have stored these in the KnowledgeBase object:
141 | 
142 | ```
143 | import cytopus as cp
144 | G = cp.KnowledgeBase()
145 | gene_set_of_interest = 'all_macroautophagy_regulation_positive'
146 | print(G.graph.nodes[gene_set_of_interest])
147 | ```
148 | 
149 | 


--------------------------------------------------------------------------------
/cytopus/__init__.py:
--------------------------------------------------------------------------------
1 | """A Knowledge Base for Single Cell Biology"""
2 | 
3 | from .tl import *
4 | from .knowledge_base import KnowledgeBase, get_data


--------------------------------------------------------------------------------
/cytopus/data/Cytopus_1.2.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wallet-maker/cytopus/638dd917cdd047e322213b6e4767bf0b9b78bf80/cytopus/data/Cytopus_1.2.txt


--------------------------------------------------------------------------------
/cytopus/data/Cytopus_1.22.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wallet-maker/cytopus/638dd917cdd047e322213b6e4767bf0b9b78bf80/cytopus/data/Cytopus_1.22.txt


--------------------------------------------------------------------------------
/cytopus/data/Cytopus_1.23.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wallet-maker/cytopus/638dd917cdd047e322213b6e4767bf0b9b78bf80/cytopus/data/Cytopus_1.23.txt


--------------------------------------------------------------------------------
/cytopus/data/Cytopus_1.31nc.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wallet-maker/cytopus/638dd917cdd047e322213b6e4767bf0b9b78bf80/cytopus/data/Cytopus_1.31nc.txt


--------------------------------------------------------------------------------
/cytopus/data/adata_spectra.h5ad:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wallet-maker/cytopus/638dd917cdd047e322213b6e4767bf0b9b78bf80/cytopus/data/adata_spectra.h5ad


--------------------------------------------------------------------------------
/cytopus/knowledge_base/__init__.py:
--------------------------------------------------------------------------------
1 | """Tools to plot and query KnowledgeBase"""
2 | from .kb_queries import KnowledgeBase, get_data
3 | 
4 | 
5 | 
6 | 
7 | 


--------------------------------------------------------------------------------
/cytopus/knowledge_base/kb_queries.py:
--------------------------------------------------------------------------------
  1 | from os.path import dirname
  2 | import matplotlib.pyplot as plt
  3 | import pickle
  4 | import pkg_resources
  5 | import networkx as nx
  6 | import numpy as np
  7 | 
  8 | 
  9 | def get_data(filename):
 10 |     """
 11 |     Load data from cytopus/data.
 12 |     """
 13 |     return pkg_resources.resource_filename('cytopus', 'data/' + filename)
 14 | 
 15 | #nested dict with celltype hierarchy
 16 | def extract_hierarchy(G, node='all-cells',invert=False):
 17 |     '''
 18 |     extract cell type hierarchy from KnowledgeBase object as a nested dictionary
 19 |     G: cytopus.kb.KnowledgeBase, with a hierarchy of celltypes
 20 |     node: str, celltype to use as starting point in the hierarchy (e.g. 'all-cells')
 21 |     invert: bool, if False the dict will contain all children below the node, if True the dict will contain all parents above the node
 22 |     '''
 23 |     node_list_plot = G.celltypes
 24 | 
 25 |     def filter_node(n1):
 26 |         return n1 in node_list_plot
 27 | 
 28 |     view = nx.subgraph_view(G.graph, filter_node=filter_node)
 29 |     if invert:
 30 |         predecessors = view.successors(node)
 31 |     else:
 32 |         predecessors = view.predecessors(node)
 33 |     if not predecessors:
 34 |         return node
 35 |     else:
 36 |         return {s: extract_hierarchy(G, s, invert=invert) for s in predecessors}
 37 | 
 38 | 
 39 | class KnowledgeBase:
 40 |     def __init__(self, graph=None):
 41 |         '''
 42 |         load KnowledgeBase from file
 43 |         retrieve all cell types in KnowledgeBase
 44 |         create dictionary for cellular processes in KnowledgeBase
 45 |         graph: str or networkx.DiGraph, path to pickled networkx.DiGraph object formatted for cytopus or networkx.DiGraph
 46 |         '''
 47 |         
 48 |         # Initialise default graph data
 49 |         if graph is None:
 50 |             graph = get_data("Cytopus_1.31nc.txt")
 51 |         # load KnowledgeBase from pickled file
 52 |         if isinstance(graph, nx.classes.digraph.DiGraph):
 53 |             self.graph = graph
 54 |         elif isinstance(graph, str):
 55 |             with open(graph, 'rb') as f:  # notice the r instead of w
 56 |                 self.graph = pickle.load(f) 
 57 |         else:
 58 |             raise(ValueError('graph must be path (str) or networkx.classes.digraph.DiGraph object'))
 59 |         #retrieve all cell types from data
 60 |         self.celltypes = self.filter_nodes(attribute_name = 'class',attributes= ['cell_type'],origin=None,target=None)
 61 |         
 62 |         #create gene set : gene dict for all cellular processes
 63 |         processes = self.get_processes(gene_sets = list(set([x[0] for x in self.filter_edges(attribute_name = 'class', attributes = ['process_OF'],target=self.celltypes)])))
 64 |         processes = {k:[x for x in v if x not in ['nan',np.nan]] for k,v in processes.items()}
 65 |         self.processes = processes 
 66 |         #self.processes = #self.filter_nodes(attribute_name = 'class',attributes= ['processes'],origin=None,target=None)
 67 |         print(self)
 68 |         #create gene set : gene dict for all cellular identities
 69 |         identities = self.get_identities(self.filter_nodes(attribute_name = 'class',attributes=['cell_type'],origin=None,target=None))
 70 |         identities = {k:[x for x in v if x not in ['nan',np.nan]] for k,v in identities.items()}
 71 |         self.identities = identities
 72 |         
 73 |     
 74 |     def __str__(self):
 75 |         # return the summary instead of printing it, so that both print(self) and str(self) work
 76 |         return f"KnowledgeBase object containing {len(self.celltypes)} cell types and {len(self.processes)} cellular processes"
 77 |     
 78 |     def filter_nodes(self, attributes,attribute_name=None, 
 79 |                      origin=None,target=None):
 80 |         '''
 81 |         filter nodes in networkx graph for their attributes
 82 |         G: networkx graph
 83 |         attribute_name: attribute key in node dictionary
 84 |         attributes: attribute list to select
 85 |         origin: list of node origin of node
 86 |         target: list of node end/target
 87 |         '''
 88 |         node_list = [] 
 89 |         if attribute_name == None:
 90 |             node_list = self.graph.nodes
 91 |         else:
 92 |             for x in self.graph.nodes:
 93 |                 if attribute_name in self.graph.nodes[x].keys():
 94 |                     if self.graph.nodes[x][attribute_name] in attributes:
 95 |                         node_list.append(x)
 96 |         if origin!=None:
 97 |             node_list = [x for x in node_list if x[0] in origin]
 98 |         if target!=None:
 99 |             node_list = [x for x in node_list if x[1] in target]
100 |         return node_list
101 |     
102 |     
103 |         
104 | 
105 |     def filter_edges(self,  attributes,attribute_name=None,origin=None,target=None, ):
106 |         '''
107 |         filter edges in networkx graph for their attributes
108 |         G: networkx graph
109 |         attribute_name: attribute key in edge dictionary
110 |         attributes: attribute list to select
111 |         origin: list of node origin of edge
112 |         target: list of node end/target
113 |         '''
114 |         edge_list = [] 
115 |         if attribute_name == None:
116 |             edge_list = self.graph.edges
117 |         else:
118 |             for x in self.graph.edges:
119 |                 if attribute_name in self.graph.edges[x].keys():
120 |                     if self.graph.edges[x][attribute_name] in attributes:
121 |                         edge_list.append(x)
122 |         if origin!=None:
123 |             edge_list = [x for x in edge_list if x[0] in origin]
124 |         if target!=None:
125 |             edge_list = [x for x in edge_list if x[1] in target]
126 |         return edge_list
127 |     
128 |     def get_celltype_hierarchy(self, node='all-cells',invert=False):
129 |         #retrieve hierarchy
130 |         hierarchy_dict = extract_hierarchy(self, node=node,invert=invert)
131 |         return hierarchy_dict
132 |         
133 |     def get_processes(self,gene_sets): 
134 |         '''
135 |         create dictionary gene sets for cellular processes : genes
136 |         self: KnowledgeBase object (networkx)
137 |         gene_sets: list of gene sets for cellular processes
138 |         '''
139 |         #get genes per gene set    
140 |         genes = self.filter_nodes(attribute_name = 'class',attributes =['gene'])
141 |         #dictionary geneset : genes
142 |         gene_edges = self.filter_edges( attribute_name ='class', attributes = ['gene_OF'],origin=gene_sets,target=genes)
143 |         gene_set_dict = {}
144 |         for i in gene_edges:
145 |             if i[0] in gene_set_dict.keys():
146 |                 gene_set_dict[i[0]].append(i[1])#
147 |             else:
148 |                 gene_set_dict[i[0]]= [i[1]]#
149 |         return gene_set_dict
150 |     
151 |     def get_celltype_processes(self,celltypes,global_celltypes=[None],get_parents =True, get_children =True, parent_depth=1, child_depth= None, fill_missing=True,parent_depth_dict=None, child_depth_dict=None, inplace=True):
152 |         '''
153 |         get gene sets for specific cell types
154 |         self: KnowledgeBase object (networkx)
155 |         celltypes: list of celltypes to retrieve
156 |         global_celltypes: list of celltypes to set as 'global' for Spectra
157 |         get_parents: also retrieve gene sets for the parents of the cell types in celltypes
158 |         get_children: also retrieve gene sets for the children of the cell types in celltypes
159 |         fill_missing: add an empty dictionary for cell types not found in KnowledgeBase   
160 |         parent_depth: steps from cell type to go up the hierarchy to retrieve gene sets linked to parents (e.g. 2 would be up to grandparents)
161 |         parent_depth_dict: you can also set the depth for specific celltype with a dictionary {celltype1:depth1,celltype2:depth2}
162 |         child_depth: steps from cell type to go down the hierarchy to retrieve gene sets linked to children (e.g. 2 would be down to grandchildren) 
163 |         child_depth_dict: you can also set the depth for specific celltype with a dictionary {celltype1:depth1,celltype2:depth2}
164 |         inplace: bool, if True save output under self.celltype_process_dict
165 |         '''
166 |         import itertools
167 |         import warnings
168 |         from collections import Counter
169 | 
170 |         ## limit to celltype subgraph to retrieve relevant celltypes
171 | 
172 |         node_list_plot = self.celltypes
173 | 
174 |         def filter_node(n1):
175 |             return n1 in node_list_plot
176 | 
177 |         view = nx.subgraph_view(self.graph, filter_node=filter_node)
178 | 
179 |         for x in list(set(celltypes+global_celltypes)):
180 |             if x not in list(view.nodes):
181 |                 warnings.warn('Not all cell types are contained in the Immune Knowledge base')
182 |         if get_parents:
183 |             all_celltypes_parents = {}
184 |             if parent_depth_dict == None:
185 |                 parent_depth_dict = {}
186 | 
187 |             for i in celltypes:
188 |                 if i in view.nodes:#is celltype in KnowledgeBase
189 |                     if i in parent_depth_dict.keys(): #check if query depth was manually defined
190 |                         if parent_depth_dict[i] == None:
191 |                             all_celltypes_parents[i]=  [i] 
192 |                         else:
193 |                             all_celltypes_parents[i]=  [n for n in nx.traversal.bfs_tree(view, i,depth_limit=parent_depth_dict[i])] 
194 |                     else:
195 |                         all_celltypes_parents[i]=  [n for n in nx.traversal.bfs_tree(view, i,depth_limit=parent_depth)] 
196 |                 elif fill_missing: 
197 |                     all_celltypes_parents[i] = {} #if not add an empty dictionary
198 |                     print('adding empty dictionary for cell type:',i)
199 |                 else:
200 |                     all_celltypes_parents[i]=  [i]
201 |                     print('cell type of interest',i,'is not in the knowledge base')
202 |         if get_children:
203 |             all_celltypes_children = {}
204 |             if child_depth_dict == None:
205 |                 child_depth_dict = {}
206 | 
207 |             for i in celltypes:
208 |                 if i in view.nodes:
209 |                     if i in child_depth_dict.keys(): #check if query depth was manually defined
210 |                         if child_depth_dict[i] ==None:
211 |                             all_celltypes_children[i]=  [i]
212 |                         else:
213 |                             all_celltypes_children[i]=  [n for n in nx.traversal.bfs_tree(view, i,reverse=True,depth_limit=child_depth_dict[i])]
214 |                     else:
215 |                         all_celltypes_children[i]=  [n for n in nx.traversal.bfs_tree(view, i,reverse=True,depth_limit=child_depth)]
216 |                 else:
217 |                     all_celltypes_children[i]=  [i]
218 |                     print('cell type of interest',i,'is not in the knowledge base')
219 | 
220 |         if get_parents ==True and  get_children==True:
221 |             all_celltypes = list(itertools.chain.from_iterable(list(all_celltypes_children.values())+list(all_celltypes_parents.values())))
222 |         elif get_parents==True and  get_children==False:
223 |             all_celltypes = list(itertools.chain.from_iterable(list(all_celltypes_parents.values())))
224 |         elif get_parents == False and  get_children==True:
225 |             all_celltypes = list(itertools.chain.from_iterable(list(all_celltypes_children.values())))
226 |         else:
227 |             all_celltypes = []
228 |         all_celltypes = list(set(all_celltypes +  global_celltypes + celltypes))
229 |         
230 |         #get process genesets connected to these celltypes
231 |         gene_set_edges =self.filter_edges(attribute_name = 'class', attributes = ['process_OF'],target=all_celltypes)  
232 | 
233 |         #dictionary gene set  : cell type
234 |         gene_set_celltype_dict = dict(gene_set_edges)
235 |         #dictionary cell type: [gene_set1, gene_set2,...]
236 |         celltype_gene_set_dict = {}
237 | 
238 |         for key,value in gene_set_celltype_dict.items():
239 |             if value in celltype_gene_set_dict.keys():
240 |                 celltype_gene_set_dict[value].append(key)
241 |             else:
242 |                 celltype_gene_set_dict[value] = [key]
243 |         
244 |         #get dict gene sets for cellular processes : genes
245 |         gene_set_dict = self.get_processes(gene_sets = list(set([x[0] for x in gene_set_edges])))
246 |         
247 |         #construct dictionary
248 |         process_dict = {}
249 | 
250 |         for key,value in celltype_gene_set_dict.items():
251 |             process_dict[key] = {}
252 |             for gene_set in value:
253 |                 process_dict[key][gene_set] = gene_set_dict[gene_set]
254 |         if global_celltypes != [None]:
255 |             global_gs = {}
256 |             for i in global_celltypes:
257 |                 if i in process_dict.keys():
258 |                     global_gs.update(process_dict[i])
259 |                     del process_dict[i]
260 |                 else:
261 |                     print('did not find',i,'in cell type keys to set as global')
262 |             process_dict['global'] = global_gs
263 | 
264 |         else:
265 |             print('you must add a "global" key to run Spectra. E.g. set <global_celltypes> to one cell type key to be set as "global"')
266 | 
267 |         ## merge relevant children and parents into cell type specific keys
268 | 
269 |         process_dict_merged = {}
270 |        
271 |         
272 |         if get_children:
273 |             for key,value in all_celltypes_children.items():
274 |                 merged_dict = {}
275 |                 for cell_type in value:
276 |                     if cell_type in process_dict.keys():
277 |                         merged_dict.update(process_dict[cell_type])
278 |                 if key in process_dict_merged.keys():
279 |                     process_dict_merged[key].update(merged_dict) 
280 |                 else:
281 |                     process_dict_merged[key]=merged_dict 
282 | 
283 |         if get_parents:
284 |             for key,value in all_celltypes_parents.items():
285 |                 merged_dict = {}
286 |                 for cell_type in value:
287 |                     if cell_type in process_dict.keys():
288 |                         merged_dict.update(process_dict[cell_type])
289 |                 if key in process_dict_merged.keys():
290 |                     process_dict_merged[key].update(merged_dict)
291 |                 else:
292 |                     process_dict_merged[key]=merged_dict 
293 |                 
294 |         if get_children==False and get_parents==False:
295 |             process_dict_merged =process_dict 
296 |                 
297 |         if global_celltypes != [None]:
298 |             process_dict_merged['global'] = process_dict['global']
299 |             
300 |         ## check if cell types contain shared children or parents
301 |         if get_children:
302 |             shared_children = []
303 |             for key,value in Counter(list(itertools.chain.from_iterable(list(all_celltypes_children.values())))).items():
304 |                 if value >1:
305 |                     shared_children.append(key)
306 |             if shared_children != []:
307 | 
308 |                 print('cell types of interest share the following children:',shared_children,'This may be desired.')
309 |         if get_parents:
310 |             shared_parents = []
311 |             for key,value in Counter(list(itertools.chain.from_iterable(list(all_celltypes_parents.values())))).items():
312 |                 if value >1:
313 |                     shared_parents.append(key)
314 |             if shared_parents != []:
315 |                 print('cell types of interest share the following parents:',shared_parents,'This may be desired.')
316 |         if inplace:
317 |             self.celltype_process_dict = process_dict_merged
318 |             #self.processes = gene_set_dict
319 |         else:
320 |             return process_dict_merged
321 |         
322 |     def get_identities(self, celltypes_identities,include_subsets=False):
323 |         '''
324 |         self: KnowledgeBase object (networkx)
325 |         celltypes_identities: list of cell types to retrieve identity gene sets for
326 |         '''
327 |         if not isinstance(celltypes_identities, list):
328 |             raise TypeError('celltypes_identities must be of type: list')
329 |         if include_subsets:
330 |             def filter_node(n1):
331 |                 return n1 in self.celltypes
332 |            
333 | 
334 |             celltype_view =nx.subgraph_view(self.graph, filter_node=filter_node)
335 |             celltypes_new = []
336 |             for i in celltypes_identities:
337 |                 nodes_of_specific_type =  [n for n in nx.traversal.bfs_tree(celltype_view, i,reverse=True)]
338 |                 celltypes_new += nodes_of_specific_type
339 |             celltypes_identities = list(set(celltypes_new))
340 |             
341 |         identity_edges = self.filter_edges( attribute_name ='class', attributes = ['identity_OF'],target=celltypes_identities)
342 |         
343 |         #construct dictionary geneset:gene
344 |         gene_edges = self.filter_edges( attribute_name ='class', attributes = ['gene_OF'])
345 |         gene_set_dict = {}
346 |         for i in gene_edges:
347 |             if i[0] in gene_set_dict.keys():
348 |                 gene_set_dict[i[0]].append(i[1])#
349 |             else:
350 |                 gene_set_dict[i[0]]= [i[1]]#
351 |                       
352 |         #construct dictionary celltype: identity_geneset
353 |         identity_dict = {}
354 |         
355 |         for edge in identity_edges:
356 |             if edge[1] in list(self.celltypes):
357 |                 identity_gs = gene_set_dict[edge[0]]
358 |                 identity_dict[edge[1]] = identity_gs
359 |             else:
360 |                 print(edge[1],'not contained in KnowledgeBase')
361 |         return identity_dict
362 |         
363 |     def plot_celltypes(self, figure_size = [30,30], node_size = 1000, edge_width= 1, arrow_size=20, 
364 |                        edge_color= 'k', node_color='#8decf5', label_size = 20):
365 |         '''
366 |         plot all celltypes contained in the KnowledgeBase using matplotlib and graphviz
367 |         self: KnowledgeBase object (networkx)
368 |         figure_size: figure size
369 |         node_size: node size in graph
370 |         edge_width: edge width in graph
371 |         arrow_size: arrow size of directed edges
372 |         edge_color: edge color
373 |         node_color: node color
374 |         label_size: size of node labels
375 |         '''
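376 |         try:
377 |             from networkx.drawing.nx_agraph import graphviz_layout
378 |         except ModuleNotFoundError as err:
379 |             # fail early with a clear message instead of a NameError further below
380 |             raise ModuleNotFoundError('please install pygraphviz') from err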
381 |         node_list_plot = self.filter_nodes(attribute_name='class', attributes = ['cell_type'])
382 | 
383 |         def filter_node(n1):
384 |             return n1 in node_list_plot
385 | 
386 |         plt.rcParams["figure.figsize"] = figure_size
387 |         plt.rcParams["figure.autolayout"] = True
388 | 
389 |         view = nx.subgraph_view(self.graph, filter_node=filter_node)
390 | 
391 |         pos=graphviz_layout(view)
392 | 
393 |         nodes = nx.draw_networkx_nodes(view, pos=pos,node_color=node_color,nodelist=None,node_size=node_size,label=True)
394 |         edges = nx.draw_networkx_edges(view, pos=pos, edgelist=None, width=edge_width, edge_color=edge_color, style='solid', alpha=None, arrowstyle=None, 
395 |                                        arrowsize=arrow_size, 
396 |                             edge_cmap=None, edge_vmin=None, edge_vmax=None, ax=None, arrows=None, label=None, 
397 |                             node_size=node_size, nodelist=None, node_shape='o', connectionstyle='arc3', 
398 |                             min_source_margin=0, min_target_margin=0)
399 |         labels = nx.draw_networkx_labels(view,pos=pos,font_size=label_size)
400 |         print('all celltypes in knowledge base:',list(labels.keys()))
401 |         
402 |     def plot_graph_interactive(self, attributes=['cell_type','cellular_process'],colors= ['red','blue'], save_path = 'graph.html'):
403 |         '''
404 |         plot excerpt from the KnowledgeBase using the pyvis package
405 |         self: KnowledgeBase object (networkx)
406 |         attributes: list of node classes to plot
407 |         colors: list of colors in the order of the node classes to plot
408 |         save_path: save path for .html file
409 |         '''
410 |         
411 |         try:
412 |             from pyvis.network import Network
413 |         except ModuleNotFoundError as err:
414 |             #Network is required below, so fail here with a clear message
415 |             raise ModuleNotFoundError('please install pyvis') from err
416 |         
417 |         if len(attributes)!=len(colors):
418 |             #each node class needs exactly one color
419 |             raise ValueError('attributes and colors have to be same length')
420 |         if len(set(attributes))!=len(attributes):
421 |             #duplicate classes would make the color mapping ambiguous
422 |             raise ValueError('attributes have to be unique')
423 |             
424 |         net = Network(notebook=True)
425 |         #cell types
426 |         node_list_plot = self.filter_nodes(attribute_name='class', attributes = attributes)
427 |         def filter_node(n1):
428 |                 return n1 in node_list_plot
429 |         view = nx.subgraph_view(self.graph,filter_node=filter_node)
430 |         net.from_nx(view)
431 | 
432 |         for i in net.nodes:
433 |             index_range = list(range(len(attributes)))
434 |             for v in index_range:
435 |                 if i['class'] == attributes[v]:
436 |                     i['color']= colors[v]
437 |                
438 |         #show
439 |         net.show(save_path)
440 |          
441 | 
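
The identity lookup above groups `gene_OF` edges by gene set and then attaches each identity gene set to its cell type. A minimal standalone sketch of that grouping follows, assuming edges arrive as `(source, target)` tuples; the edge lists, gene names, and `known_celltypes` are hypothetical and not taken from the packaged KnowledgeBase.

```python
from collections import defaultdict

# Hypothetical edges in (source, target) form, mirroring the 'gene_OF' and
# 'identity_OF' edges filtered above. All names are made up for illustration.
gene_edges = [("T_identity", "CD3E"), ("T_identity", "CD3D"), ("B_identity", "CD79A")]
identity_edges = [("T_identity", "T"), ("B_identity", "B"), ("X_identity", "X")]
known_celltypes = {"T", "B"}

# gene set -> list of genes
gene_set_dict = defaultdict(list)
for gene_set, gene in gene_edges:
    gene_set_dict[gene_set].append(gene)

# cell type -> genes of its identity gene set; unknown cell types are reported
identity_dict = {}
for gene_set, celltype in identity_edges:
    if celltype in known_celltypes:
        identity_dict[celltype] = gene_set_dict[gene_set]
    else:
        print(celltype, "not contained in KnowledgeBase")
```

Using `defaultdict(list)` replaces the explicit `if key in dict ... else` branch without changing the result.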


--------------------------------------------------------------------------------
/cytopus/tl/__init__.py:
--------------------------------------------------------------------------------
1 | """Tools to use KnowledgeBase to label and interpret data"""
2 | from . import label 
3 | from . import create
4 | from . import hierarchy as hier


--------------------------------------------------------------------------------
/cytopus/tl/create.py:
--------------------------------------------------------------------------------
 1 | import networkx as nx
 2 | from ..knowledge_base.kb_queries import KnowledgeBase
 3 | def construct_kb(celltype_edges, geneset_gene_edges,geneset_celltype_edges,annotation_dict,metadata_dict=None,save=False, save_path=None):
 4 |     '''
 5 |     construct a cytopus.kb.KnowledgeBase object
 6 |     celltype_edges: list, list of tuples storing the edges of the cell type hierarchy as ('child', 'parent')
 7 |     geneset_gene_edges: list, list of tuples storing the edges connecting every gene_set with every gene as ('gene_set','gene')
 8 |     geneset_celltype_edges: list, list of tuples storing the edges connecting every gene set with its cell type as ('gene_set','celltype')
 9 |     annotation_dict: dict, containing the gene set names as keys and their annotation names (cellular_process or cellular_identity) as values
10 |     metadata_dict: dict, nested dict containing the gene set names as keys and a dict storing their attributes_categories as keys and corresponding attributes as values
11 |     save: bool, if True saves the data to the path provided in save_path
12 |     save_path: str, path to save the data to (.txt file)
13 |     '''
14 | 
15 |     #get genes, genesets, celltypes
16 |     genes = list(set([x[1] for x in geneset_gene_edges]))
17 |     genes = [(x,{'class':'gene'}) for x in genes]
18 |     gene_sets = list(set([x[0] for x in geneset_gene_edges]))
19 |     celltypes = list(set([x[0] for x in celltype_edges]).union(set([x[1] for x in celltype_edges])))
20 |     celltypes = [(x,{'class':'cell_type'})for x in celltypes]
21 | 
22 |     #some sanity checks
23 |     celltypes_in_hierarchy = set([x[0] for x in celltypes])
24 |     celltypes_of_genesets = set([x[1] for x in geneset_celltype_edges])
25 |     set_dif = celltypes_of_genesets - celltypes_in_hierarchy
26 |     if  set_dif != set():
27 |         print('WARNING: missing cell types:',set_dif,'in the cell type hierarchy. Please append cell type hierarchy.')
28 |     else:
29 |         print('all cell types in gene set are contained in the cell type hierarchy')
30 |     
31 |     genesets_in_celltype_edges = set([x[0] for x in geneset_celltype_edges])
32 |     genesets_in_gene_edges = set([x[0] for x in geneset_gene_edges])
33 | 
34 |     if  genesets_in_celltype_edges != genesets_in_gene_edges:
35 |         print('WARNING: Gene sets in geneset_celltype_edges and geneset_gene_edges are not identical')
36 | 
37 |     #set edge attributes (important for queries)
38 |     geneset_gene_edges = [x + ({'class':'gene_OF'},) for x in geneset_gene_edges]
39 |     celltype_edges = [x + ({'class':'SUBSET_OF'},) for x in celltype_edges]
40 | 
41 | 
42 |     #sort processes and identities
43 |     processes = []
44 |     identities = []
45 | 
46 |     for i in gene_sets:
47 |         if annotation_dict[i] == 'cellular_process':
48 |             processes.append(i)
49 |         elif annotation_dict[i] == 'cellular_identity':
50 |             identities.append(i)
51 |         else:
52 |             raise(ValueError('all gene sets annotation names should be either cellular_process or cellular_identity'))
53 | 
54 |     geneset_gene_edges_processes = [x for x in geneset_gene_edges if x[0] in processes]
55 |     geneset_gene_edges_identities = [x for x in geneset_gene_edges if x[0] in identities]
56 |     geneset_celltype_edge_processes = [x + ({'class':'process_OF'},) for x in geneset_celltype_edges if x[0] in processes]
57 |     geneset_celltype_edge_identities = [x + ({'class':'identity_OF'},) for x in geneset_celltype_edges if x[0] in identities]
58 |         
59 |     #construct graph
60 |     G = nx.DiGraph()
61 |     G.add_nodes_from(genes)
62 |     G.add_nodes_from(gene_sets)
63 |     G.add_nodes_from(identities)
64 |     G.add_nodes_from(celltypes)
65 |     G.add_edges_from(geneset_gene_edges_processes)
66 |     G.add_edges_from(geneset_gene_edges_identities)
67 |     G.add_edges_from(celltype_edges)
68 |     G.add_edges_from(geneset_celltype_edge_processes)
69 |     G.add_edges_from(geneset_celltype_edge_identities)
70 | 
71 |     #set node metadata
72 |     if isinstance(metadata_dict,dict):
73 |         nx.set_node_attributes(G, metadata_dict)
74 |     else:
75 |         print('No metadata dictionary provided (optional), skipping metadata assignment.')
76 |     if save:
77 |         if not isinstance(save_path,str):
78 |             print('WARNING: Please provide save_path if you want to save the data. Skipping saving step.')
79 |         else:
80 |             import pickle
81 |             with open(save_path, 'wb') as f:
82 |                 pickle.dump(G, f)
83 |             print('Pickled and saved to:',save_path)
84 |     return KnowledgeBase(graph=G)
85 |     
86 | 
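
The input shapes expected by `construct_kb` may be easier to see on a toy example. The sketch below only builds the edge lists and repeats the cell type sanity check in plain Python; it does not call `construct_kb`, and every cell type, gene set, and gene name in it is hypothetical.

```python
# Hypothetical toy inputs in the shapes described by the construct_kb docstring.
celltype_edges = [("CD8-T", "T"), ("CD4-T", "T"), ("T", "leukocyte")]      # (child, parent)
geneset_gene_edges = [("T_identity", "CD3E"), ("T_exhaustion", "PDCD1")]   # (gene_set, gene)
geneset_celltype_edges = [("T_identity", "T"), ("T_exhaustion", "NK")]     # (gene_set, celltype)
annotation_dict = {"T_identity": "cellular_identity",
                   "T_exhaustion": "cellular_process"}

# Same idea as the sanity check in construct_kb: every cell type that a gene
# set points to should appear somewhere in the hierarchy edges.
celltypes_in_hierarchy = {c for edge in celltype_edges for c in edge}
celltypes_of_genesets = {celltype for _, celltype in geneset_celltype_edges}
missing = celltypes_of_genesets - celltypes_in_hierarchy

# Split gene sets by their annotation, as construct_kb does before adding edges.
processes = sorted(gs for gs, kind in annotation_dict.items() if kind == "cellular_process")
identities = sorted(gs for gs, kind in annotation_dict.items() if kind == "cellular_identity")

print("missing cell types:", missing)
```

Here `NK` is reported as missing because no hierarchy edge mentions it, which is the case `construct_kb` warns about.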


--------------------------------------------------------------------------------
/cytopus/tl/hierarchy.py:
--------------------------------------------------------------------------------
  1 | #import networkx as nx
  2 | from networkx.drawing.nx_agraph import graphviz_layout
  3 | def build_nested_dict(graph, node):
  4 |     '''
  5 |     build nested dictionary from reverse view of cytopus cell type hierarchy
  6 |     graph: networkx.DiGraph.view, reverse view of Cytopus cell type hierarchy
  7 |     node: str, name of the node in the reversed view from which to start (e.g. a root node)
  8 |     '''
  9 |     nested_dict = {node: {}}
 10 |     for neighbor in graph.successors(node):
 11 |         nested_dict[node].update(build_nested_dict(graph, neighbor))
 12 |     return nested_dict
 13 | 
 14 | def get_hierarchy_dict(G):
 15 |     '''
 16 |     reverse Cytopus cell type hierarchy and build nested hierarchy from it
 17 |     G: Cytopus.KnowledgeBase, containing cell type hierarchy
 18 | 
 19 |     '''
 20 |     import networkx as nx
 21 |     #get view of cell type hierarchy
 22 |     node_list_plot = G.filter_nodes(attribute_name='class', attributes = ['cell_type']) 
 23 |     def filter_node(n1):
 24 |                 return n1 in node_list_plot
 25 |         
 26 |     view = nx.subgraph_view(G.graph, filter_node=filter_node)
 27 | 
 28 |     #reverse graph view (going from least granular to most granular cell type)
 29 |     reversed_view = view.reverse(copy=True)
 30 |     root_nodes = [n for n in reversed_view.nodes if reversed_view.in_degree(n) == 0]
 31 | 
 32 |     #build the nested dictionary
 33 |     hierarchy_dict = {}
 34 |     for root in root_nodes:
 35 |         hierarchy_dict.update(build_nested_dict(reversed_view, root))
 36 | 
 37 |     return hierarchy_dict
 38 | 
 39 | def create_hierarchical_graph(data, type_label):
 40 |     import networkx as nx
 41 |     G = nx.DiGraph()
 42 |     for parent, children in data.items():
 43 |         G.add_node(parent)
 44 |         if isinstance(children, dict):
 45 |             child_node = create_hierarchical_graph(children,type_label)
 46 |             G.add_nodes_from(child_node.nodes(data=True))
 47 |             G.add_edges_from([(u,v) for u,v in child_node.edges()])
 48 |             for child in children:
 49 |                 G.add_edge(child, parent)
 50 |         else:
 51 |             for child in children:
 52 |                 G.add_node(child)
 53 |                 G.add_edge(child, parent)
 54 |     nx.set_node_attributes(G, type_label,'type')
 55 |     return G
 56 | 
 57 | def get_all_keys(d):
 58 |     keys = set()
 59 |     for k, v in d.items():
 60 |         keys.add(k)
 61 |         if isinstance(v, dict):
 62 |             keys |= get_all_keys(v)
 63 |     return keys
 64 | 
 65 | def get_nodes_of_type(graph, node_type):
 66 |     nodes = [node for node in graph.nodes() if graph.nodes[node]['type'] == node_type]
 67 |     nodes.sort(key=lambda x: x.split('.'))
 68 |     return nodes
 69 | 
 70 | def get_indices(df, value):
 71 |     return df.index[df.astype(str).apply(lambda x: x == value).any(axis=1)].tolist()
 72 | 
 73 | def get_node_labels(graph, node_type):
 74 |     import networkx as nx
 75 |     nodes = [node for node in nx.dfs_postorder_nodes(graph) if graph.nodes[node]['type'] == node_type]
 76 |     return nodes[::-1]
 77 | 
 78 | 
 79 | class Hierarchy:
 80 |     import networkx as nx
 81 |     def __init__(self, hierarchy_dict):
 82 |         '''
 83 |         load hierarchy class
 84 |         hierarchy_dict: dict, nested dict containing the cell type hierarchy
 85 |         '''
 86 |         self.graph = create_hierarchical_graph(hierarchy_dict,type_label = 'cell_type')
 87 |         print(self.__str__())
 88 |         
 89 |     def __str__(self):
 90 |         all_celltypes = get_nodes_of_type(self.graph, 'cell_type')
 91 |         return f"Hierarchy class containing {len(all_celltypes)} cell types:{all_celltypes}"
 92 | 
 93 | 
 94 |     def plot_celltypes(self, node_color='#8decf5', node_size = 1000,edge_width= 1,arrow_size=20 ,edge_color= 'k',label_size = 10, figsize=[30,30]):
 95 |         '''
 96 |         plot all cell types contained in hierarchy object
 97 |         '''
 98 |         
 99 | 
100 |         #plt.rcParams["figure.figsize"] = figure_size
101 |         #plt.rcParams["figure.autolayout"] = True
102 |         import networkx as nx
103 |         import matplotlib.pyplot as plt
104 |         node_list_plot = get_nodes_of_type(self.graph, 'cell_type')
105 |         def filter_node(n1):
106 |                     return n1 in node_list_plot
107 | 
108 |         view = nx.subgraph_view(self.graph, filter_node=filter_node)
109 | 
110 |         pos=graphviz_layout(view)
111 |         plt.rcParams["figure.figsize"] = figsize
112 |         nodes = nx.draw_networkx_nodes(view, pos=pos,node_color=node_color,nodelist=None,node_size=node_size,label=True)
113 |         edges = nx.draw_networkx_edges(view, pos=pos, edgelist=None, width=edge_width, edge_color=edge_color, style='solid', alpha=None, arrowstyle=None, 
114 |                                         arrowsize=arrow_size, 
115 |                             edge_cmap=None, edge_vmin=None, edge_vmax=None, ax=None, arrows=None, label=None, 
116 |                             node_size=node_size, nodelist=None, node_shape='o', connectionstyle='arc3', 
117 |                             min_source_margin=0, min_target_margin=0)
118 |         labels = nx.draw_networkx_labels(view,pos=pos,font_size=label_size)
119 | 
120 |     def add_cells(self, adata, obs_columns=None):
121 |         '''
122 |         add cells to their most granular annotation in the hierarchy object
123 |         adata: anndata.AnnData, containing the cell type annotations under adata.obs
124 |         obs_columns: list, list of columns in adata.obs where the cell type annotations are stored (recommended)
125 |         '''
126 |         import networkx as nx
127 |         if obs_columns is None:
128 |             adata_sub = adata.obs
129 |         else:
130 |             adata_sub = adata.obs[obs_columns]
131 | 
132 |         celltype_nodes = get_node_labels(self.graph,'cell_type')
133 |         #loop over celltypes to retrieve cells of each celltype
134 |         used_barcodes = set()
135 |         for i in celltype_nodes:
136 |             barcodes = get_indices(adata_sub, i)
137 |             barcodes = [x for x in barcodes if x not in used_barcodes]
138 |             used_barcodes = used_barcodes.union(set(barcodes))
139 |             for x in barcodes:
140 |                 self.graph.add_node(x, type='cell')
141 |                 self.graph.add_edge(i,x)
142 |             
143 |     def query_ancestors(self, query_node, adata=None, obs_key='hierarchical_query'):
144 |         '''
145 |         retrieves all cell barcodes belonging to the cell type and all of its subsets
146 |         query_node: str, cell type name for which to retrieve barcodes
147 |         (query_node must be a node of type 'cell_type')
148 |         adata: anndata.AnnData, adata to store the cell type annotations under adata.obs[obs_key]
149 |         obs_key: str, column label to store cell type annotations under adata.obs[obs_key]
150 |         result: dict, barcodes belonging to each cell type, stored in self.annotations (nothing is returned); if adata is provided the labels are also stored in adata.obs[obs_key]
151 |         '''
152 |         import networkx as nx
153 |         import anndata
154 |         node_type='cell_type'
155 |         if node_type == self.graph.nodes[query_node]['type']:
156 |             nodes_of_specific_type = [node for node in nx.ancestors(self.graph, query_node) if self.graph.nodes[node]['type'] == node_type]
157 |             nodes_of_specific_type.append(query_node)
158 |             cell_nodes = {}
159 |             for node in set(nodes_of_specific_type):
160 |                 cell_edges = [edge for edge in self.graph.edges(node) if self.graph.nodes[edge[1]]['type'] == 'cell']
161 |                 cell_nodes[node] = [edge[1] for edge in cell_edges] 
162 |             cell_nodes_inv = {}
163 |             for k,v in cell_nodes.items():
164 |                 for i in v:
165 |                     cell_nodes_inv[i] = k
166 |             if isinstance(adata,anndata._core.anndata.AnnData):
167 |                 adata.obs[obs_key]= adata.obs_names.map(cell_nodes_inv)
168 |             self.annotations =  cell_nodes
169 |         else:
170 |             print('query_node:',query_node,'should be of type',node_type,'stopping...')
171 |         
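
The nested dictionary consumed by `Hierarchy` has the form `{root: {child: {grandchild: {}}}}`. As a rough illustration of what `get_hierarchy_dict` produces, the sketch below builds the same shape from `(child, parent)` tuples without networkx. `nested_from_edges` is a hypothetical helper, not part of cytopus, and it assumes the edges form a tree without cycles.

```python
def nested_from_edges(edges):
    """Build {root: {child: {...}}} from (child, parent) tuples (hypothetical helper)."""
    children = {}
    has_parent = set()
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
        has_parent.add(child)

    def build(node):
        # merge the nested dicts of all direct children under this node
        sub = {}
        for c in children[node]:
            sub.update(build(c))
        return {node: sub}

    nested = {}
    for node in children:
        if node not in has_parent:  # roots are nodes that never appear as a child
            nested.update(build(node))
    return nested


# Hypothetical cell type names
edges = [("CD8-T", "T"), ("CD4-T", "T"), ("T", "leukocyte"), ("B", "leukocyte")]
hierarchy_dict = nested_from_edges(edges)
print(hierarchy_dict)
```

A dictionary of this shape is what `Hierarchy(hierarchy_dict)` and `flatten_hierarchical_dict` expect as input.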


--------------------------------------------------------------------------------
/cytopus/tl/label.py:
--------------------------------------------------------------------------------
  1 | import csv
  2 | import numpy as np
  3 | import pandas as pd
  4 | def overlap_coefficient(set_a,set_b):
  5 |     '''
  6 |     calculate the overlap coefficient between two sets
  7 |     '''
  8 |     min_len = min([len(set_a),len(set_b)])
  9 |     intersect_len = len(set_a.intersection(set_b))
 10 |     overlap = intersect_len/min_len
 11 |     return overlap
 12 | 
 13 | def label_marker_genes(marker_genes, gs_label_dict, threshold = 0.4):
 14 |     '''
 15 |     label an array of marker genes using a KnowledgeBase or a dictionary derived from the KnowledgeBase
 16 |     returns a dataframe of overlap coefficients for each gene set annotation and marker gene
 17 |     
 18 |     marker_genes: numpy.array or list of lists, factors x marker genes
 19 |     gs_label_dict: cytopus.KnowledgeBase or dict, with gene set names (str) as keys and gene sets (list) as values
 20 |     threshold: float, if the maximum overlap coefficient exceeds threshold, the factor is labeled with the name of
 21 |     the gene set with that maximum overlap coefficient; otherwise it keeps its original index
 22 |     
 23 |     returns: pandas.DataFrame, with overlap coefficients of factors (rows) and gene sets (columns), indices are relabeled 
 24 |     to the gene set with the maximum overlap coefficient
 25 |     '''
 26 |     from ..knowledge_base.kb_queries import KnowledgeBase
 27 | 
 28 |     if isinstance(gs_label_dict,KnowledgeBase):
 29 |         #collapse annotation dict
 30 |         gs_dict = {}
 31 |         key_list = []
 32 |         for key, value in gs_label_dict.celltype_process_dict.items():
 33 |             for k,v in value.items():
 34 |                 if k not in key_list:
 35 |                     gs_dict[k]=v
 36 |                     key_list.append(k)
 37 |     elif isinstance(gs_label_dict, dict):
 38 |         for v in gs_label_dict.values():
 39 |             if isinstance(v,dict):
 40 |                 raise ValueError('gs_label_dict is a nested dictionary. gs_label_dict must be a flat/non-nested dictionary with gene set names as keys (str) and gene sets (lists of strings) as values')
 41 |         gs_dict = gs_label_dict
 42 |     else:
 43 |         raise ValueError('gs_label_dict must be a dictionary or a cytopus.kb.queries.KnowledgeBase object')
 44 | 
 45 |     overlap_df = pd.DataFrame()
 46 |     for i, v in pd.DataFrame(marker_genes).T.items():
 47 |         overlap_temp = []
 48 |         gs_names_temp = []
 49 |         for gs_name, gs in gs_dict.items():
 50 |             gene_set = set(gs)
 51 |             marker_set = set(v)
 52 |             #check for and remove nans
 53 |             if 'nan' in gene_set:
 54 |                 gene_set.remove('nan')
 55 |             if 'nan' in marker_set:
 56 |                 marker_set.remove('nan')
 57 |             if len(gene_set) > 0 and len(marker_set)>0:
 58 |                 overlap_temp.append(overlap_coefficient(set(gene_set),set(marker_set)))
 59 |             else:
 60 |                 overlap_temp.append(np.nan)
 61 |             gs_names_temp.append(gs_name)
 62 |         overlap_df_temp = pd.DataFrame(overlap_temp, columns=[i],index=gs_names_temp).T
 63 |         overlap_df = pd.concat([overlap_df,overlap_df_temp])
 64 |     marker_gene_labels = [] #gene sets
 65 |     for marker_set in overlap_df.index:
 66 |         max_overlap = overlap_df.loc[marker_set].sort_values().index[-1]
 67 |         if overlap_df.loc[marker_set].sort_values().values[-1] >threshold:
 68 |             marker_gene_labels.append(max_overlap)
 69 |         else:
 70 |             marker_gene_labels.append(marker_set)
 71 |     overlap_df.index = marker_gene_labels     
 72 |         
 73 |     return overlap_df
 74 | 
 75 | 
 76 | def get_celltype(adata, celltype_key,factor_list=None,Spectra_cell_scores= 'SPECTRA_cell_scores'):
 77 |     '''
 78 |     For a list of factors check in which cell types they are expressed
 79 |     adata: anndata.AnnData, containing cell type labels in adata.obs[celltype_key]
 80 |     celltype_key: str, key for adata.obs containing the cell type labels
 81 |     factor_list: list, list of keys for factor loadings in .obs, if None use the cell scores in adata.obsm[Spectra_cell_scores]
 82 |     Spectra_cell_scores: str, key for Spectra cell scores in adata.obsm
 83 |     return: dictionary mapping factor names and celltypes
 84 |     '''
 85 |     
 86 |     if factor_list is not None:
 87 |         factors= adata.obs[factor_list].copy()
 88 |         factors['celltype'] = list(adata.obs[celltype_key])
 89 |     else:
 90 |         factors = pd.DataFrame(adata.obsm[Spectra_cell_scores])
 91 |         factors['celltype'] = list(adata.obs[celltype_key])
 92 |     
 93 |     #create factor:celltype dict
 94 |     grouped_df = factors.groupby('celltype').mean()
 95 |     #get factor names for global (expressed in all cells) and cell type spec factors
 96 |     global_factor_names = grouped_df.T[(grouped_df!=0).all()].index
 97 |     specific_factor_names= [x for x in grouped_df.columns if x not in global_factor_names]
 98 |     #add global factors to dict
 99 |     factor_names_global = {x:'global' for x in global_factor_names}
100 | 
101 |     #get celltype for celltype spec factors
102 |     grouped_df_spec = grouped_df[specific_factor_names]
103 | 
104 |     for i in grouped_df_spec.columns:
105 |         factor_names_global[i] = grouped_df_spec[i].sort_values(ascending=False).index[0]
106 |     return factor_names_global
107 |     
108 | 
109 | def get_gmt(gs_dict,save=False,path=None):
110 |     '''
111 |     transform a dictionary into a .gmt file
112 |     gs_dict: dict, gene set dictionary with format {'gene set name':['Gene_a','Gene_b','Gene_c',...]}
113 |     save: bool, if True saves .gmt file to path
114 |     path: str, path to save .gmt file
115 |     '''
116 |     #import numpy as np
117 |     #import pandas as pd
118 |     #retrieve all genes from dict
119 |     genes = []
120 |     for k,v in gs_dict.items():
121 |         genes = genes+v
122 |     genes = list(set(genes))
123 |     
124 |     #pad the lists in gs_dict to equal lengths
125 |     max_length = max(map(len, gs_dict.values()))
126 | 
127 |     #pad copies of the lists so the caller's gs_dict is not modified
128 |     gs_dict = {k: list(v) + [np.nan]*(max_length-len(v))
129 |                for k,v in gs_dict.items()}
130 | 
131 |     #transform into df
132 |     gs_df = pd.DataFrame(gs_dict).T
133 |     
134 |     if save:
135 |         gs_df.to_csv(path,sep='\t',header=False)
136 |         print('saving to:',path) 
137 |     else:
138 |         return gs_df
139 |     
140 | 
141 | def flatten_hierarchical_dict(d, parent_key=None):
142 |     items = []
143 |     for k, v in d.items():
144 |         if parent_key is not None:
145 |             items.append((parent_key, k))
146 |         if isinstance(v, dict):
147 |             items.extend(flatten_hierarchical_dict(v, k))
148 |     return items
149 | 
150 | def hierarchy_to_csv(hierarchy,filename='hierarchy.csv',header_name=['Parent','Child']):
151 |     '''
152 |     get hierarchy from knowledge base and write to .csv
153 |     hierarchy : dict, nested dict containing cell type hierarchy e.g. G.get_celltype_hierarchy()
154 |     filename : str, output file name to write csv to
155 |     header_name : list, header names of the csv
156 |     '''
157 |     flat_list = flatten_hierarchical_dict(hierarchy)
158 |     # Write to CSV
159 |     with open(filename, 'w', newline='') as file:
160 |         writer = csv.writer(file)
161 |         writer.writerow(header_name)
162 |         for parent, child in flat_list:
163 |             writer.writerow([parent, child])
164 | 
165 | def geneset_to_csv(gs_dict, filename='geneset.csv', header_name=['gene_set_name','gene_name']):
166 |     '''
167 |     get gene sets from knowledge base and write to .csv
168 |     gs_dict : dict, gene set dictionary e.g. G.processes
169 |     header_name : list, header names of the .csv file
170 |     filename : str, output file name to write csv to
171 |     '''
172 |     with open(filename, 'w', newline='') as file:
173 |         writer = csv.writer(file)
174 |         writer.writerow(header_name)
175 |         for key, values in gs_dict.items():
176 |             for value in values:
177 |                 writer.writerow([key, value])
178 | 
179 | 
180 | #import networkx as nx
181 | #import pandas as pd
182 | 
183 | def metadata_to_csv(graph, file_name, specific_class = False, class_value=None):
184 |     '''
185 |     get metadata and write to csv
186 |     graph : networkx.DiGraph, graph containing nodes with attributes
187 |     file_name : str, path to write csv to
188 |     specific_class : bool, if True restrict to nodes whose 'class' attribute equals class_value
189 |     class_value : str, class attribute to restrict to
190 |     '''
191 |     # Filter nodes by 'class' attribute value
192 |     if specific_class:
193 |         attributes = {node: data for node, data in graph.nodes(data=True) if data.get('class') == class_value}
194 |     else:
195 |         attributes = {node: data for node, data in graph.nodes(data=True)}
196 | 
197 |     # Create a DataFrame from the filtered dictionary
198 |     df = pd.DataFrame.from_dict(attributes, orient='index')
199 | 
200 |     # Saving to CSV
201 |     df.to_csv(file_name)
202 | 
203 | 
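
The labeling rule in `label_marker_genes` rests on the overlap coefficient, the size of the intersection divided by the size of the smaller set. The standalone sketch below re-implements that formula and the threshold rule on made-up data; the gene names, gene set names, and the fallback label `factor_0` are hypothetical.

```python
def overlap_coefficient(set_a, set_b):
    """len(A & B) / min(len(A), len(B)), the same formula as in label.py."""
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Hypothetical marker genes of one factor and two hypothetical gene sets
markers = {"CD3E", "CD3D", "CD8A", "GZMK"}
gene_sets = {
    "T_identity": {"CD3E", "CD3D", "CD2"},
    "B_identity": {"CD79A", "MS4A1"},
}

scores = {name: overlap_coefficient(markers, gs) for name, gs in gene_sets.items()}
best = max(scores, key=scores.get)

# Relabel only if the best score exceeds the threshold, otherwise keep the
# original factor name (here the placeholder 'factor_0').
threshold = 0.4
label = best if scores[best] > threshold else "factor_0"
print(scores, label)
```

Because the denominator is the smaller set, a short gene set fully contained in a long marker list scores 1.0, which is worth keeping in mind when choosing `threshold`.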


--------------------------------------------------------------------------------
/img/celltype_hierarchy_1.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wallet-maker/cytopus/638dd917cdd047e322213b6e4767bf0b9b78bf80/img/celltype_hierarchy_1.2.png


--------------------------------------------------------------------------------
/img/cytopus_v1.1_stable_graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wallet-maker/cytopus/638dd917cdd047e322213b6e4767bf0b9b78bf80/img/cytopus_v1.1_stable_graph.png


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup, find_packages
 2 | 
 3 | VERSION = '2.0'
 4 | 
 5 | setup(
 6 |     name='cytopus',
 7 |     version=VERSION,
 8 |     author="Thomas Walle",
 9 |     packages=find_packages(),
10 |     #packages=["cytopus"],
11 |     install_requires = [
12 |         #"pandas>1.3",
13 |         #"numpy>1.2",
14 |         "networkx>2.7",
15 |         #"matplotlib>3.4"
16 |         ],
17 |     include_package_data=True,
18 |     package_data={'cytopus': ['data/*.txt','data/*.h5ad']},
19 | )
20 | 


--------------------------------------------------------------------------------