├── .gitignore
├── .ipynb_checkpoints
    ├── README-checkpoint.md
    └── ST_simulation-checkpoint.py
├── README.md
├── ST_simulation.py
├── __init__.py
├── assemble_composition.py
├── assemble_design.py
├── assemble_st.py
├── merge_synthetic_ST.py
├── run_simulation2.sh
└── split_sc.py


/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | __pycache__
3 | 


--------------------------------------------------------------------------------
/.ipynb_checkpoints/README-checkpoint.md:
--------------------------------------------------------------------------------
 1 | ## Simulation of Spatial Transcriptomics spots from single-cell reference
 2 | 
 3 | <!-- Note: This folder provides a collection of scripts we used to simulate data, but the scripts need to be edited to be used on other platforms (they contain hard-coded paths for our HPC). -->
 4 | This repo provides a collection of scripts used to generate simulated spatial transcriptomics data as a mixture of single-cell transcriptomics profiles (adapting code and model from [Andersson et al. 2019](https://www.biorxiv.org/content/10.1101/2019.12.13.874495v1)). Parameters are chosen to simulate the characteristics of 10X Genomics Visium chips.
 5 | <!-- 
 6 | #### Contents
 7 | 
 8 | - `ST_simulation.py` functions to simulate spots
 9 | - `split_sc.py` script to split mouse brain snRNA-seq reference into generation dataset (single cells used to make the synthetic spots) and validation dataset (single cells used as a reference to train the location model)
10 | - `assemble_ST.py` script to generate synthetic ST spots from the generation dataset
11 |  -->
12 | ### Run simulation 
13 | 
14 | Initial input: 
15 | 
16 | - AnnData object of raw counts per single cell (saved as an `h5ad` file)
17 | - A table of cell type annotations per cell that we want to deconvolve (saved as `csv` file)
18 | 
19 | **(Step 1) Split single-cell dataset:** we split the cells in the single-cell dataset into a 'generation' set, which is used to simulate the ST spots, and a 'validation' set, which is used to train the deconvolution models that we want to benchmark. From the command line:
20 | 
21 | ```
22 | python split_sc.py <counts_h5ad> <annotation_csv> --annotation_col annotation_1  --out_dir <output_directory>
23 | ```
24 | 
25 | Output: generation and validation count matrices and cell type annotations are saved as `pickle` files, with a random seed identifying the split. 
26 | 
27 | **(Step 2) Build design matrix:** in this step we define which cell types are (A) present at low or high density and (B) uniformly present in all spots or localized in a few spots (regional). To generate synthetic spots with ~10 cells per spot (as seen with nuclear segmentation on Visium spots), we recommend setting the mean number of cells per spot per cell type below 5.
28 | 
29 | ```
30 | n_spots=100
31 | seed=$(ls labels_generation* | sed 's/.*_//' | sed 's/.p//')
32 | python ST_simulation/assemble_design.py \
33 |     $seed \
34 |   --tot_spots $n_spots --mean_high 3 --mean_low 1 \
35 |   --out_dir <output_directory>
36 | ```
37 | 
38 | Output: `synthetic_ST_seed${seed}_${assemble_id}_design.csv` contains the design used for the simulation:
39 | 
40 | | **Column**  | **Data**                                                                                         |
41 | |-------------|--------------------------------------------------------------------------------------------------|
42 | | uniform     | is the cell type uniformly located across spots (1) or localized in a small subset of spots (0)  |
43 | | density     | is the cell type present in a spot at low density (1) or high density (0)                        |
44 | | nspots      | total number of spots in which the cell type is located                                          |
45 | | mean_ncells | mean number of cells per spot                                                                    |
46 | 
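47 | **(Step 3) Assemble cell type composition per spot:** based on the design matrix, we define the cell type composition of each spot, i.e. how many cells per cell type are in each spot. An assemble ID is used to identify the assembly (we assemble many composition matrices with the same design).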
48 | ```
49 | id=1
50 | python cell2location/pycell2location/ST_simulation/assemble_composition.py \
51 |     $seed \
52 |     --tot_spots $n_spots --assemble_id $id
53 | ```
54 | 
55 | Output: `synthetic_ST_seed${seed}_${assemble_id}_composition.csv` contains the number of cells per cell type in each spot, for benchmarking deconvolution models
56 | 
57 | **(Step 4) Assemble simulated ST spots**
58 | ```
59 | python assemble_st.py ${seed} --assemble_id $id
60 | ```
61 | 
62 | Output:
63 | 
64 | - `synthetic_ST_seed${seed}_${assemble_id}_counts.csv` contains the count matrix for the simulated ST spots
65 | - `synthetic_ST_seed${seed}_${assemble_id}_umis.csv` contains the number of UMIs per cell type in each spot, for benchmarking deconvolution methods that model number of UMIs
66 | 
67 | 
68 | ### Speeding up the simulation process
69 | 
70 | The current implementation is not optimized for speed: it takes ~2 minutes to assemble 100 spots. While I might consider writing a parallel version if needed, at the moment my suggestion for simulating thousands of spots is to assemble the design matrix once (step 2 above), then run steps 3 and 4 many times using the wrapper
71 | ```
72 | run_simulation2.sh <seed> <n_spots> <id> 
73 | ```
74 | then merge in one object
75 | ```
76 | python merge_synthetic_ST.py . $seed
77 | ```
78 | 
79 | 
80 | 


--------------------------------------------------------------------------------
/.ipynb_checkpoints/ST_simulation-checkpoint.py:
--------------------------------------------------------------------------------
  1 | import random
  2 | 
  3 | import numpy as np
  4 | import pandas as pd
  5 | import torch as t
  6 | import torch.distributions as dists
  7 | 
  8 | 
  9 | ## --- Version 2: cell densities --- ##
 10 | 
 11 | def assemble_ct_composition(design_df, tot_spots, ncells_scale=5):
 12 |     '''
 13 |     Parameters
 14 |     ----------
 15 |     design_df: pd.DataFrame containing number of spots (nspots) and mean n of 
 16 |         cells per spot (mean_ncells) per cell type
 17 |     tot_spots: int
 18 |         total number of spots to simulate
 19 |     Return
 20 |     ------
 21 |     pd.DataFrame of cell types x spots with no of cells 
 22 |     '''
 23 |     spots_members = pd.DataFrame(columns=range(tot_spots),
 24 |                                  index=design_df.index)
 25 |     ## Cell types to spot
 26 |     for i in range(len(design_df.nspots)):
 27 |         l = ([0] * (tot_spots - int(design_df.nspots[i]))) + ([1] * int(design_df.nspots[i]))
 28 |         l = random.sample(l, k=tot_spots)
 29 |         spots_members.iloc[i] = pd.Series(l)
 30 | 
 31 |     ## No of cells per spot
 32 |     ncells = [
 33 |         np.round(np.random.gamma(design_df.loc[ct,].mean_ncells, ncells_scale, size=int(design_df.loc[ct,].nspots))) for
 34 |         ct in design_df.index]
 35 |     for i in range(spots_members.shape[0]):
 36 |         spots_members.iloc[i, spots_members.columns[spots_members.iloc[i] == 1]] = ncells[i]
 37 |     return (spots_members)
 38 | 
 39 | 
 40 | def assemble_spot_2(cnt, labels, members):
 41 | #     uni_labels = members.index
 42 | #     spot_expr = t.zeros(cnt.shape[1]).type(t.float32)
 43 | #     for z in range(len(uni_labels)):
 44 | #         if members[z] > 0:
 45 | #             idx = np.where(labels == uni_labels[z])[0]
 46 | #             # pick random cells from type
 47 | #             np.random.shuffle(idx)
 48 | #             idx = idx[0:int(members[z])]
 49 | #             # add fraction of transcripts to spot expression
 50 | #             z_expr = t.tensor((cnt.iloc[idx, :] * fraction).sum(axis=0).round().astype(np.float32))
 51 | #             spot_expr += z_expr
 52 |     uni_labels = members.index
 53 |     spot_expr = t.zeros(cnt.shape[1]).type(t.float32)
 54 |     nUMIs = t.zeros((len(uni_labels))).type(t.float32)
 55 |     for z in range(len(uni_labels)):
 56 |         if members[z] > 0:
 57 |             idx = np.where(labels == uni_labels[z])[0]
 58 |             # pick random cells from type
 59 |             np.random.shuffle(idx)
 60 |             idx = idx[0:int(members[z])]
 61 |             # add transcripts to spot expression
 62 |             z_expr = t.tensor((cnt.iloc[idx, :]).sum(axis=0).round().astype(np.float32))
 63 |             spot_expr += z_expr
 64 |             nUMIs[z] = z_expr.sum()
 65 |     return (spot_expr, nUMIs)
 66 | 
 67 | def assemble_st_2(cnt, labels, spots_members):
 68 |     tot_spots = spots_members.shape[1]
 69 |     st_cnt = np.zeros((tot_spots, cnt.shape[1]))
 70 |     st_umis = np.zeros((tot_spots, spots_members.shape[0]))
 71 |     for spot in range(tot_spots):
 72 |         print("making spot no." + str(spot) + "...", flush=True)
 73 |         spot_data = assemble_spot_2(cnt, labels, spots_members.iloc[:, spot])
 74 |         st_cnt[spot, :] = spot_data[0]
 75 |         st_umis[spot, :] = spot_data[1]
 76 |     # convert to pandas DataFrames
 77 |     index = pd.Index(['Spotx' + str(x + 1) for \
 78 |                       x in range(tot_spots)])
 79 |     st_cnt = pd.DataFrame(st_cnt,
 80 |                           index=index,
 81 |                           columns=cnt.columns,
 82 |                           )
 83 |     st_umis = pd.DataFrame(st_umis,
 84 |                            index=index,
 85 |                            columns=spots_members.index,
 86 |                            )
 87 |     return (st_cnt, st_umis)
 88 | 
 89 | 
 90 | # ## --- Version 1: proportions --- ##
 91 | 
 92 | # def pick_cell_types(uni_labels, alpha, min_n_cells):
 93 | #     '''
 94 | #     Pick cell types to include in synthetic spots with proportions from 
 95 | #     Dirichlet distribution.
 96 |     
 97 | #     Parameters
 98 | #     ----------
 99 | #     uni_labels: np.array
100 | #         unique labels
101 | #     alpha: np.array 
102 | #         dirichlet distribution concentration value 
103 | #         (can be from cell type proportions in ST)
104 |         
105 | #     Return
106 | #     ------
107 | #     tuple of picked cell types and proportions
108 |     
109 | #     '''
110 | #     # get number of different
111 | #     # cell types present
112 | #     n_labels = uni_labels.shape[0]
113 | 
114 | #     # sample number of types to be present at current spot
115 | #     # w/o having more types than cells
116 | #     n_types = dists.uniform.Uniform(low=1,
117 | #                                     high=min([n_labels, min_n_cells])).sample()
118 | 
119 | #     n_types = n_types.round().type(t.int)
120 | 
121 | #     # select which types to include
122 | #     pick_types = t.randperm(n_labels)[0:n_types]
123 | #     alpha = t.Tensor(np.array(alpha[pick_types]))
124 | 
125 | #     # select cell type proportions
126 | #     member_props = dists.Dirichlet(concentration=alpha * t.ones(n_types)).sample()
127 | #     return ((pick_types, member_props))
128 | 
129 | 
130 | # def assemble_spot(cnt, labels, n_cells, fraction, pick_types, member_props):
131 | #     '''
132 | #     Generate one synthetic ST spot
133 |     
134 | #     Parameters:
135 | #     -----------
136 | #     cnt: pd.DataFrame of single-cell count data --> [n_cells x n_genes] <--
137 | #     labels: pd.DataFrame of single-cell annotations [n_cells]
138 | #     n_cells: int number of cells to include in spot
139 | #     fraction: float or np.array 
140 | #         fraction of transcripts from each cell being 
141 | #         observed in ST-spot (gene budgets in model)
142 | #     pick_types: torch.Tensor of cell types to include in spot (output of pick_cell_types)
143 | #     member_props: torch.Tensor of the proportions of different cell types in spots (output of pick_cell_types)
144 |     
145 | #     Returns:
146 | #     --------
147 | #     Dictionary with expression data,
148 | #     proportion values and number of
149 | #     cells from each type at every
150 | #     spot
151 | #     '''
152 | #     # get unique labels found in single cell data
153 | #     uni_labels, uni_counts = np.unique(labels,
154 | #                                        return_counts=True)
155 | #     n_labels = uni_labels.shape[0]
156 | 
157 | #     assert np.all(uni_counts >= 30), "Insufficient number of cells"
158 | 
159 | #     # get no. of members of spot for each cell type
160 | #     members = t.zeros(n_labels).type(t.float)
161 | #     members[pick_types] = (n_cells * member_props).round()
162 | #     # get final proportion of each type
163 | #     props = members / members.sum()
164 | #     # convert members to integers
165 | #     members = members.type(t.int)
166 | #     # generate spot expression data
167 | #     spot_expr = t.zeros(cnt.shape[1]).type(t.float32)
168 | #     nUMIs = t.zeros((len(uni_labels))).type(t.float32)
169 | #     for z in range(len(uni_labels)):
170 | #         if members[z] > 0:
171 | #             idx = np.where(labels == uni_labels[z])[0]
172 | #             # pick random cells from type
173 | #             np.random.shuffle(idx)
174 | #             idx = idx[0:members[z]]
175 | #             # add fraction of transcripts to spot expression
176 | #             z_expr = t.tensor((cnt.iloc[idx, :] * fraction).sum(axis=0).round().astype(np.float32))
177 | #             nUMIs[z] = z_expr.sum()
178 | #             spot_expr += z_expr
179 | #     return {'expr': spot_expr,
180 | #             'proportions': props,
181 | #             'members': members,
182 | #             'umis': nUMIs
183 | #             }
184 | 
185 | 
186 | # def assemble_region(cnt, labels, n_cells_vec, alpha, fraction):
187 | #     '''
188 | #     Assemble ST-spots from a single synthetic region 
189 | #     i.e. with the same proportions of cell types in each spot
190 | 
191 | #     Parameters
192 | #     ----------
193 | #     n_cell_vec: vector of number of cells to mix for each synthetic spot
194 | #     alpha: np.array 
195 | #         dirichlet distribution concentration value 
196 | #         (can be from cell type proportions in ST)
197 | #     fraction: float or np.array 
198 | #         fraction of transcripts from each cell being 
199 | #         observed in ST-spot (gene budgets in model)
200 | #     '''
201 | 
202 | #     n_spots = len(n_cells_vec)
203 | 
204 | #     # get unique labels
205 | #     uni_labels = np.unique(labels.values)
206 | #     n_labels = uni_labels.shape[0]
207 | 
208 | #     # prepare matrices
209 | #     st_cnt = np.zeros((n_spots, cnt.shape[1]))
210 | #     st_prop = np.zeros((n_spots, n_labels))
211 | #     st_memb = np.zeros((n_spots, n_labels))
212 | #     st_umis = np.zeros((n_spots, n_labels))
213 | 
214 | #     # generate one spot at a time
215 | #     pick_types, member_props = pick_cell_types(uni_labels, alpha, min(n_cells_vec))
216 | #     #     np.random.seed(1337)
217 | #     #     t.manual_seed(1337)
218 | #     for spot in range(n_spots):
219 | #         spot_data = assemble_spot(cnt,
220 | #                                   labels,
221 | #                                   n_cells_vec[spot], fraction, pick_types, member_props
222 | #                                   )
223 | 
224 | #         st_cnt[spot, :] = spot_data['expr']
225 | #         st_prop[spot, :] = spot_data['proportions']
226 | #         st_memb[spot, :] = spot_data['members']
227 | #         st_umis[spot, :] = spot_data['umis']
228 | 
229 | #         index = pd.Index(['Spotx' + str(x + 1) for \
230 | #                           x in range(n_spots)])
231 | 
232 | #     # convert to pandas DataFrames
233 | #     st_cnt = pd.DataFrame(st_cnt,
234 | #                           index=index,
235 | #                           columns=cnt.columns,
236 | #                           )
237 | #     st_prop = pd.DataFrame(st_prop,
238 | #                            index=index,
239 | #                            columns=uni_labels,
240 | #                            )
241 | #     st_memb = pd.DataFrame(st_memb,
242 | #                            index=index,
243 | #                            columns=uni_labels,
244 | #                            )
245 | #     st_umis = pd.DataFrame(st_umis,
246 | #                            index=index,
247 | #                            columns=uni_labels,
248 | #                            )
249 | #     return {'counts': st_cnt,
250 | #             'proportions': st_prop,
251 | #             'members': st_memb,
252 | #             'umis': st_umis}
253 | 
254 | 
255 | # def assemble_st(cnt, labels, n_regions, n_cells_tot, alpha, fraction):
256 | #     '''
257 | #     Assemble synthetic ST data from count matrix and predicted 
258 | #     cell type labels for each single-cell. Regions are modelled as groups of spots with
259 | #     the same proportion of cell types (and roughly the same number of cells per spot). 
260 | 
261 | #     Parameters
262 | #     ----------
263 | #     n_spots: int 
264 | #         number of spots to simulate
265 | #     n_regions: int 
266 | #         number of regions in which spots should be divided
267 | #     alpha: np.array 
268 | #         dirichlet distribution concentration value 
269 | #         (can be from cell type proportions in ST)
270 | #     fraction: float or np.array 
271 | #         fraction of transcripts from each cell being 
272 | #         observed in ST-spot (gene budgets in model)
273 |     
274 | #     (if you don't want zonation you can just make as many regions as spots)
275 | #     '''
276 | #     # count total number of spots
277 | #     tot_spots = len(n_cells_tot)
278 | 
279 | #     # get unique labels
280 | #     uni_labels = np.unique(labels.values)
281 | #     n_labels = uni_labels.shape[0]
282 | 
283 | #     # assign spots to regions
284 | #     # avoding to have regions with no spots 
285 | #     if n_regions != tot_spots:
286 | #         region_labels = []
287 | #         while len(np.unique(region_labels)) != n_regions:
288 | #             region_labels = np.array(random.choices(range(n_regions), k=tot_spots))
289 | #     else:
290 | #         region_labels = np.array(range(n_regions))
291 | 
292 | #     # prepare matrices
293 | #     st_cnt = np.zeros((tot_spots, cnt.shape[1]))
294 | #     st_prop = np.zeros((tot_spots, n_labels))
295 | #     st_memb = np.zeros((tot_spots, n_labels))
296 | #     st_umis = np.zeros((tot_spots, n_labels))
297 | #     idx = 0
298 | 
299 | #     # sort number of cells to have ~ same number of cells per spot for each region
300 | #     n_cells_tot.sort()
301 | 
302 | #     # assemble one region at a time
303 | #     for reg in range(n_regions):
304 | #         print("making reg" + str(reg) + "...", flush=True)
305 | #         n_spots_reg = len(region_labels[region_labels == reg])
306 | #         n_cells_vec = n_cells_tot[idx:idx + n_spots_reg]
307 | #         reg_data = assemble_region(cnt, labels, n_cells_vec, alpha, fraction)
308 | 
309 | #         st_cnt[idx:idx + n_spots_reg, :] = reg_data['counts']
310 | #         st_prop[idx:idx + n_spots_reg, :] = reg_data['proportions']
311 | #         st_memb[idx:idx + n_spots_reg, :] = reg_data['members']
312 | #         st_umis[idx:idx + n_spots_reg, :] = reg_data['umis']
313 | #         idx = idx + n_spots_reg
314 | 
315 | #     index = pd.Index(['Spotx' + str(x + 1) for \
316 | #                       x in range(tot_spots)])
317 | #     # convert to pandas DataFrames
318 | #     st_cnt = pd.DataFrame(st_cnt,
319 | #                           index=index,
320 | #                           columns=cnt.columns,
321 | #                           )
322 | 
323 | #     st_prop = pd.DataFrame(st_prop,
324 | #                            index=index,
325 | #                            columns=uni_labels,
326 | #                            )
327 | #     st_memb = pd.DataFrame(st_memb,
328 | #                            index=index,
329 | #                            columns=uni_labels,
330 | #                            )
331 | #     st_umis = pd.DataFrame(st_umis,
332 | #                            index=index,
333 | #                            columns=uni_labels,
334 | #                            )
335 | #     return {'counts': st_cnt,
336 | #             'proportions': st_prop,
337 | #             'members': st_memb,
338 | #             'umis': st_umis,
339 | #             'regions': region_labels}
340 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Simulation of Spatial Transcriptomics spots from single-cell reference
 2 | 
 3 | <!-- Note: This folder provides a collection of scripts we used to simulate data, but the scripts need to be edited to be used on other platforms (they contain hard-coded paths for our HPC). -->
 4 | This repo provides a collection of scripts used to generate simulated spatial transcriptomics data as a mixture of single-cell transcriptomics profiles (adapting code and model from [Andersson et al. 2019](https://www.biorxiv.org/content/10.1101/2019.12.13.874495v1)). Parameters are chosen to simulate the characteristics of 10X Genomics Visium chips.
 5 | <!-- 
 6 | #### Contents
 7 | 
 8 | - `ST_simulation.py` functions to simulate spots
 9 | - `split_sc.py` script to split mouse brain snRNA-seq reference into generation dataset (single cells used to make the synthetic spots) and validation dataset (single cells used as a reference to train the location model)
10 | - `assemble_ST.py` script to generate synthetic ST spots from the generation dataset
11 |  -->
12 | ### Run simulation 
13 | 
14 | Initial input: 
15 | 
16 | - AnnData object of raw counts per single cell (saved as an `h5ad` file)
17 | - A table of cell type annotations per cell that we want to deconvolve (saved as `csv` file)
18 | 
19 | **Step 1: Split single-cell dataset:** we split the cells in the single-cell dataset into a 'generation' set, which is used to simulate the ST spots, and a 'validation' set, which is used to train the deconvolution models that we want to benchmark. From the command line:
20 | 
21 | ```
22 | python split_sc.py <counts_h5ad> <annotation_csv> --annotation_col annotation_1  --out_dir <output_directory>
23 | ```
24 | 
25 | Output: generation and validation count matrices and cell type annotations are saved as `pickle` files, with a random seed identifying the split. 
26 | 
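As a rough illustration of the split described in Step 1, here is a minimal sketch. It is not the code in `split_sc.py`: the function name `split_generation_validation`, the 50/50 fraction, and the per-cell-type stratification are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def split_generation_validation(labels, frac_generation=0.5, seed=0):
    """Split cell IDs into generation and validation sets.

    Hypothetical helper: cells are sampled within each cell type so that
    both sets contain every annotated type (an assumption, not a documented
    property of split_sc.py).
    """
    rng = np.random.default_rng(seed)
    gen_ids = []
    for _, cells in labels.groupby(labels):
        ids = cells.index.to_numpy()
        n_gen = int(round(len(ids) * frac_generation))
        gen_ids.extend(rng.choice(ids, size=n_gen, replace=False))
    gen_ids = pd.Index(gen_ids)
    # Everything that was not picked for generation goes to validation
    val_ids = labels.index.difference(gen_ids)
    return gen_ids, val_ids


# Toy annotation table: 6 cells of type A, 4 cells of type B
labels = pd.Series(["A"] * 6 + ["B"] * 4,
                   index=[f"cell{i}" for i in range(10)])
gen_ids, val_ids = split_generation_validation(labels, seed=1)
```

Passing the same `seed` reproduces the same split, which mirrors the idea of identifying a split by its random seed.
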
27 | **Step 2: Build design matrix:** in this step we define which cell types are (A) present at low or high density and (B) uniformly present in all spots or localized in a few spots (regional). To generate synthetic spots with ~10 cells per spot (as seen with nuclear segmentation on Visium spots), we recommend setting the mean number of cells per spot per cell type below 5.
28 | 
29 | ```
30 | n_spots=100
31 | seed=$(ls labels_generation* | sed 's/.*_//' | sed 's/.p//')
32 | python ST_simulation/assemble_design.py \
33 |     $seed \
34 |   --tot_spots $n_spots --mean_high 3 --mean_low 1 \
35 |   --out_dir <output_directory>
36 | ```
37 | 
38 | Output: `synthetic_ST_seed${seed}_${assemble_id}_design.csv` contains the design used for the simulation:
39 | 
40 | | **Column**  | **Data**                                                                                         |
41 | |-------------|--------------------------------------------------------------------------------------------------|
42 | | uniform     | is the cell type uniformly located across spots (1) or localized in a small subset of spots (0)  |
43 | | density     | is the cell type present in a spot at low density (1) or high density (0)                        |
44 | | nspots      | total number of spots in which the cell type is located                                          |
45 | | mean_ncells | mean number of cells per spot                                                                    |
46 | 
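To make the design table concrete, the sketch below builds a small table with the four columns listed above. It is only an illustration: `make_design`, the random assignment of `uniform` and `density`, and the 80% / 10% spot fractions are assumptions, not the logic of `assemble_design.py`. The 0/1 coding follows the table above.

```python
import numpy as np
import pandas as pd


def make_design(cell_types, tot_spots, mean_high=3, mean_low=1, seed=0):
    """Build a toy design table (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(cell_types)
    uniform = rng.integers(0, 2, size=n)   # 1 = uniform, 0 = regional
    density = rng.integers(0, 2, size=n)   # 1 = low density, 0 = high density
    # Assumed fractions: uniform types occupy 80% of spots, regional types 10%
    nspots = np.where(uniform == 1, int(tot_spots * 0.8), int(tot_spots * 0.1))
    mean_ncells = np.where(density == 1, mean_low, mean_high)
    return pd.DataFrame(
        {"uniform": uniform, "density": density,
         "nspots": nspots, "mean_ncells": mean_ncells},
        index=cell_types,
    )


design = make_design(["Astro", "Neuron", "Oligo"], tot_spots=100)
print(design)
```
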
47 | **Step 3: Assemble cell type composition per spot:** based on the design matrix, we define the cell type composition of each spot i.e. how many cells per cell type
48 | are in each spot. An assemble ID is used to identify the assembly (we assemble many composition matrices with the same design).
49 | ```
50 | id=1
51 | python cell2location/pycell2location/ST_simulation/assemble_composition.py \
52 |     $seed \
53 |     --tot_spots $n_spots --assemble_id $id
54 | ```
55 | 
56 | Output: `synthetic_ST_seed${seed}_${assemble_id}_composition.csv` contains the number of cells per cell type in each spot, for benchmarking deconvolution models.
57 | 
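The composition step can be sketched as follows. This is a simplified stand-in for `assemble_ct_composition` in `ST_simulation.py`, not a copy of it: for each cell type it picks `nspots` spots at random and draws a Poisson number of cells with mean `mean_ncells`. The name `sketch_composition` and the toy design values are assumptions.

```python
import numpy as np
import pandas as pd


def sketch_composition(design_df, tot_spots, seed=0):
    """Return a cell types x spots table of cell counts (illustrative only)."""
    rng = np.random.default_rng(seed)
    comp = pd.DataFrame(0, index=design_df.index, columns=range(tot_spots))
    for ct, row in design_df.iterrows():
        # Spots in which this cell type is present
        spots = rng.choice(tot_spots, size=int(row["nspots"]), replace=False)
        # Number of cells of this type in each of those spots
        comp.loc[ct, spots] = rng.poisson(row["mean_ncells"], size=len(spots))
    return comp


# Toy design: one widespread, dense type and one regional, sparse type
design_df = pd.DataFrame({"nspots": [20, 5], "mean_ncells": [3, 1]},
                         index=["Neuron", "Astro"])
comp = sketch_composition(design_df, tot_spots=20)
```

Because a Poisson draw can be zero, a cell type may end up present in fewer than `nspots` spots.
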
58 | **Step 4: Assemble simulated ST spots**
59 | ```
60 | python assemble_st.py ${seed} --assemble_id $id
61 | ```
62 | 
63 | Output:
64 | 
65 | - `synthetic_ST_seed${seed}_${assemble_id}_counts.csv` contains the count matrix for the simulated ST spots
66 | - `synthetic_ST_seed${seed}_${assemble_id}_umis.csv` contains the number of UMIs per cell type in each spot, for benchmarking deconvolution methods that model number of UMIs
67 | 
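The idea behind Step 4 is that a spot's expression is the sum of the counts of the cells assigned to it. The sketch below shows this for one spot using NumPy and pandas only; `sketch_spot` is a hypothetical name, and unlike `assemble_spot_2` it does not use torch and caps the request at the number of available cells.

```python
import numpy as np
import pandas as pd


def sketch_spot(cnt, labels, members, rng):
    """Sum the counts of randomly picked cells for one spot (illustrative only).

    cnt: cells x genes count table; labels: cell type per cell;
    members: number of cells per cell type in this spot.
    """
    spot = np.zeros(cnt.shape[1])
    umis = pd.Series(0.0, index=members.index)
    for ct, n in members.items():
        if n > 0:
            idx = np.flatnonzero(labels.to_numpy() == ct)
            picked = rng.choice(idx, size=min(int(n), len(idx)), replace=False)
            expr = cnt.iloc[picked].sum(axis=0).to_numpy()
            spot += expr
            umis[ct] = expr.sum()   # UMIs contributed by this cell type
    return pd.Series(spot, index=cnt.columns), umis


# Toy data: 6 cells x 3 genes, every count equal to 1
cnt = pd.DataFrame(np.ones((6, 3)), columns=["g1", "g2", "g3"])
labels = pd.Series(["A", "A", "A", "B", "B", "B"])
members = pd.Series({"A": 2, "B": 0})
spot, umis = sketch_spot(cnt, labels, members, np.random.default_rng(0))
```

With these toy counts the spot contains two type-A cells, so each gene has a count of 2 and all UMIs are attributed to type A.
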
68 | 
69 | ### Speeding up the simulation
70 | 
71 | The current implementation is not optimized for speed: it takes ~2 minutes to assemble 100 spots. At the moment my suggestion for simulating thousands of spots is to assemble the design matrix once (step 2 above), then run steps 3 and 4 many times using the wrapper
72 | ```
73 | run_simulation2.sh <seed> <n_spots> <id> 
74 | ```
75 | then merge in one object
76 | ```
77 | python merge_synthetic_ST.py . $seed
78 | ```
79 | 
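Since every run names its spots `Spotx1`, `Spotx2`, ... (see `assemble_st_2`), merging runs needs unique spot names. The sketch below shows one way to do this with `pd.concat`; it is an assumption about how a merge could work, not the content of `merge_synthetic_ST.py`, and `merge_runs` and the `_run<id>` suffix are hypothetical.

```python
import pandas as pd


def merge_runs(tables):
    """Stack per-run spot tables, suffixing spot names with the run id.

    tables: dict mapping assemble id -> DataFrame of spots x features.
    """
    parts = []
    for run_id, df in tables.items():
        part = df.copy()
        part.index = [f"{spot}_run{run_id}" for spot in df.index]
        parts.append(part)
    return pd.concat(parts, axis=0)


# Two toy runs with identical spot names
run1 = pd.DataFrame({"g1": [1, 2], "g2": [0, 3]}, index=["Spotx1", "Spotx2"])
run2 = pd.DataFrame({"g1": [4, 0], "g2": [1, 1]}, index=["Spotx1", "Spotx2"])
merged = merge_runs({1: run1, 2: run2})
```
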
80 | 
81 | 


--------------------------------------------------------------------------------
/ST_simulation.py:
--------------------------------------------------------------------------------
  1 | import random
  2 | 
  3 | import numpy as np
  4 | import pandas as pd
  5 | import torch as t
  6 | 
  7 | 
  8 | ## --- Version 2: cell densities --- ##
  9 | 
 10 | def assemble_ct_composition(design_df, tot_spots, ncells_scale=5):
 11 |     """
 12 |     Parameters
 13 |     ----------
 14 |     design_df: pd.DataFrame containing number of spots (nspots) and mean n of
 15 |         cells per spot (mean_ncells) per cell type
 16 |     tot_spots: int
 17 |         total number of spots to simulate
 18 |     Return
 19 |     ------
 20 |     pd.DataFrame of cell types x spots with no of cells
 21 |     """
 22 |     spots_members = pd.DataFrame(columns=range(tot_spots),
 23 |                                  index=design_df.index)
 24 |     ## Cell types to spot
 25 |     for i in range(len(design_df.nspots)):
 26 |         l = ([0] * (tot_spots - int(design_df.nspots.iloc[i]))) + ([1] * int(design_df.nspots.iloc[i]))
 27 |         l = random.sample(l, k=tot_spots)
 28 |         spots_members.iloc[i] = pd.Series(l)
 29 | 
 30 |     ## No of cells per spot
 31 |     ncells = [
 32 |         #np.round(np.random.gamma(design_df.loc[ct, ].mean_ncells, ncells_scale,
 33 |         #                         size=int(design_df.loc[ct, ].nspots)))
 34 |         np.random.poisson(design_df.loc[ct, ].mean_ncells,
 35 |                           size=int(design_df.loc[ct, ].nspots))
 36 |         for ct in design_df.index]
 37 |     for i in range(spots_members.shape[0]):
 38 |         spots_members.iloc[i, spots_members.columns[spots_members.iloc[i] == 1]] = ncells[i]
 39 |     return spots_members
 40 | 
 41 | 
 42 | def assemble_spot_2(cnt, labels, members):
 43 |     """
 44 |     Assemble one synthetic spot by summing the counts of randomly picked cells.
 45 |     Parameters
 46 |     ----------
 47 |     cnt: pd.DataFrame of single-cell count data [n_cells x n_genes]
 48 |     labels: cell type annotation of each single cell [n_cells]
 49 |     members: pd.Series with the number of cells per cell type in this spot
 50 |     Return
 51 |     ------
 52 |     tuple of torch tensors: (spot expression [n_genes], UMIs per cell type)
 53 |     """
 54 |     uni_labels = members.index
 55 |     spot_expr = t.zeros(cnt.shape[1]).type(t.float32)
 56 |     nUMIs = t.zeros((len(uni_labels))).type(t.float32)
 57 |     for z in range(len(uni_labels)):
 58 |         if members.iloc[z] > 0:
 59 |             idx = np.where(labels == uni_labels[z])[0]
 60 |             # pick random cells from type
 61 |             np.random.shuffle(idx)
 62 |             idx = idx[0:int(members.iloc[z])]
 63 |             # add transcripts to spot expression
 64 |             z_expr = t.tensor((cnt.iloc[idx, :]).sum(axis=0).round().astype(np.float32))
 65 |             spot_expr += z_expr
 66 |             nUMIs[z] = z_expr.sum()
 67 |     return (spot_expr, nUMIs)
 68 | 
 69 | 
 70 | def assemble_st_2(cnt, labels, spots_members):
 71 |     tot_spots = spots_members.shape[1]
 72 |     st_cnt = np.zeros((tot_spots, cnt.shape[1]))
 73 |     st_umis = np.zeros((tot_spots, spots_members.shape[0]))
 74 |     for spot in range(tot_spots):
 75 |         print("making spot no." + str(spot) + "...", flush=True)
 76 |         spot_data = assemble_spot_2(cnt, labels, spots_members.iloc[:, spot])
 77 |         st_cnt[spot, :] = spot_data[0]
 78 |         st_umis[spot, :] = spot_data[1]
 79 |     # convert to pandas DataFrames
 80 |     index = pd.Index(['Spotx' + str(x + 1) for \
 81 |                       x in range(tot_spots)])
 82 |     st_cnt = pd.DataFrame(st_cnt,
 83 |                           index=index,
 84 |                           columns=cnt.columns,
 85 |                           )
 86 |     st_umis = pd.DataFrame(st_umis,
 87 |                            index=index,
 88 |                            columns=spots_members.index,
 89 |                            )
 90 |     return (st_cnt, st_umis)
 91 | 
 92 | # ## --- Version 1: proportions --- ##
 93 | 
 94 | # def pick_cell_types(uni_labels, alpha, min_n_cells):
 95 | #     '''
 96 | #     Pick cell types to include in synthetic spots with proportions from 
 97 | #     Dirichlet distribution.
 98 | 
 99 | #     Parameters
100 | #     ----------
101 | #     uni_labels: np.array
102 | #         unique labels
103 | #     alpha: np.array 
104 | #         dirichlet distribution concentration value 
105 | #         (can be from cell type proportions in ST)
106 | 
107 | #     Return
108 | #     ------
109 | #     tuple of picked cell types and proportions
110 | 
111 | #     '''
112 | #     # get number of different
113 | #     # cell types present
114 | #     n_labels = uni_labels.shape[0]
115 | 
116 | #     # sample number of types to be present at current spot
117 | #     # w/o having more types than cells
118 | #     n_types = dists.uniform.Uniform(low=1,
119 | #                                     high=min([n_labels, min_n_cells])).sample()
120 | 
121 | #     n_types = n_types.round().type(t.int)
122 | 
123 | #     # select which types to include
124 | #     pick_types = t.randperm(n_labels)[0:n_types]
125 | #     alpha = t.Tensor(np.array(alpha[pick_types]))
126 | 
127 | #     # select cell type proportions
128 | #     member_props = dists.Dirichlet(concentration=alpha * t.ones(n_types)).sample()
129 | #     return ((pick_types, member_props))
130 | 
131 | 
132 | # def assemble_spot(cnt, labels, n_cells, fraction, pick_types, member_props):
133 | #     '''
134 | #     Generate one synthetic ST spot
135 | 
136 | #     Parameters:
137 | #     -----------
138 | #     cnt: pd.DataFrame of single-cell count data --> [n_cells x n_genes] <--
139 | #     labels: pd.DataFrame of single-cell annotations [n_cells]
140 | #     n_cells: int number of cells to include in spot
141 | #     fraction: float or np.array 
142 | #         fraction of transcripts from each cell being 
143 | #         observed in ST-spot (gene budgets in model)
144 | #     pick_types: torch.Tensor of cell types to include in spot (output of pick_cell_types)
145 | #     member_props: torch.Tensor of the proportions of different cell types in spots (output of pick_cell_types)
146 | 
147 | #     Returns:
148 | #     --------
149 | #     Dictionary with expression data,
150 | #     proportion values and number of
151 | #     cells from each type at every
152 | #     spot
153 | #     '''
154 | #     # get unique labels found in single cell data
155 | #     uni_labels, uni_counts = np.unique(labels,
156 | #                                        return_counts=True)
157 | #     n_labels = uni_labels.shape[0]
158 | 
159 | #     assert np.all(uni_counts >= 30), "Insufficient number of cells"
160 | 
161 | #     # get no. of members of spot for each cell type
162 | #     members = t.zeros(n_labels).type(t.float)
163 | #     members[pick_types] = (n_cells * member_props).round()
164 | #     # get final proportion of each type
165 | #     props = members / members.sum()
166 | #     # convert members to integers
167 | #     members = members.type(t.int)
168 | #     # generate spot expression data
169 | #     spot_expr = t.zeros(cnt.shape[1]).type(t.float32)
170 | #     nUMIs = t.zeros((len(uni_labels))).type(t.float32)
171 | #     for z in range(len(uni_labels)):
172 | #         if members[z] > 0:
173 | #             idx = np.where(labels == uni_labels[z])[0]
174 | #             # pick random cells from type
175 | #             np.random.shuffle(idx)
176 | #             idx = idx[0:members[z]]
177 | #             # add fraction of transcripts to spot expression
178 | #             z_expr = t.tensor((cnt.iloc[idx, :] * fraction).sum(axis=0).round().astype(np.float32))
179 | #             nUMIs[z] = z_expr.sum()
180 | #             spot_expr += z_expr
181 | #     return {'expr': spot_expr,
182 | #             'proportions': props,
183 | #             'members': members,
184 | #             'umis': nUMIs
185 | #             }
186 | 
187 | 
188 | # def assemble_region(cnt, labels, n_cells_vec, alpha, fraction):
189 | #     '''
190 | #     Assemble ST-spots from a single synthetic region 
191 | #     i.e. with the same proportions of cell types in each spot
192 | 
193 | #     Parameters
194 | #     ----------
195 | #     n_cells_vec: vector of number of cells to mix for each synthetic spot
196 | #     alpha: np.array 
197 | #         dirichlet distribution concentration value 
198 | #         (can be from cell type proportions in ST)
199 | #     fraction: float or np.array 
200 | #         fraction of transcripts from each cell being 
201 | #         observed in ST-spot (gene budgets in model)
202 | #     '''
203 | 
204 | #     n_spots = len(n_cells_vec)
205 | 
206 | #     # get unique labels
207 | #     uni_labels = np.unique(labels.values)
208 | #     n_labels = uni_labels.shape[0]
209 | 
210 | #     # prepare matrices
211 | #     st_cnt = np.zeros((n_spots, cnt.shape[1]))
212 | #     st_prop = np.zeros((n_spots, n_labels))
213 | #     st_memb = np.zeros((n_spots, n_labels))
214 | #     st_umis = np.zeros((n_spots, n_labels))
215 | 
216 | #     # generate one spot at a time
217 | #     pick_types, member_props = pick_cell_types(uni_labels, alpha, min(n_cells_vec))
218 | #     #     np.random.seed(1337)
219 | #     #     t.manual_seed(1337)
220 | #     for spot in range(n_spots):
221 | #         spot_data = assemble_spot(cnt,
222 | #                                   labels,
223 | #                                   n_cells_vec[spot], fraction, pick_types, member_props
224 | #                                   )
225 | 
226 | #         st_cnt[spot, :] = spot_data['expr']
227 | #         st_prop[spot, :] = spot_data['proportions']
228 | #         st_memb[spot, :] = spot_data['members']
229 | #         st_umis[spot, :] = spot_data['umis']
230 | 
231 | #         index = pd.Index(['Spotx' + str(x + 1) for \
232 | #                           x in range(n_spots)])
233 | 
234 | #     # convert to pandas DataFrames
235 | #     st_cnt = pd.DataFrame(st_cnt,
236 | #                           index=index,
237 | #                           columns=cnt.columns,
238 | #                           )
239 | #     st_prop = pd.DataFrame(st_prop,
240 | #                            index=index,
241 | #                            columns=uni_labels,
242 | #                            )
243 | #     st_memb = pd.DataFrame(st_memb,
244 | #                            index=index,
245 | #                            columns=uni_labels,
246 | #                            )
247 | #     st_umis = pd.DataFrame(st_umis,
248 | #                            index=index,
249 | #                            columns=uni_labels,
250 | #                            )
251 | #     return {'counts': st_cnt,
252 | #             'proportions': st_prop,
253 | #             'members': st_memb,
254 | #             'umis': st_umis}
255 | 
256 | 
257 | # def assemble_st(cnt, labels, n_regions, n_cells_tot, alpha, fraction):
258 | #     '''
259 | #     Assemble synthetic ST data from count matrix and predicted 
260 | #     cell type labels for each single-cell. Regions are modelled as groups of spots with
261 | #     the same proportion of cell types (and roughly the same number of cells per spot). 
262 | 
263 | #     Parameters
264 | #     ----------
265 | #     n_cells_tot: list or np.array
266 | #         number of cells to mix in each spot (its length sets the number of spots)
267 | #     n_regions: int 
268 | #         number of regions in which spots should be divided
269 | #     alpha: np.array 
270 | #         dirichlet distribution concentration value 
271 | #         (can be from cell type proportions in ST)
272 | #     fraction: float or np.array 
273 | #         fraction of transcripts from each cell being 
274 | #         observed in ST-spot (gene budgets in model)
275 | 
276 | #     (if you don't want zonation you can just make as many regions as spots)
277 | #     '''
278 | #     # count total number of spots
279 | #     tot_spots = len(n_cells_tot)
280 | 
281 | #     # get unique labels
282 | #     uni_labels = np.unique(labels.values)
283 | #     n_labels = uni_labels.shape[0]
284 | 
285 | #     # assign spots to regions
286 | #     # avoiding regions with no spots
287 | #     if n_regions != tot_spots:
288 | #         region_labels = []
289 | #         while len(np.unique(region_labels)) != n_regions:
290 | #             region_labels = np.array(random.choices(range(n_regions), k=tot_spots))
291 | #     else:
292 | #         region_labels = np.array(range(n_regions))
293 | 
294 | #     # prepare matrices
295 | #     st_cnt = np.zeros((tot_spots, cnt.shape[1]))
296 | #     st_prop = np.zeros((tot_spots, n_labels))
297 | #     st_memb = np.zeros((tot_spots, n_labels))
298 | #     st_umis = np.zeros((tot_spots, n_labels))
299 | #     idx = 0
300 | 
301 | #     # sort number of cells to have ~ same number of cells per spot for each region
302 | #     n_cells_tot.sort()
303 | 
304 | #     # assemble one region at a time
305 | #     for reg in range(n_regions):
306 | #         print("making reg" + str(reg) + "...", flush=True)
307 | #         n_spots_reg = len(region_labels[region_labels == reg])
308 | #         n_cells_vec = n_cells_tot[idx:idx + n_spots_reg]
309 | #         reg_data = assemble_region(cnt, labels, n_cells_vec, alpha, fraction)
310 | 
311 | #         st_cnt[idx:idx + n_spots_reg, :] = reg_data['counts']
312 | #         st_prop[idx:idx + n_spots_reg, :] = reg_data['proportions']
313 | #         st_memb[idx:idx + n_spots_reg, :] = reg_data['members']
314 | #         st_umis[idx:idx + n_spots_reg, :] = reg_data['umis']
315 | #         idx = idx + n_spots_reg
316 | 
317 | #     index = pd.Index(['Spotx' + str(x + 1) for \
318 | #                       x in range(tot_spots)])
319 | #     # convert to pandas DataFrames
320 | #     st_cnt = pd.DataFrame(st_cnt,
321 | #                           index=index,
322 | #                           columns=cnt.columns,
323 | #                           )
324 | 
325 | #     st_prop = pd.DataFrame(st_prop,
326 | #                            index=index,
327 | #                            columns=uni_labels,
328 | #                            )
329 | #     st_memb = pd.DataFrame(st_memb,
330 | #                            index=index,
331 | #                            columns=uni_labels,
332 | #                            )
333 | #     st_umis = pd.DataFrame(st_umis,
334 | #                            index=index,
335 | #                            columns=uni_labels,
336 | #                            )
337 | #     return {'counts': st_cnt,
338 | #             'proportions': st_prop,
339 | #             'members': st_memb,
340 | #             'umis': st_umis,
341 | #             'regions': region_labels}
342 | 


--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
1 | from . import ST_simulation
2 | 
3 | __all__ = [
4 |     "ST_simulation",
5 | ]
6 | 


--------------------------------------------------------------------------------
/assemble_composition.py:
--------------------------------------------------------------------------------
 1 | ### Make ST datasets from single-cell data
 2 | import argparse
 3 | import pickle
 4 | import pandas as pd
 5 | from ST_simulation import *
 6 | 
 7 | parser = argparse.ArgumentParser()
 8 | # parser.add_argument('lbl_gen_file', type=str,
 9 | #                     help='path to label generation pickle file')
10 | # parser.add_argument('cnt_gen_file', type=str,
11 | #                     help='path to count generation pickle file')
12 | # parser.add_argument('design_csv', type=str,
13 | #                     help='path to design csv file')
14 | parser.add_argument('seed', type=int,
15 |                     help='random seed of split')
16 | parser.add_argument('--tot_spots', dest='tot_spots', type=int,
17 |                     default=1000,
18 |                     help='Total number of spots to simulate')
19 | parser.add_argument('--out_dir', dest='out_dir', type=str,
20 |                     default='/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/',
21 |                     help='Output directory')
22 | parser.add_argument('--assemble_id', dest='assemble_id', type=int,
23 |                     default=1,
24 |                     help='ID of ST assembly')
25 | parser.add_argument('--annotation_col', dest='anno_col', type=str,
26 |                     default="annotation_1",
27 |                     help='Name of column to use in annotation file (default: annotation_1)')
28 | 
29 | args = parser.parse_args()
30 | 
31 | # lbl_gen_file = args.lbl_gen_file
32 | # count_gen_file = args.cnt_gen_file
33 | # design_file = args.design_csv
34 | seed = args.seed
35 | tot_spots = args.tot_spots
36 | out_dir = args.out_dir
37 | assemble_id = args.assemble_id
38 | anno_col = args.anno_col
39 | 
40 | ### Load input data ### 
41 | lbl_gen_file = out_dir + "labels_generation_" + str(seed) + ".p"
42 | count_gen_file = out_dir + "counts_generation_" + str(seed) + ".p"
43 | design_file = out_dir + "synthetic_ST_seed" + str(seed) + "_design.csv"
44 | 
45 | lbl_generation = pickle.load(open(lbl_gen_file, "rb"))
46 | cnt_generation = pickle.load(open(count_gen_file, "rb"))
47 | 
48 | uni_labels = lbl_generation[anno_col].unique()
49 | labels = lbl_generation
50 | cnt = cnt_generation
51 | 
52 | design_df = pd.read_csv(design_file, index_col=0)
53 | design_df = abs(design_df) ## For some reason some mean_ncells are set to -0.0 
54 | 
55 | # ### GENERATE GENE-SPECIFIC SCALING FACTOR ###
56 | 
57 | # gene_level_alpha = np.random.gamma(5, 5)
58 | # gene_level_beta = np.random.gamma(1, 5)
59 | # gene_level = np.random.gamma(gene_level_alpha, gene_level_beta, size=cnt.shape[1])
60 | 
61 | # # scale from 0 to 1 (to coincide to fractions)
62 | # gene_level_scaled = (gene_level - min(gene_level)) / (max(gene_level) - min(gene_level))
63 | 
64 | ### Assemble cell type composition
65 | spots_members = assemble_ct_composition(design_df, tot_spots, ncells_scale=1)
66 | 
67 | # st_cnt_df = assemble_st_2(cnt, labels, spots_members, gene_level_scaled)
68 | 
69 | ### SAVE OUTPUTS ###
70 | 
71 | synthetic_st = {"composition": spots_members}
72 | 
73 | for k, v in synthetic_st.items():
74 |     out_name = out_dir + "synthetic_ST_seed" + str(seed) + "_" + str(
75 |         assemble_id) + "_" + k + ".csv"
76 |     v.to_csv(out_name, sep=",", index=True, header=True)
77 | 


--------------------------------------------------------------------------------
/assemble_design.py:
--------------------------------------------------------------------------------
  1 | ### Make design of simulated ST datasets from single-cell data
  2 | import argparse
  3 | import pickle
  4 | import numpy as np 
  5 | import pandas as pd
  6 | 
  7 | parser = argparse.ArgumentParser()
  8 | # parser.add_argument('lbl_gen_file', type=str,
  9 | #                     help='path to label generation pickle file')
 10 | # parser.add_argument('cnt_gen_file', type=str,
 11 | #                     help='path to count generation pickle file')
 12 | parser.add_argument('seed', type=int,
 13 |                     help='random seed of split')
 14 | parser.add_argument('--tot_spots', dest='tot_spots', type=int,
 15 |                     default=1000,
 16 |                     help='Total number of spots to simulate')
 17 | parser.add_argument('--mean_high', dest='mean_high', type=float,
 18 |                     default=2.5,
 19 |                     help='Mean cell density for high-density cell types')
 20 | parser.add_argument('--mean_low', dest='mean_low', type=float,
 21 |                     default=0.8,
 22 |                     help='Mean cell density for low-density cell types')
 23 | parser.add_argument('--percent_uniform', dest='percent_uniform', type=float,
 24 |                     default=80,
 25 |                     help='Sparsity of uniform cell types (%% non-zero spots of total spots)')
 26 | parser.add_argument('--percent_sparse', dest='percent_sparse', type=float,
 27 |                     default=10,
 28 |                     help='Sparsity of sparse cell types (%% non-zero spots of total spots)')
 29 | parser.add_argument('--annotation_col', dest='anno_col', type=str,
 30 |                     default="annotation_1",
 31 |                     help='Name of column to use in annotation file (default: annotation_1)')
 32 | parser.add_argument('--out_dir', dest='out_dir', type=str,
 33 |                     default='/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/',
 34 |                     help='Output directory')
 35 | parser.add_argument('--assemble_id', dest='assemble_id', type=int,
 36 |                     default=1,
 37 |                     help='ID of ST assembly')
 38 | 
 39 | args = parser.parse_args()
 40 | 
 41 | # lbl_gen_file = args.lbl_gen_file
 42 | # count_gen_file = args.cnt_gen_file
 43 | seed = args.seed
 44 | tot_spots = args.tot_spots
 45 | mean_high = args.mean_high
 46 | mean_low = args.mean_low
 47 | percent_uniform = args.percent_uniform
 48 | percent_sparse = args.percent_sparse
 49 | out_dir = args.out_dir
 50 | assemble_id = args.assemble_id
 51 | anno_col = args.anno_col
 52 | 
 53 | ### Load input data ### 
 54 | lbl_gen_file = out_dir + "labels_generation_" + str(seed) + ".p"
 55 | count_gen_file = out_dir + "counts_generation_" + str(seed) + ".p"
 56 | 
 57 | lbl_generation = pickle.load(open(lbl_gen_file, "rb"))
 58 | cnt_generation = pickle.load(open(count_gen_file, "rb"))
 59 | 
 60 | uni_labels = lbl_generation[anno_col].unique()
 61 | labels = lbl_generation
 62 | cnt = cnt_generation
 63 | 
 64 | ### Define uniform VS sparse cell types (w more sparse = 0)
 65 | uniform_ct = np.random.choice([0, 1], size=len(uni_labels), p=[0.8, 0.2])
 66 | 
 67 | #### Define low VS high density cell types (w more low density = 1)
 68 | uni_low = np.random.choice([0, 1], size=len(uni_labels[uniform_ct == 1]),  p=[0.2, 0.8])
 69 | reg_low = np.random.choice([0, 1], size=len(uni_labels[uniform_ct == 0]),  p=[0.2, 0.8])
 70 | 
 71 | design_df = pd.DataFrame({'uniform': uniform_ct}, index=uni_labels)
 72 | 
 73 | design_df['density'] = np.nan
 74 | design_df.loc[design_df.index[design_df.uniform == 1], 'density'] = uni_low
 75 | design_df.loc[design_df.index[design_df.uniform == 0], 'density'] = reg_low
 76 | 
 77 | ### Generate no of spots per cell type 
 78 | # Defaults: uniform ~ 80% of spots, sparse ~ 10% of spots (set by --percent_uniform / --percent_sparse)
 79 | mean_unif = round((tot_spots / 100) * percent_uniform)
 80 | mean_sparse = round((tot_spots / 100) * percent_sparse)
 81 | sigma_unif = np.sqrt(mean_unif / 0.3)
 82 | sigma_sparse = np.sqrt(mean_sparse / 0.3)
 83 | 
 84 | shape_unif = mean_unif ** 2 / sigma_unif ** 2
 85 | scale_unif = sigma_unif ** 2 / mean_unif
 86 | shape_sparse = mean_sparse ** 2 / sigma_sparse ** 2
 87 | scale_sparse = sigma_sparse ** 2 / mean_sparse
 88 | 
 89 | unif_nspots = np.round(np.random.gamma(shape=shape_unif, scale=scale_unif, size=sum(design_df.uniform == 1)))
 90 | sparse_nspots = np.round(np.random.gamma(shape=shape_sparse, scale=scale_sparse, size=sum(design_df.uniform == 0)))
 91 | # if the sampled number of spots is greater than the total number of spots, trim to the total
 92 | if (unif_nspots > tot_spots).sum() >= 1:
 93 |     unif_nspots[unif_nspots > tot_spots] = tot_spots
 94 | if (sparse_nspots > tot_spots).sum() >= 1:
 95 |     sparse_nspots[sparse_nspots > tot_spots] = tot_spots
 96 | 
 97 | 
 98 | design_df['nspots'] = np.nan
 99 | design_df.loc[design_df.index[design_df.uniform == 1], 'nspots'] = unif_nspots
100 | design_df.loc[design_df.index[design_df.uniform == 0], 'nspots'] = sparse_nspots
101 | 
102 | ### Generate avg density per spot per cell type
103 | sigma_low = np.sqrt(mean_low / 2)
104 | sigma_high = np.sqrt(mean_high / 2)
105 | 
106 | shape_low = mean_low ** 2 / sigma_low ** 2
107 | scale_low = sigma_low ** 2 / mean_low
108 | shape_high = mean_high ** 2 / sigma_high ** 2
109 | scale_high = sigma_high ** 2 / mean_high
110 | 
111 | low_ncells_mean = np.random.gamma(shape=shape_low, scale=scale_low, size=sum(design_df.density == 1))
112 | high_ncells_mean = np.random.gamma(shape=shape_high, scale=scale_high, size=sum(design_df.density == 0))
113 | 
114 | design_df['mean_ncells'] = np.nan
115 | design_df.loc[design_df.index[design_df.density == 1], 'mean_ncells'] = low_ncells_mean
116 | design_df.loc[design_df.index[design_df.density == 0], 'mean_ncells'] = high_ncells_mean
117 | 
118 | out_name = out_dir + "synthetic_ST_seed" + str(seed) + "_design.csv"
119 | design_df.to_csv(out_name, sep=",", index=True, header=True)
120 | 
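The gamma draws in `assemble_design.py` are parameterised by moment matching: for a target mean `m` and variance `v`, `shape = m**2 / v` and `scale = v / m`, so that `shape * scale = m` and `shape * scale**2 = v`. Below is a minimal sketch of that conversion. The helper name `gamma_from_mean_var` and the seeded generator are assumptions for illustration and are not part of this repo.

```python
import numpy as np


def gamma_from_mean_var(mean, var):
    """Return (shape, scale) of a gamma distribution with the given mean and variance.

    Hypothetical helper for illustration; assemble_design.py computes the
    same two quantities inline.
    """
    if mean <= 0 or var <= 0:
        raise ValueError("mean and var must be positive")
    shape = mean ** 2 / var
    scale = var / mean
    return shape, scale


# Same choice as the script's default for low-density cell types:
# mean_low = 0.8 with variance mean_low / 2.
mean_low = 0.8
shape_low, scale_low = gamma_from_mean_var(mean_low, mean_low / 2)

# Seeded generator (an assumption; the script uses the global np.random state).
rng = np.random.default_rng(0)
draws = rng.gamma(shape=shape_low, scale=scale_low, size=100_000)
print(shape_low, scale_low, draws.mean(), draws.var())
```

With a variance of `mean / 2`, the scale is always 0.5 and the shape is twice the mean, so the two density classes differ only in shape.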


--------------------------------------------------------------------------------
/assemble_st.py:
--------------------------------------------------------------------------------
 1 | ### Make ST datasets from single-cell data
 2 | import argparse
 3 | import pickle
 4 | import pandas as pd
 5 | from ST_simulation import *
 6 | parser = argparse.ArgumentParser()
 7 | parser.add_argument('seed', type=int,
 8 |                     help='random seed of split')
 9 | parser.add_argument('--out_dir', dest='out_dir', type=str,
10 |                     default='/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/',
11 |                     help='Output directory')
12 | parser.add_argument('--assemble_id', dest='assemble_id', type=int,
13 |                     default=1,
14 |                     help='ID of ST assembly')
15 | parser.add_argument('--annotation_col', dest='anno_col', type=str,
16 |                     default="annotation_1",
17 |                     help='Name of column to use in annotation file (default: annotation_1)')
18 | 
19 | args = parser.parse_args()
20 | 
21 | # lbl_gen_file = args.lbl_gen_file
22 | # count_gen_file = args.cnt_gen_file
23 | # spots_members_file = args.spots_comp
24 | seed = args.seed
25 | out_dir = args.out_dir
26 | assemble_id = args.assemble_id
27 | # usecols = args.usecols
28 | anno_col = args.anno_col
29 | 
30 | ### Load input data ### 
31 | lbl_gen_file = out_dir + "labels_generation_" + str(seed) + ".p"
32 | count_gen_file = out_dir + "counts_generation_" + str(seed) + ".p"
33 | spots_members_file = out_dir + "synthetic_ST_seed" + str(seed) + "_" + str(assemble_id) + "_composition.csv"
34 | 
35 | lbl_generation = pickle.load(open(lbl_gen_file, "rb"))
36 | cnt_generation = pickle.load(open(count_gen_file, "rb"))
37 | spots_members = pd.read_csv(spots_members_file, index_col=0)
38 | 
39 | tot_spots = spots_members.shape[1]
40 | uni_labels = lbl_generation[anno_col].unique()
41 | labels = lbl_generation
42 | cnt = cnt_generation
43 | 
44 | # ### GENERATE GENE-SPECIFIC SCALING FACTOR ###
45 | 
46 | # gene_level_alpha = np.random.gamma(5, 5)
47 | # gene_level_beta = np.random.gamma(1, 5)
48 | # gene_level = np.random.gamma(gene_level_alpha, gene_level_beta, size=cnt.shape[1])
49 | 
50 | # # scale from 0 to 1 (to coincide to fractions)
51 | # gene_level_scaled = (gene_level - min(gene_level)) / (max(gene_level) - min(gene_level))
52 | 
53 | 
54 | st_cnt_df,st_umis_df = assemble_st_2(cnt, labels, spots_members)
55 | 
56 | ### SAVE OUTPUTS ###
57 | 
58 | synthetic_st = {"counts": st_cnt_df, "umis":st_umis_df}
59 | # out_name = spots_members_file.rstrip(".csv") + "_counts.csv"
60 | # st_cnt_df.to_csv(out_name, sep=",", index=True, header=True)
61 | 
62 | for k, v in synthetic_st.items():
63 |     out_name = out_dir + "synthetic_ST_seed" + str(seed) + "_" + str(
64 |         assemble_id) + "_" + k + ".csv"
65 |     v.to_csv(out_name, sep=",", index=True, header=True)
66 | 
67 | 


--------------------------------------------------------------------------------
/merge_synthetic_ST.py:
--------------------------------------------------------------------------------
 1 | ### Merge synthetic ST data ###
 2 | import argparse
 3 | import os
 4 | 
 5 | import pandas as pd
 6 | 
 7 | parser = argparse.ArgumentParser()
 8 | parser.add_argument('outdir', type=str,
 9 |                     help='path to synthetic ST directory')
10 | parser.add_argument('seed', type=int,
11 |                     help='random seed for generation')
12 | args = parser.parse_args()
13 | 
14 | outdir = args.outdir
15 | seed = args.seed
16 | 
17 | 
18 | def read_synthetic_ST(outdir, seed, id):
19 |     cnt_file = 'synthetic_ST_seed{0}_{1}_counts.csv'.format(seed, id)
20 |     comp_file = 'synthetic_ST_seed{0}_{1}_composition.csv'.format(seed, id)
21 |     cnt_df = pd.read_csv(os.path.join(outdir, cnt_file), index_col=0)
22 |     comp_df = pd.read_csv(os.path.join(outdir, comp_file), index_col=0)
23 |     return (cnt_df, comp_df)
24 | 
25 | 
26 | design_file = [x for x in os.listdir(outdir) if 'design.csv' in x and str(seed) in x][0]
27 | seed_files = [x for x in os.listdir(outdir) if str(seed) in x and x.endswith("counts.csv")]
28 | ids = [x.split("_")[3] for x in seed_files]
29 | 
30 | cnt_df, comp_df = read_synthetic_ST(outdir, seed, ids[0])
31 | for id in ids[1:]:
32 |     cnt_df1, comp_df1 = read_synthetic_ST(outdir, seed, id)
33 |     if cnt_df1.shape[1] == cnt_df.shape[1]:
34 |         cnt_df = pd.concat([cnt_df, cnt_df1])
35 |     if comp_df1.shape[0] == comp_df.shape[0]:
36 |         comp_df = pd.concat([comp_df, comp_df1], axis=1)
37 | 
38 | cnt_df.reset_index(drop=True, inplace=True)
39 | cnt_df.index = ["Spotx" + str(x) for x in cnt_df.index]
40 | comp_df.columns = ["Spotx" + str(x) for x in range(comp_df.shape[1])]
41 | 
42 | cnt_df.to_csv(os.path.join(outdir, design_file.split("design")[0] + "counts.csv"))
43 | comp_df.to_csv(os.path.join(outdir, design_file.split("design")[0] + "composition.csv"))
44 | 
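The substring test `str(seed) in x` used above also matches other seeds that contain the same digits (seed 12 matches `seed123`), and a previously merged `..._counts.csv` file would be picked up on a rerun. A minimal sketch of a stricter lookup follows, assuming per-assembly files are named `synthetic_ST_seed<seed>_<id>_counts.csv` as written by `assemble_st.py`. The helper `find_assembly_ids` is hypothetical and not part of this repo.

```python
import re


def find_assembly_ids(filenames, seed):
    """Return the assembly IDs (as strings) found for one seed, in numeric order.

    Hypothetical helper: only names of the exact form
    synthetic_ST_seed<seed>_<id>_counts.csv are accepted.
    """
    pattern = re.compile(r"synthetic_ST_seed{}_(\d+)_counts\.csv".format(int(seed)))
    ids = []
    for name in filenames:
        match = pattern.fullmatch(name)
        if match:
            ids.append(match.group(1))
    return sorted(ids, key=int)


example_files = [
    "synthetic_ST_seed12_1_counts.csv",
    "synthetic_ST_seed12_10_counts.csv",
    "synthetic_ST_seed12_2_counts.csv",
    "synthetic_ST_seed123_1_counts.csv",    # different seed, skipped
    "synthetic_ST_seed12_counts.csv",       # merged output, skipped
    "synthetic_ST_seed12_design.csv",
    "synthetic_ST_seed12_1_composition.csv",
]
example_ids = find_assembly_ids(example_files, seed=12)
print(example_ids)
```

In the script, `os.listdir(outdir)` would take the place of `example_files`.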


--------------------------------------------------------------------------------
/run_simulation2.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | seed=$1
4 | n_spots=$2
5 | id=$3
6 | 
7 | python assemble_composition.py "${seed}" --tot_spots "${n_spots}" --assemble_id "${id}"
8 | python assemble_st.py "${seed}" --assemble_id "${id}"
9 | 


--------------------------------------------------------------------------------
/split_sc.py:
--------------------------------------------------------------------------------
  1 | ### SPLIT SINGLE-CELL DATASET IN GENERATION AND VALIDATION SET ###
  2 | 
  3 | import argparse
  4 | import pickle
  5 | import random
  6 | import anndata
  7 | import scanpy as sc
  8 | import numpy as np
  9 | import pandas as pd
 10 | 
 11 | parser = argparse.ArgumentParser()
 12 | parser.add_argument('h5ad_file', type=str,
 13 |                     help='path to h5ad file with raw single-cell data')
 14 | parser.add_argument('annotation_file', type=str,
 15 |                     help='path to csv file with cell annotations')
 16 | parser.add_argument('--annotation_col', dest='anno_col', type=str,
 17 |                     default="annotation_1",
 18 |                     help='Name of column to use in annotation file (default: annotation_1)')
 19 | parser.add_argument('--out_dir', dest='out_dir', type=str,
 20 |                     default="/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/",
 21 |                     help='Output directory')
 22 | args = parser.parse_args()
 23 | 
 24 | # adata_file = "/nfs/team205/vk7/sanger_projects/cell2location/notebooks/data/mouse_viseum_snrna/rawdata/all_cells_20200625.h5ad"
 25 | # annotation_file = "/nfs/team205/vk7/sanger_projects/cell2location/notebooks/results/mouse_viseum_snrna/snRNA_annotation_20200229.csv"
 26 | # anno_col = "annotation_1"
 27 | # out_dir = "/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/"
 28 | 
 29 | adata_file = args.h5ad_file
 30 | annotation_file = args.annotation_file
 31 | anno_col = args.anno_col
 32 | out_dir = args.out_dir
 33 | 
 34 | ### Load input single-cell data and annotations ###
 35 | 
 36 | adata_raw = sc.read_h5ad(adata_file)
 37 | 
 38 | ## Cell type annotations
 39 | labels = pd.read_csv(annotation_file, index_col=0)
 40 | 
 41 | # # Select genes used for clustering
 42 | # adata_raw = adata_raw[:, [x for x in adata_raw.var_names if x in adata_snrna.var_names]]
 43 | 
 44 | adata_df = pd.DataFrame(adata_raw.X.T.toarray(), columns=adata_raw.obs_names, index=adata_raw.var_names)
 45 | adata_df = adata_df.T
 46 | adata_df.index.name = "cell"
 47 | 
 48 | ### Subset to cells with label ###
 49 | adata_df = adata_df.loc[labels.index, :]
 50 | 
 51 | ### Split generation and validation set ###
 52 | 
 53 | sc_cnt = adata_df
 54 | sc_lbl = pd.DataFrame(labels[anno_col])
 55 | 
 56 | # match count and label data
 57 | inter = sc_cnt.index.intersection(sc_lbl.index)
 58 | 
 59 | sc_lbl = sc_lbl.loc[inter, :]
 60 | sc_cnt = sc_cnt.loc[inter, :]
 61 | 
 62 | labels = sc_lbl.iloc[:, 0].values
 63 | 
 64 | # get unique labels
 65 | uni_labs, uni_counts = np.unique(labels, return_counts=True)
 66 | 
 67 | # only keep types with more than 40 cells
 68 | keep_types = uni_counts > 40
 69 | keep_cells = np.isin(labels, uni_labs[keep_types])
 70 | 
 71 | labels = labels[keep_cells]
 72 | sc_cnt = sc_cnt.iloc[keep_cells, :]
 73 | sc_lbl = sc_lbl.iloc[keep_cells, :]
 74 | 
 75 | uni_labs, uni_counts = np.unique(labels, return_counts=True)
 76 | n_types = uni_labs.shape[0]
 77 | 
 78 | seeds = random.sample(range(1000), 3)
 79 | 
 80 | for seed in seeds:
 81 |     random.seed(seed)
 82 |     print("Seed " + str(seed))
 83 |     # get member indices for each set
 84 |     idx_generation = []
 85 |     idx_validation = []
 86 |     for z in range(n_types):
 87 |         tmp_idx = np.where(labels == uni_labs[z])[0]
 88 |         n_generation = int(round(tmp_idx.shape[0] / 2))
 89 |         smp_gen = random.sample(list(tmp_idx), k=n_generation)
 90 |         smp_val = tmp_idx[np.isin(tmp_idx, smp_gen, invert=True)]
 91 |         idx_generation += smp_gen
 92 |         idx_validation += smp_val.tolist()
 93 |     idx_generation.sort()
 94 |     idx_validation.sort()
 95 |     # make sure no members overlap between sets
 96 |     assert len(set(idx_generation).intersection(set(idx_validation))) == 0, \
 97 |         "validation and generation set are not disjoint"
 98 |     # assemble sets from indices
 99 |     cnt_validation = sc_cnt.iloc[idx_validation, :]
100 |     cnt_generation = sc_cnt.iloc[idx_generation, :]
101 |     lbl_validation = sc_lbl.iloc[idx_validation, :]
102 |     lbl_generation = sc_lbl.iloc[idx_generation, :]
103 |     pickle.dump(lbl_generation,
104 |                 open(out_dir + "labels_generation_" + str(seed) + ".p", "wb"))
105 |     pickle.dump(cnt_generation,
106 |                 open(out_dir + "counts_generation_" + str(seed) + ".p", "wb"))
107 |     pickle.dump(lbl_validation,
108 |                 open(out_dir + "labels_validation_" + str(seed) + ".p", "wb"))
109 |     pickle.dump(cnt_validation,
110 |                 open(out_dir + "counts_validation_" + str(seed) + ".p", "wb"))
111 | 
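The split above is stratified: within each cell type, half of the cells go to the generation set and the rest to the validation set, so both sets keep every retained label. The sketch below shows the same idea on toy labels. The helper `stratified_half_split` and the use of `numpy.random.default_rng` are assumptions for illustration; the script itself uses the `random` module inline.

```python
import numpy as np


def stratified_half_split(labels, seed):
    """Split cell indices into two disjoint sets, separately within each label.

    Hypothetical helper for illustration, not part of this repo.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx_generation, idx_validation = [], []
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        shuffled = rng.permutation(idx)
        # same rounding rule as split_sc.py
        n_generation = int(round(idx.shape[0] / 2))
        idx_generation.extend(shuffled[:n_generation].tolist())
        idx_validation.extend(shuffled[n_generation:].tolist())
    return sorted(idx_generation), sorted(idx_validation)


toy_labels = ["A"] * 6 + ["B"] * 5 + ["C"] * 4
gen_idx, val_idx = stratified_half_split(toy_labels, seed=0)
print(len(gen_idx), len(val_idx))
```

Python's `round` rounds halves to the nearest even integer, so a type with 5 cells contributes 2 cells to the generation set and 3 to the validation set.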


--------------------------------------------------------------------------------