├── .gitignore ├── .ipynb_checkpoints ├── README-checkpoint.md └── ST_simulation-checkpoint.py ├── README.md ├── ST_simulation.py ├── __init__.py ├── assemble_composition.py ├── assemble_design.py ├── assemble_st.py ├── merge_synthetic_ST.py ├── run_simulation2.sh └── split_sc.py /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | __pycache__ 3 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/README-checkpoint.md: -------------------------------------------------------------------------------- 1 | ## Simulation of Spatial Transcriptomics spots from single-cell reference 2 | 3 | 4 | This repo provides a collection of scripts used to generate simulated spatial transcriptomics data as a mixture of single-cell transcriptomics profiles (adapting code and model from [Andersson et al. 2019](https://www.biorxiv.org/content/10.1101/2019.12.13.874495v1)). Parameters are chosen to simulate the characteristics of 10X Genomics Visium chips. 5 | 12 | ### Run simulation 13 | 14 | Initial input: 15 | 16 | - AnnData object of raw counts per single cell (saved as `h5ad` file) 17 | - A table of cell type annotations per cell that we want to deconvolve (saved as `csv` file) 18 | 19 | **(Step 1) Split single-cell dataset:** we split the cells in the single-cell dataset into a 'generation' set, which will be used to simulate the ST spots, and a 'validation' set, which will be used to train the deconvolution models that we want to benchmark. From the command line: 20 | 21 | ``` 22 | python split_sc.py --annotation_col annotation_1 --out_dir 23 | ``` 24 | 25 | Output: generation and validation count matrices and cell type annotations are saved as `pickle` files, with a random seed identifying the split. 
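
The split described in Step 1 can be sketched as follows. This is a minimal illustration and not the code in `split_sc.py`: the helper name `split_indices`, the 50/50 fraction, and splitting within each cell type are all assumptions made for the example.

```python
import numpy as np


def split_indices(labels, seed, frac_generation=0.5):
    """Return (generation_idx, validation_idx), split within each cell type.

    Hypothetical helper for illustration; not the code in split_sc.py.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    gen_idx = []
    for cell_type in np.unique(labels):
        # positions of all cells annotated with this cell type
        idx = np.where(labels == cell_type)[0]
        rng.shuffle(idx)
        n_gen = int(round(len(idx) * frac_generation))
        gen_idx.extend(idx[:n_gen])
    gen_idx = np.sort(np.array(gen_idx, dtype=int))
    # every cell not used for generation goes to the validation set
    val_idx = np.setdiff1d(np.arange(len(labels)), gen_idx)
    return gen_idx, val_idx


# Toy annotation vector: 4 cells of each of two made-up cell types
labels = ["T cell"] * 4 + ["B cell"] * 4
gen_idx, val_idx = split_indices(labels, seed=0)
```

The two index arrays never overlap, so cells used to build synthetic spots are not seen by the deconvolution model during training. With a pandas count matrix they could be applied as `counts.iloc[gen_idx]` and `counts.iloc[val_idx]`.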
26 | 27 | **(Step 2) Build design matrix**: in this step we define which cell types are (A) low/high density and (B) uniformly present in all the spots or localized in a few spots (regional). To generate synthetic spots with ~10 cells per spot (as seen with nuclear segmentation on Visium spots) we recommend setting the mean number of cells per spot per cell type below 5. 28 | 29 | ``` 30 | n_spots=100 31 | seed=$(ls labels_generation* | sed 's/.*_//' | sed 's/.p//') 32 | python ST_simulation/assemble_design.py \ 33 | $seed \ 34 | --tot_spots $n_spots --mean_high 3 --mean_low 1 \ 35 | --out_dir 36 | ``` 37 | 38 | Output: `synthetic_ST_seed${seed}_${assemble_id}_design.csv` contains the design used for the simulation: 39 | 40 | | **Column** | **Data** | 41 | |-------------|--------------------------------------------------------------------------------------------------| 42 | | uniform | is the cell type uniformly located across spots (1) or localized in a small subset of spots (0) | 43 | | density | is the cell type present in a spot at low density (1) or high density (0) | 44 | | nspots | total number of spots in which the cell type is located | 45 | | mean_ncells | mean number of cells per spot | 46 | 47 | **(Step 3) Assemble cell type composition per spot:** based on the design matrix, we define the cell type composition of each spot, i.e. how many cells per cell type are in each spot. An assemble ID is used to identify the assembly (we assemble many composition matrices with the same design). 48 | ``` 49 | id=1 50 | python cell2location/pycell2location/ST_simulation/assemble_composition.py \ 51 | $seed \ 52 | --tot_spots $n_spots --assemble_id $id 53 | ``` 54 | 55 | Output: `synthetic_ST_seed${seed}_${assemble_id}_composition.csv` contains the number of cells per cell type in each spot, for benchmarking deconvolution models. 56 | 57 | **(Step 4) Assemble simulated ST spots** 58 | ``` 59 | python assemble_st.py ${seed} --assemble_id $id 60 | ``` 61 | 62 | Output: 63 | 64 | - `synthetic_ST_seed${seed}_${assemble_id}_counts.csv` contains the count matrix for the simulated ST spots 65 | - 
`synthetic_ST_seed${seed}_${assemble_id}_umis.csv` contains the number of UMIs per cell type in each spot, for benchmarking deconvolution methods that model the number of UMIs 66 | 67 | 68 | ### Speeding up the simulation process 69 | 70 | The current implementation is not optimized for speed: it takes ~2 minutes to assemble 100 spots. While I might consider writing a parallel version if needed, at the moment my suggestion to simulate thousands of spots is to assemble the design matrix once (step 2 above), then run steps 3 and 4 many times using the wrapper 71 | ``` 72 | run_simulation2.sh 73 | ``` 74 | then merge in one object 75 | ``` 76 | python merge_synthetic_ST.py . $seed 77 | ``` 78 | 79 | 80 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/ST_simulation-checkpoint.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import pandas as pd 5 | import torch as t 6 | import torch.distributions as dists 7 | 8 | 9 | ## --- Version 2: cell densities --- ## 10 | 11 | def assemble_ct_composition(design_df, tot_spots, ncells_scale=5): 12 | ''' 13 | Parameters 14 | ---------- 15 | design_df: pd.DataFrame containing number of spots (nspots) and mean n of 16 | cells per spot (mean_ncells) per cell type 17 | tot_spots: int 18 | total number of spots to simulate 19 | Return 20 | ------ 21 | pd.DataFrame of cell types x spots with no of cells 22 | ''' 23 | spots_members = pd.DataFrame(columns=range(tot_spots), 24 | index=design_df.index) 25 | ## Cell types to spot 26 | for i in range(len(design_df.nspots)): 27 | l = ([0] * (tot_spots - int(design_df.nspots[i]))) + ([1] * int(design_df.nspots[i])) 28 | l = random.sample(l, k=tot_spots) 29 | spots_members.iloc[i] = pd.Series(l) 30 | 31 | ## No of cells per spot 32 | ncells = [ 33 | np.round(np.random.gamma(design_df.loc[ct,].mean_ncells, ncells_scale, size=int(design_df.loc[ct,].nspots))) for 34 | ct 
in design_df.index] 35 | for i in range(spots_members.shape[0]): 36 | spots_members.iloc[i, spots_members.columns[spots_members.iloc[i] == 1]] = ncells[i] 37 | return (spots_members) 38 | 39 | 40 | def assemble_spot_2(cnt, labels, members): 41 | # uni_labels = members.index 42 | # spot_expr = t.zeros(cnt.shape[1]).type(t.float32) 43 | # for z in range(len(uni_labels)): 44 | # if members[z] > 0: 45 | # idx = np.where(labels == uni_labels[z])[0] 46 | # # pick random cells from type 47 | # np.random.shuffle(idx) 48 | # idx = idx[0:int(members[z])] 49 | # # add fraction of transcripts to spot expression 50 | # z_expr = t.tensor((cnt.iloc[idx, :] * fraction).sum(axis=0).round().astype(np.float32)) 51 | # spot_expr += z_expr 52 | uni_labels = members.index 53 | spot_expr = t.zeros(cnt.shape[1]).type(t.float32) 54 | nUMIs = t.zeros((len(uni_labels))).type(t.float32) 55 | for z in range(len(uni_labels)): 56 | if members[z] > 0: 57 | idx = np.where(labels == uni_labels[z])[0] 58 | # pick random cells from type 59 | np.random.shuffle(idx) 60 | idx = idx[0:int(members[z])] 61 | # add transcripts to spot expression 62 | z_expr = t.tensor((cnt.iloc[idx, :]).sum(axis=0).round().astype(np.float32)) 63 | spot_expr += z_expr 64 | nUMIs[z] = z_expr.sum() 65 | return (spot_expr, nUMIs) 66 | 67 | def assemble_st_2(cnt, labels, spots_members): 68 | tot_spots = spots_members.shape[1] 69 | st_cnt = np.zeros((tot_spots, cnt.shape[1])) 70 | st_umis = np.zeros((tot_spots, spots_members.shape[0])) 71 | for spot in range(tot_spots): 72 | print("making spot no." 
+ str(spot) + "...", flush=True) 73 | spot_data = assemble_spot_2(cnt, labels, spots_members.iloc[:, spot]) 74 | st_cnt[spot, :] = spot_data[0] 75 | st_umis[spot, :] = spot_data[1] 76 | # convert to pandas DataFrames 77 | index = pd.Index(['Spotx' + str(x + 1) for \ 78 | x in range(tot_spots)]) 79 | st_cnt = pd.DataFrame(st_cnt, 80 | index=index, 81 | columns=cnt.columns, 82 | ) 83 | st_umis = pd.DataFrame(st_umis, 84 | index=index, 85 | columns=spots_members.index, 86 | ) 87 | return (st_cnt, st_umis) 88 | 89 | 90 | # ## --- Version 1: proportions --- ## 91 | 92 | # def pick_cell_types(uni_labels, alpha, min_n_cells): 93 | # ''' 94 | # Pick cell types to include in synthetic spots with proportions from 95 | # Dirichlet distribution. 96 | 97 | # Parameters 98 | # ---------- 99 | # uni_labels: np.array 100 | # unique labels 101 | # alpha: np.array 102 | # dirichlet distribution concentration value 103 | # (can be from cell type proportions in ST) 104 | 105 | # Return 106 | # ------ 107 | # tuple of picked cell types and proportions 108 | 109 | # ''' 110 | # # get number of different 111 | # # cell types present 112 | # n_labels = uni_labels.shape[0] 113 | 114 | # # sample number of types to be present at current spot 115 | # # w/o having more types than cells 116 | # n_types = dists.uniform.Uniform(low=1, 117 | # high=min([n_labels, min_n_cells])).sample() 118 | 119 | # n_types = n_types.round().type(t.int) 120 | 121 | # # select which types to include 122 | # pick_types = t.randperm(n_labels)[0:n_types] 123 | # alpha = t.Tensor(np.array(alpha[pick_types])) 124 | 125 | # # select cell type proportions 126 | # member_props = dists.Dirichlet(concentration=alpha * t.ones(n_types)).sample() 127 | # return ((pick_types, member_props)) 128 | 129 | 130 | # def assemble_spot(cnt, labels, n_cells, fraction, pick_types, member_props): 131 | # ''' 132 | # Generate one synthetic ST spot 133 | 134 | # Parameters: 135 | # ----------- 136 | # cnt: pd.DataFrame of single-cell count 
data --> [n_cells x n_genes] <-- 137 | # labels: pd.DataFrame of single-cell annotations [n_cells] 138 | # n_cells: int number of cells to include in spot 139 | # fraction: float or np.array 140 | # fraction of transcripts from each cell being 141 | # observed in ST-spot (gene budgets in model) 142 | # pick_types: torch.Tensor of cell types to include in spot (output of pick_cell_types) 143 | # member_props: torch.Tensor of the proportions of different cell types in spots (output of pick_cell_types) 144 | 145 | # Returns: 146 | # -------- 147 | # Dictionary with expression data, 148 | # proportion values and number of 149 | # cells from each type at every 150 | # spot 151 | # ''' 152 | # # get unique labels found in single cell data 153 | # uni_labels, uni_counts = np.unique(labels, 154 | # return_counts=True) 155 | # n_labels = uni_labels.shape[0] 156 | 157 | # assert np.all(uni_counts >= 30), "Insufficient number of cells" 158 | 159 | # # get no. of members of spot for each cell type 160 | # members = t.zeros(n_labels).type(t.float) 161 | # members[pick_types] = (n_cells * member_props).round() 162 | # # get final proportion of each type 163 | # props = members / members.sum() 164 | # # convert members to integers 165 | # members = members.type(t.int) 166 | # # generate spot expression data 167 | # spot_expr = t.zeros(cnt.shape[1]).type(t.float32) 168 | # nUMIs = t.zeros((len(uni_labels))).type(t.float32) 169 | # for z in range(len(uni_labels)): 170 | # if members[z] > 0: 171 | # idx = np.where(labels == uni_labels[z])[0] 172 | # # pick random cells from type 173 | # np.random.shuffle(idx) 174 | # idx = idx[0:members[z]] 175 | # # add fraction of transcripts to spot expression 176 | # z_expr = t.tensor((cnt.iloc[idx, :] * fraction).sum(axis=0).round().astype(np.float32)) 177 | # nUMIs[z] = z_expr.sum() 178 | # spot_expr += z_expr 179 | # return {'expr': spot_expr, 180 | # 'proportions': props, 181 | # 'members': members, 182 | # 'umis': nUMIs 183 | # } 184 | 185 
| 186 | # def assemble_region(cnt, labels, n_cells_vec, alpha, fraction): 187 | # ''' 188 | # Assemble ST-spots from a single synthetic region 189 | # i.e. with the same proportions of cell types in each spot 190 | 191 | # Parameters 192 | # ---------- 193 | # n_cell_vec: vector of number of cells to mix for each synthetic spot 194 | # alpha: np.array 195 | # dirichlet distribution concentration value 196 | # (can be from cell type proportions in ST) 197 | # fraction: float or np.array 198 | # fraction of transcripts from each cell being 199 | # observed in ST-spot (gene budgets in model) 200 | # ''' 201 | 202 | # n_spots = len(n_cells_vec) 203 | 204 | # # get unique labels 205 | # uni_labels = np.unique(labels.values) 206 | # n_labels = uni_labels.shape[0] 207 | 208 | # # prepare matrices 209 | # st_cnt = np.zeros((n_spots, cnt.shape[1])) 210 | # st_prop = np.zeros((n_spots, n_labels)) 211 | # st_memb = np.zeros((n_spots, n_labels)) 212 | # st_umis = np.zeros((n_spots, n_labels)) 213 | 214 | # # generate one spot at a time 215 | # pick_types, member_props = pick_cell_types(uni_labels, alpha, min(n_cells_vec)) 216 | # # np.random.seed(1337) 217 | # # t.manual_seed(1337) 218 | # for spot in range(n_spots): 219 | # spot_data = assemble_spot(cnt, 220 | # labels, 221 | # n_cells_vec[spot], fraction, pick_types, member_props 222 | # ) 223 | 224 | # st_cnt[spot, :] = spot_data['expr'] 225 | # st_prop[spot, :] = spot_data['proportions'] 226 | # st_memb[spot, :] = spot_data['members'] 227 | # st_umis[spot, :] = spot_data['umis'] 228 | 229 | # index = pd.Index(['Spotx' + str(x + 1) for \ 230 | # x in range(n_spots)]) 231 | 232 | # # convert to pandas DataFrames 233 | # st_cnt = pd.DataFrame(st_cnt, 234 | # index=index, 235 | # columns=cnt.columns, 236 | # ) 237 | # st_prop = pd.DataFrame(st_prop, 238 | # index=index, 239 | # columns=uni_labels, 240 | # ) 241 | # st_memb = pd.DataFrame(st_memb, 242 | # index=index, 243 | # columns=uni_labels, 244 | # ) 245 | # st_umis = 
pd.DataFrame(st_umis, 246 | # index=index, 247 | # columns=uni_labels, 248 | # ) 249 | # return {'counts': st_cnt, 250 | # 'proportions': st_prop, 251 | # 'members': st_memb, 252 | # 'umis': st_umis} 253 | 254 | 255 | # def assemble_st(cnt, labels, n_regions, n_cells_tot, alpha, fraction): 256 | # ''' 257 | # Assemble synthetic ST data from count matrix and predicted 258 | # cell type labels for each single-cell. Regions are modelled as groups of spots with 259 | # the same proportion of cell types (and roughly the same number of cells per spot). 260 | 261 | # Parameters 262 | # ---------- 263 | # n_spots: int 264 | # number of spots to simulate 265 | # n_regions: int 266 | # number of regions in which spots should be divided 267 | # alpha: np.array 268 | # dirichlet distribution concentration value 269 | # (can be from cell type proportions in ST) 270 | # fraction: float or np.array 271 | # fraction of transcripts from each cell being 272 | # observed in ST-spot (gene budgets in model) 273 | 274 | # (if you don't want zonation you can just make as many regions as spots) 275 | # ''' 276 | # # count total number of spots 277 | # tot_spots = len(n_cells_tot) 278 | 279 | # # get unique labels 280 | # uni_labels = np.unique(labels.values) 281 | # n_labels = uni_labels.shape[0] 282 | 283 | # # assign spots to regions 284 | # # avoding to have regions with no spots 285 | # if n_regions != tot_spots: 286 | # region_labels = [] 287 | # while len(np.unique(region_labels)) != n_regions: 288 | # region_labels = np.array(random.choices(range(n_regions), k=tot_spots)) 289 | # else: 290 | # region_labels = np.array(range(n_regions)) 291 | 292 | # # prepare matrices 293 | # st_cnt = np.zeros((tot_spots, cnt.shape[1])) 294 | # st_prop = np.zeros((tot_spots, n_labels)) 295 | # st_memb = np.zeros((tot_spots, n_labels)) 296 | # st_umis = np.zeros((tot_spots, n_labels)) 297 | # idx = 0 298 | 299 | # # sort number of cells to have ~ same number of cells per spot for each region 300 | # 
n_cells_tot.sort() 301 | 302 | # # assemble one region at a time 303 | # for reg in range(n_regions): 304 | # print("making reg" + str(reg) + "...", flush=True) 305 | # n_spots_reg = len(region_labels[region_labels == reg]) 306 | # n_cells_vec = n_cells_tot[idx:idx + n_spots_reg] 307 | # reg_data = assemble_region(cnt, labels, n_cells_vec, alpha, fraction) 308 | 309 | # st_cnt[idx:idx + n_spots_reg, :] = reg_data['counts'] 310 | # st_prop[idx:idx + n_spots_reg, :] = reg_data['proportions'] 311 | # st_memb[idx:idx + n_spots_reg, :] = reg_data['members'] 312 | # st_umis[idx:idx + n_spots_reg, :] = reg_data['umis'] 313 | # idx = idx + n_spots_reg 314 | 315 | # index = pd.Index(['Spotx' + str(x + 1) for \ 316 | # x in range(tot_spots)]) 317 | # # convert to pandas DataFrames 318 | # st_cnt = pd.DataFrame(st_cnt, 319 | # index=index, 320 | # columns=cnt.columns, 321 | # ) 322 | 323 | # st_prop = pd.DataFrame(st_prop, 324 | # index=index, 325 | # columns=uni_labels, 326 | # ) 327 | # st_memb = pd.DataFrame(st_memb, 328 | # index=index, 329 | # columns=uni_labels, 330 | # ) 331 | # st_umis = pd.DataFrame(st_umis, 332 | # index=index, 333 | # columns=uni_labels, 334 | # ) 335 | # return {'counts': st_cnt, 336 | # 'proportions': st_prop, 337 | # 'members': st_memb, 338 | # 'umis': st_umis, 339 | # 'regions': region_labels} 340 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Simulation of Spatial Transcriptomics spots from single-cell reference 2 | 3 | 4 | This repo provides a collection of scripts used to generate simulated spatial transcriptomics data as a mixture of single-cell transcriptomics profiles (adapting code and model from [Andersson et al. 2019](https://www.biorxiv.org/content/10.1101/2019.12.13.874495v1)). Parameters are chosen to simulate the characteristics of 10X Genomics Visium chips. 
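
As a rough illustration of what "a mixture of single-cell transcriptomics profiles" means here, the sketch below builds one synthetic spot by summing the counts of randomly picked cells, mirroring the logic of `assemble_spot_2` in `ST_simulation.py`. The toy count matrix, the cell type names, and the composition are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-cell count matrix: 6 cells x 3 genes (values are made up)
sc_counts = np.array([
    [5, 0, 1],
    [4, 1, 0],
    [6, 0, 2],
    [0, 7, 3],
    [1, 9, 2],
    [0, 8, 4],
])
cell_types = np.array(["A", "A", "A", "B", "B", "B"])

# Number of cells of each type placed in this spot (hypothetical composition)
n_cells_in_spot = {"A": 2, "B": 1}

spot_counts = np.zeros(sc_counts.shape[1], dtype=sc_counts.dtype)
umis_per_type = {}
for cell_type, n in n_cells_in_spot.items():
    # pick n distinct cells of this type at random
    candidates = np.where(cell_types == cell_type)[0]
    picked = rng.choice(candidates, size=n, replace=False)
    # the spot profile is the sum of the picked cells' counts
    type_counts = sc_counts[picked].sum(axis=0)
    spot_counts += type_counts
    umis_per_type[cell_type] = int(type_counts.sum())
```

`spot_counts` plays the role of one row of the `counts` output and `umis_per_type` of one row of the `umis` output described below.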
5 | 12 | ### Run simulation 13 | 14 | Initial input: 15 | 16 | - AnnData object of raw counts per single cell (saved as `h5ad` file) 17 | - A table of cell type annotations per cell that we want to deconvolve (saved as `csv` file) 18 | 19 | **Step 1: Split single-cell dataset:** we split the cells in the single-cell dataset into a 'generation' set, which will be used to simulate the ST spots, and a 'validation' set, which will be used to train the deconvolution models that we want to benchmark. From the command line: 20 | 21 | ``` 22 | python split_sc.py --annotation_col annotation_1 --out_dir 23 | ``` 24 | 25 | Output: generation and validation count matrices and cell type annotations are saved as `pickle` files, with a random seed identifying the split. 26 | 27 | **Step 2: Build design matrix**: in this step we define which cell types are (A) low/high density and (B) uniformly present in all the spots or localized in a few spots (regional). To generate synthetic spots with ~10 cells per spot (as seen with nuclear segmentation on Visium spots) we recommend setting the mean number of cells per spot per cell type below 5. 
28 | 29 | ``` 30 | n_spots=100 31 | seed=$(ls labels_generation* | sed 's/.*_//' | sed 's/.p//') 32 | python ST_simulation/assemble_design.py \ 33 | $seed \ 34 | --tot_spots $n_spots --mean_high 3 --mean_low 1 \ 35 | --out_dir 36 | ``` 37 | 38 | Output: `synthetic_ST_seed${seed}_${assemble_id}_design.csv` contains the design used for the simulation: 39 | 40 | | **Column** | **Data** | 41 | |-------------|--------------------------------------------------------------------------------------------------| 42 | | uniform | is the cell type uniformly located across spots (1) or localized in a small subset of spots (0) | 43 | | density | is the cell type present in a spot at low density (1) or high density (0) | 44 | | nspots | total number of spots in which the cell type is located | 45 | | mean_ncells | mean number of cells per spot | 46 | 47 | **Step 3: Assemble cell type composition per spot:** based on the design matrix, we define the cell type composition of each spot i.e. how many cells per cell type 48 | are in each spot. An assemble ID is used to identify the assembly (we assemble many composition matrices with the same design). 49 | ``` 50 | id=1 51 | python cell2location/pycell2location/ST_simulation/assemble_composition.py \ 52 | $seed \ 53 | --tot_spots $n_spots --assemble_id $id 54 | ``` 55 | 56 | Output: `synthetic_ST_seed${seed}_${assemble_id}_composition.csv` contains the number of cells per cell type in each spot, for benchmarking deconvolution models. 
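
How a design turns into a composition matrix can be sketched as below, following the Poisson sampling used in `assemble_ct_composition` in `ST_simulation.py`. The function name `sketch_composition` and the two-cell-type design are illustrative assumptions, and plain NumPy arrays stand in for the DataFrames used by the scripts.

```python
import numpy as np


def sketch_composition(nspots, mean_ncells, tot_spots, seed=0):
    """Return a (cell types x spots) array of cell counts.

    Illustrative only: for each cell type, choose `nspots` spots at random
    and draw the number of cells in each of them from a Poisson distribution.
    """
    rng = np.random.default_rng(seed)
    composition = np.zeros((len(nspots), tot_spots), dtype=int)
    for i, (n, mean) in enumerate(zip(nspots, mean_ncells)):
        # spots in which this cell type is allowed to appear
        spots = rng.choice(tot_spots, size=n, replace=False)
        composition[i, spots] = rng.poisson(mean, size=n)
    return composition


# Hypothetical design: one uniform high-density type, one regional low-density type
composition = sketch_composition(nspots=[100, 10], mean_ncells=[3.0, 1.0], tot_spots=100)
```

Because the counts are Poisson draws, a cell type assigned to a spot can still end up with zero cells there. The expected total number of cells per spot is the sum over cell types of `nspots / tot_spots * mean_ncells`, which is 3.1 for this toy design.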
57 | 58 | **Step 4: Assemble simulated ST spots** 59 | ``` 60 | python assemble_st.py ${seed} --assemble_id $id 61 | ``` 62 | 63 | Output: 64 | 65 | - `synthetic_ST_seed${seed}_${assemble_id}_counts.csv` contains the count matrix for the simulated ST spots 66 | - `synthetic_ST_seed${seed}_${assemble_id}_umis.csv` contains the number of UMIs per cell type in each spot, for benchmarking deconvolution methods that model the number of UMIs 67 | 68 | 69 | ### Speeding up the simulation 70 | 71 | The current implementation is not optimized for speed: it takes ~2 minutes to assemble 100 spots. At the moment my suggestion to simulate thousands of spots is to assemble the design matrix once (step 2 above), then run steps 3 and 4 many times using the wrapper 72 | ``` 73 | run_simulation2.sh 74 | ``` 75 | then merge in one object 76 | ``` 77 | python merge_synthetic_ST.py . $seed 78 | ``` 79 | 80 | 81 | -------------------------------------------------------------------------------- /ST_simulation.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import pandas as pd 5 | import torch as t 6 | 7 | 8 | ## --- Version 2: cell densities --- ## 9 | 10 | def assemble_ct_composition(design_df, tot_spots, ncells_scale=5): 11 | """ 12 | Parameters 13 | ---------- 14 | design_df: pd.DataFrame containing number of spots (nspots) and mean n of 15 | cells per spot (mean_ncells) per cell type 16 | tot_spots: int 17 | total number of spots to simulate 18 | Return 19 | ------ 20 | pd.DataFrame of cell types x spots with no of cells 21 | """ 22 | spots_members = pd.DataFrame(columns=range(tot_spots), 23 | index=design_df.index) 24 | ## Cell types to spot 25 | for i in range(len(design_df.nspots)): 26 | l = ([0] * (tot_spots - int(design_df.nspots.iloc[i]))) + ([1] * int(design_df.nspots.iloc[i])) 27 | l = random.sample(l, k=tot_spots) 28 | spots_members.iloc[i] = pd.Series(l) 29 | 30 | ## No of cells per spot 31 | ncells = [ 32
| #np.round(np.random.gamma(design_df.loc[ct, ].mean_ncells, ncells_scale, 33 | # size=int(design_df.loc[ct, ].nspots))) 34 | np.random.poisson(design_df.loc[ct, ].mean_ncells, 35 | size=int(design_df.loc[ct, ].nspots)) 36 | for ct in design_df.index] 37 | for i in range(spots_members.shape[0]): 38 | spots_members.iloc[i, spots_members.columns[spots_members.iloc[i] == 1]] = ncells[i] 39 | return spots_members 40 | 41 | 42 | def assemble_spot_2(cnt, labels, members): 43 | """ 44 | Parameters 45 | ---------- 46 | cnt: pd.DataFrame of single-cell counts (cells x genes) 47 | labels: cell type annotation of each cell in cnt 48 | members: pd.Series with the number of cells per cell type in the spot 49 | (indexed by cell type) 50 | Return 51 | ------ 52 | tuple of (spot expression summed over the picked cells, no of UMIs per cell type) 53 | """ 54 | uni_labels = members.index 55 | spot_expr = t.zeros(cnt.shape[1]).type(t.float32) 56 | nUMIs = t.zeros((len(uni_labels))).type(t.float32) 57 | for z in range(len(uni_labels)): 58 | if members.iloc[z] > 0: 59 | idx = np.where(labels == uni_labels[z])[0] 60 | # pick random cells from type 61 | np.random.shuffle(idx) 62 | idx = idx[0:int(members.iloc[z])] 63 | # add transcripts to spot expression 64 | z_expr = t.tensor((cnt.iloc[idx, :]).sum(axis=0).round().astype(np.float32)) 65 | spot_expr += z_expr 66 | nUMIs[z] = z_expr.sum() 67 | return (spot_expr, nUMIs) 68 | 69 | 70 | def assemble_st_2(cnt, labels, spots_members): 71 | tot_spots = spots_members.shape[1] 72 | st_cnt = np.zeros((tot_spots, cnt.shape[1])) 73 | st_umis = np.zeros((tot_spots, spots_members.shape[0])) 74 | for spot in range(tot_spots): 75 | print("making spot no." 
+ str(spot) + "...", flush=True) 76 | spot_data = assemble_spot_2(cnt, labels, spots_members.iloc[:, spot]) 77 | st_cnt[spot, :] = spot_data[0] 78 | st_umis[spot, :] = spot_data[1] 79 | # convert to pandas DataFrames 80 | index = pd.Index(['Spotx' + str(x + 1) for \ 81 | x in range(tot_spots)]) 82 | st_cnt = pd.DataFrame(st_cnt, 83 | index=index, 84 | columns=cnt.columns, 85 | ) 86 | st_umis = pd.DataFrame(st_umis, 87 | index=index, 88 | columns=spots_members.index, 89 | ) 90 | return (st_cnt, st_umis) 91 | 92 | # ## --- Version 1: proportions --- ## 93 | 94 | # def pick_cell_types(uni_labels, alpha, min_n_cells): 95 | # ''' 96 | # Pick cell types to include in synthetic spots with proportions from 97 | # Dirichlet distribution. 98 | 99 | # Parameters 100 | # ---------- 101 | # uni_labels: np.array 102 | # unique labels 103 | # alpha: np.array 104 | # dirichlet distribution concentration value 105 | # (can be from cell type proportions in ST) 106 | 107 | # Return 108 | # ------ 109 | # tuple of picked cell types and proportions 110 | 111 | # ''' 112 | # # get number of different 113 | # # cell types present 114 | # n_labels = uni_labels.shape[0] 115 | 116 | # # sample number of types to be present at current spot 117 | # # w/o having more types than cells 118 | # n_types = dists.uniform.Uniform(low=1, 119 | # high=min([n_labels, min_n_cells])).sample() 120 | 121 | # n_types = n_types.round().type(t.int) 122 | 123 | # # select which types to include 124 | # pick_types = t.randperm(n_labels)[0:n_types] 125 | # alpha = t.Tensor(np.array(alpha[pick_types])) 126 | 127 | # # select cell type proportions 128 | # member_props = dists.Dirichlet(concentration=alpha * t.ones(n_types)).sample() 129 | # return ((pick_types, member_props)) 130 | 131 | 132 | # def assemble_spot(cnt, labels, n_cells, fraction, pick_types, member_props): 133 | # ''' 134 | # Generate one synthetic ST spot 135 | 136 | # Parameters: 137 | # ----------- 138 | # cnt: pd.DataFrame of single-cell count 
data --> [n_cells x n_genes] <-- 139 | # labels: pd.DataFrame of single-cell annotations [n_cells] 140 | # n_cells: int number of cells to include in spot 141 | # fraction: float or np.array 142 | # fraction of transcripts from each cell being 143 | # observed in ST-spot (gene budgets in model) 144 | # pick_types: torch.Tensor of cell types to include in spot (output of pick_cell_types) 145 | # member_props: torch.Tensor of the proportions of different cell types in spots (output of pick_cell_types) 146 | 147 | # Returns: 148 | # -------- 149 | # Dictionary with expression data, 150 | # proportion values and number of 151 | # cells from each type at every 152 | # spot 153 | # ''' 154 | # # get unique labels found in single cell data 155 | # uni_labels, uni_counts = np.unique(labels, 156 | # return_counts=True) 157 | # n_labels = uni_labels.shape[0] 158 | 159 | # assert np.all(uni_counts >= 30), "Insufficient number of cells" 160 | 161 | # # get no. of members of spot for each cell type 162 | # members = t.zeros(n_labels).type(t.float) 163 | # members[pick_types] = (n_cells * member_props).round() 164 | # # get final proportion of each type 165 | # props = members / members.sum() 166 | # # convert members to integers 167 | # members = members.type(t.int) 168 | # # generate spot expression data 169 | # spot_expr = t.zeros(cnt.shape[1]).type(t.float32) 170 | # nUMIs = t.zeros((len(uni_labels))).type(t.float32) 171 | # for z in range(len(uni_labels)): 172 | # if members[z] > 0: 173 | # idx = np.where(labels == uni_labels[z])[0] 174 | # # pick random cells from type 175 | # np.random.shuffle(idx) 176 | # idx = idx[0:members[z]] 177 | # # add fraction of transcripts to spot expression 178 | # z_expr = t.tensor((cnt.iloc[idx, :] * fraction).sum(axis=0).round().astype(np.float32)) 179 | # nUMIs[z] = z_expr.sum() 180 | # spot_expr += z_expr 181 | # return {'expr': spot_expr, 182 | # 'proportions': props, 183 | # 'members': members, 184 | # 'umis': nUMIs 185 | # } 186 | 187 
| 188 | # def assemble_region(cnt, labels, n_cells_vec, alpha, fraction): 189 | # ''' 190 | # Assemble ST-spots from a single synthetic region 191 | # i.e. with the same proportions of cell types in each spot 192 | 193 | # Parameters 194 | # ---------- 195 | # n_cell_vec: vector of number of cells to mix for each synthetic spot 196 | # alpha: np.array 197 | # dirichlet distribution concentration value 198 | # (can be from cell type proportions in ST) 199 | # fraction: float or np.array 200 | # fraction of transcripts from each cell being 201 | # observed in ST-spot (gene budgets in model) 202 | # ''' 203 | 204 | # n_spots = len(n_cells_vec) 205 | 206 | # # get unique labels 207 | # uni_labels = np.unique(labels.values) 208 | # n_labels = uni_labels.shape[0] 209 | 210 | # # prepare matrices 211 | # st_cnt = np.zeros((n_spots, cnt.shape[1])) 212 | # st_prop = np.zeros((n_spots, n_labels)) 213 | # st_memb = np.zeros((n_spots, n_labels)) 214 | # st_umis = np.zeros((n_spots, n_labels)) 215 | 216 | # # generate one spot at a time 217 | # pick_types, member_props = pick_cell_types(uni_labels, alpha, min(n_cells_vec)) 218 | # # np.random.seed(1337) 219 | # # t.manual_seed(1337) 220 | # for spot in range(n_spots): 221 | # spot_data = assemble_spot(cnt, 222 | # labels, 223 | # n_cells_vec[spot], fraction, pick_types, member_props 224 | # ) 225 | 226 | # st_cnt[spot, :] = spot_data['expr'] 227 | # st_prop[spot, :] = spot_data['proportions'] 228 | # st_memb[spot, :] = spot_data['members'] 229 | # st_umis[spot, :] = spot_data['umis'] 230 | 231 | # index = pd.Index(['Spotx' + str(x + 1) for \ 232 | # x in range(n_spots)]) 233 | 234 | # # convert to pandas DataFrames 235 | # st_cnt = pd.DataFrame(st_cnt, 236 | # index=index, 237 | # columns=cnt.columns, 238 | # ) 239 | # st_prop = pd.DataFrame(st_prop, 240 | # index=index, 241 | # columns=uni_labels, 242 | # ) 243 | # st_memb = pd.DataFrame(st_memb, 244 | # index=index, 245 | # columns=uni_labels, 246 | # ) 247 | # st_umis = 
pd.DataFrame(st_umis, 248 | # index=index, 249 | # columns=uni_labels, 250 | # ) 251 | # return {'counts': st_cnt, 252 | # 'proportions': st_prop, 253 | # 'members': st_memb, 254 | # 'umis': st_umis} 255 | 256 | 257 | # def assemble_st(cnt, labels, n_regions, n_cells_tot, alpha, fraction): 258 | # ''' 259 | # Assemble synthetic ST data from count matrix and predicted 260 | # cell type labels for each single-cell. Regions are modelled as groups of spots with 261 | # the same proportion of cell types (and roughly the same number of cells per spot). 262 | 263 | # Parameters 264 | # ---------- 265 | # n_spots: int 266 | # number of spots to simulate 267 | # n_regions: int 268 | # number of regions in which spots should be divided 269 | # alpha: np.array 270 | # dirichlet distribution concentration value 271 | # (can be from cell type proportions in ST) 272 | # fraction: float or np.array 273 | # fraction of transcripts from each cell being 274 | # observed in ST-spot (gene budgets in model) 275 | 276 | # (if you don't want zonation you can just make as many regions as spots) 277 | # ''' 278 | # # count total number of spots 279 | # tot_spots = len(n_cells_tot) 280 | 281 | # # get unique labels 282 | # uni_labels = np.unique(labels.values) 283 | # n_labels = uni_labels.shape[0] 284 | 285 | # # assign spots to regions 286 | # # avoding to have regions with no spots 287 | # if n_regions != tot_spots: 288 | # region_labels = [] 289 | # while len(np.unique(region_labels)) != n_regions: 290 | # region_labels = np.array(random.choices(range(n_regions), k=tot_spots)) 291 | # else: 292 | # region_labels = np.array(range(n_regions)) 293 | 294 | # # prepare matrices 295 | # st_cnt = np.zeros((tot_spots, cnt.shape[1])) 296 | # st_prop = np.zeros((tot_spots, n_labels)) 297 | # st_memb = np.zeros((tot_spots, n_labels)) 298 | # st_umis = np.zeros((tot_spots, n_labels)) 299 | # idx = 0 300 | 301 | # # sort number of cells to have ~ same number of cells per spot for each region 302 | # 
n_cells_tot.sort() 303 | 304 | # # assemble one region at a time 305 | # for reg in range(n_regions): 306 | # print("making reg" + str(reg) + "...", flush=True) 307 | # n_spots_reg = len(region_labels[region_labels == reg]) 308 | # n_cells_vec = n_cells_tot[idx:idx + n_spots_reg] 309 | # reg_data = assemble_region(cnt, labels, n_cells_vec, alpha, fraction) 310 | 311 | # st_cnt[idx:idx + n_spots_reg, :] = reg_data['counts'] 312 | # st_prop[idx:idx + n_spots_reg, :] = reg_data['proportions'] 313 | # st_memb[idx:idx + n_spots_reg, :] = reg_data['members'] 314 | # st_umis[idx:idx + n_spots_reg, :] = reg_data['umis'] 315 | # idx = idx + n_spots_reg 316 | 317 | # index = pd.Index(['Spotx' + str(x + 1) for \ 318 | # x in range(tot_spots)]) 319 | # # convert to pandas DataFrames 320 | # st_cnt = pd.DataFrame(st_cnt, 321 | # index=index, 322 | # columns=cnt.columns, 323 | # ) 324 | 325 | # st_prop = pd.DataFrame(st_prop, 326 | # index=index, 327 | # columns=uni_labels, 328 | # ) 329 | # st_memb = pd.DataFrame(st_memb, 330 | # index=index, 331 | # columns=uni_labels, 332 | # ) 333 | # st_umis = pd.DataFrame(st_umis, 334 | # index=index, 335 | # columns=uni_labels, 336 | # ) 337 | # return {'counts': st_cnt, 338 | # 'proportions': st_prop, 339 | # 'members': st_memb, 340 | # 'umis': st_umis, 341 | # 'regions': region_labels} 342 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | from . import ST_simulation 2 | 3 | __all__ = [ 4 | "ST_simulation", 5 | ] 6 | -------------------------------------------------------------------------------- /assemble_composition.py: -------------------------------------------------------------------------------- 1 | ### Make ST datasets from single-cell data 2 | import argparse 3 | import pickle 4 | 5 | from ST_simulation import * 6 | 7 | parser = argparse.ArgumentParser() 8 | # parser.add_argument('lbl_gen_file', type=str, 
#                     help='path to label generation pickle file')
# parser.add_argument('cnt_gen_file', type=str,
#                     help='path to count generation pickle file')
# parser.add_argument('design_csv', type=str,
#                     help='path to design csv file')
parser.add_argument('seed', type=int,
                    help='random seed of split')
parser.add_argument('--tot_spots', dest='tot_spots', type=int,
                    default=1000,
                    help='Total number of spots to simulate')
parser.add_argument('--out_dir', dest='out_dir', type=str,
                    default='/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/',
                    help='Output directory')
parser.add_argument('--assemble_id', dest='assemble_id', type=int,
                    default=1,
                    help='ID of ST assembly')
parser.add_argument('--annotation_col', dest='anno_col', type=str,
                    default="annotation_1",
                    help='Name of column to use in annotation file (default: annotation_1)')

args = parser.parse_args()

# lbl_gen_file = args.lbl_gen_file
# count_gen_file = args.cnt_gen_file
# design_file = args.design_csv
seed = args.seed
tot_spots = args.tot_spots
out_dir = args.out_dir
assemble_id = args.assemble_id
anno_col = args.anno_col

### Load input data ###
# out_dir is concatenated directly, so it must end with a path separator
lbl_gen_file = out_dir + "labels_generation_" + str(seed) + ".p"
count_gen_file = out_dir + "counts_generation_" + str(seed) + ".p"
design_file = out_dir + "synthetic_ST_seed" + str(seed) + "_design.csv"

with open(lbl_gen_file, "rb") as f:
    lbl_generation = pickle.load(f)
with open(count_gen_file, "rb") as f:
    cnt_generation = pickle.load(f)

uni_labels = lbl_generation[anno_col].unique()
labels = lbl_generation
cnt = cnt_generation

design_df = pd.read_csv(design_file, index_col=0)
# Some mean_ncells values are read back as -0.0 (cause unknown); abs() normalizes them
design_df = abs(design_df)

# ### GENERATE GENE-SPECIFIC SCALING FACTOR ###

# gene_level_alpha = np.random.gamma(5, 5)
# gene_level_beta = np.random.gamma(1, 5)
# gene_level = np.random.gamma(gene_level_alpha, gene_level_beta, size=cnt.shape[1])

# # scale from 0 to 1 (to coincide to fractions)
# gene_level_scaled = (gene_level - min(gene_level)) / (max(gene_level) - min(gene_level))

### Assemble cell type composition
spots_members = assemble_ct_composition(design_df, tot_spots, ncells_scale=1)

# st_cnt_df = assemble_st_2(cnt, labels, spots_members, gene_level_scaled)

### SAVE OUTPUTS ###

synthetic_st = {"composition": spots_members}

for k, v in synthetic_st.items():
    out_name = out_dir + "synthetic_ST_seed" + str(seed) + "_" + str(
        assemble_id) + "_" + k + ".csv"
    v.to_csv(out_name, sep=",", index=True, header=True)
--------------------------------------------------------------------------------
/assemble_design.py:
--------------------------------------------------------------------------------
### Make design of simulated ST datasets from single-cell data
import argparse
import pickle
import numpy as np
import pandas as pd

parser = argparse.ArgumentParser()
# parser.add_argument('lbl_gen_file', type=str,
#                     help='path to label generation pickle file')
# parser.add_argument('cnt_gen_file', type=str,
#                     help='path to count generation pickle file')
parser.add_argument('seed', type=int,
                    help='random seed of split')
parser.add_argument('--tot_spots', dest='tot_spots', type=int,
                    default=1000,
                    help='Total number of spots to simulate')
parser.add_argument('--mean_high', dest='mean_high', type=float,
                    default=2.5,
                    help='Mean cell density for high-density cell types')
parser.add_argument('--mean_low', dest='mean_low', type=float,
                    default=0.8,
                    help='Mean cell density for low-density cell types')
parser.add_argument('--percent_uniform', dest='percent_uniform', type=float,
                    default=80,
                    # '%' must be escaped as '%%' because argparse %-formats help strings
                    help='Sparsity of uniform cell types (%% non-zero spots of total spots)')
parser.add_argument('--percent_sparse', dest='percent_sparse', type=float,
                    default=10,
                    help='Sparsity of sparse cell types (%% non-zero spots of total spots)')
parser.add_argument('--annotation_col', dest='anno_col', type=str,
                    default="annotation_1",
                    help='Name of column to use in annotation file (default: annotation_1)')
parser.add_argument('--out_dir', dest='out_dir', type=str,
                    default='/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/',
                    help='Output directory')
parser.add_argument('--assemble_id', dest='assemble_id', type=int,
                    default=1,
                    help='ID of ST assembly')

args = parser.parse_args()

# lbl_gen_file = args.lbl_gen_file
# count_gen_file = args.cnt_gen_file
seed = args.seed
tot_spots = args.tot_spots
mean_high = args.mean_high
mean_low = args.mean_low
percent_uniform = args.percent_uniform
percent_sparse = args.percent_sparse
out_dir = args.out_dir
assemble_id = args.assemble_id
anno_col = args.anno_col

### Load input data ###
lbl_gen_file = out_dir + "labels_generation_" + str(seed) + ".p"
count_gen_file = out_dir + "counts_generation_" + str(seed) + ".p"

with open(lbl_gen_file, "rb") as f:
    lbl_generation = pickle.load(f)
with open(count_gen_file, "rb") as f:
    cnt_generation = pickle.load(f)

uni_labels = lbl_generation[anno_col].unique()
labels = lbl_generation
cnt = cnt_generation

### Define uniform VS sparse cell types (1 = uniform, 0 = sparse; sparse is drawn with probability 0.8)
uniform_ct = np.random.choice([0, 1], size=len(uni_labels), p=[0.8, 0.2])

### Define low VS high density cell types (1 = low density, 0 = high density; low is drawn with probability 0.8)
uni_low = np.random.choice([0, 1], size=len(uni_labels[uniform_ct == 1]), p=[0.2, 0.8])
reg_low = np.random.choice([0, 1], size=len(uni_labels[uniform_ct == 0]), p=[0.2, 0.8])

design_df = pd.DataFrame({'uniform': uniform_ct}, index=uni_labels)

design_df['density'] = np.nan
design_df.loc[design_df.index[design_df.uniform == 1], 'density'] = uni_low
design_df.loc[design_df.index[design_df.uniform == 0], 'density'] = reg_low

### Generate no of spots per cell type
# Mean share of spots is set by --percent_uniform (default 80%) and --percent_sparse (default 10%)
mean_unif = round((tot_spots / 100) * percent_uniform)
mean_sparse = round((tot_spots / 100) * percent_sparse)
sigma_unif = np.sqrt(mean_unif / 0.3)
sigma_sparse = np.sqrt(mean_sparse / 0.3)

# Gamma parameters by moment matching: shape = mean^2 / sigma^2, scale = sigma^2 / mean
shape_unif = mean_unif ** 2 / sigma_unif ** 2
scale_unif = sigma_unif ** 2 / mean_unif
shape_sparse = mean_sparse ** 2 / sigma_sparse ** 2
scale_sparse = sigma_sparse ** 2 / mean_sparse

unif_nspots = np.round(np.random.gamma(shape=shape_unif, scale=scale_unif, size=sum(design_df.uniform == 1)))
sparse_nspots = np.round(np.random.gamma(shape=shape_sparse, scale=scale_sparse, size=sum(design_df.uniform == 0)))
# if the sampled number of spots is greater than the total number of spots, trim it to the total
unif_nspots = np.minimum(unif_nspots, tot_spots)
sparse_nspots = np.minimum(sparse_nspots, tot_spots)


design_df['nspots'] = np.nan
design_df.loc[design_df.index[design_df.uniform == 1], 'nspots'] = unif_nspots
design_df.loc[design_df.index[design_df.uniform == 0], 'nspots'] = sparse_nspots

### Generate avg density per spot per cell type
sigma_low = np.sqrt(mean_low / 2)
sigma_high = np.sqrt(mean_high / 2)

shape_low = mean_low ** 2 / sigma_low ** 2
scale_low = sigma_low ** 2 / mean_low
shape_high = mean_high ** 2 / sigma_high ** 2
scale_high = sigma_high ** 2 / mean_high

low_ncells_mean = np.random.gamma(shape=shape_low, scale=scale_low, size=sum(design_df.density == 1))
high_ncells_mean = np.random.gamma(shape=shape_high, scale=scale_high, size=sum(design_df.density == 0))

design_df['mean_ncells'] = np.nan
design_df.loc[design_df.index[design_df.density == 1], 'mean_ncells'] = low_ncells_mean
design_df.loc[design_df.index[design_df.density == 0], 'mean_ncells'] = high_ncells_mean

# Use the seed directly; rstrip(".p") strips a set of characters, not a suffix
out_name = out_dir + "synthetic_ST_seed" + str(seed) + "_design.csv"
design_df.to_csv(out_name, sep=",", index=True, header=True)
--------------------------------------------------------------------------------
/assemble_st.py:
--------------------------------------------------------------------------------
### Make ST datasets from single-cell data
import argparse
import pickle

import pandas as pd  # used for pd.read_csv when loading the composition table

from ST_simulation import *

parser = argparse.ArgumentParser()
parser.add_argument('seed', type=int,
                    help='random seed of split')
parser.add_argument('--out_dir', dest='out_dir', type=str,
                    default='/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/',
                    help='Output directory')
parser.add_argument('--assemble_id', dest='assemble_id', type=int,
                    default=1,
                    help='ID of ST assembly')
parser.add_argument('--annotation_col', dest='anno_col', type=str,
                    default="annotation_1",
                    help='Name of column to use in annotation file (default: annotation_1)')

args = parser.parse_args()

# lbl_gen_file = args.lbl_gen_file
# count_gen_file = args.cnt_gen_file
# spots_members_file = args.spots_comp
seed = args.seed
out_dir = args.out_dir
assemble_id = args.assemble_id
# usecols = args.usecols
anno_col = args.anno_col

### Load input data ###
lbl_gen_file = out_dir + "labels_generation_" + str(seed) + ".p"
count_gen_file = out_dir + "counts_generation_" + str(seed) + ".p"
spots_members_file = out_dir + "synthetic_ST_seed" + str(seed) + "_" + str(assemble_id) + "_composition.csv"

with open(lbl_gen_file, "rb") as f:
    lbl_generation = pickle.load(f)
with open(count_gen_file, "rb") as f:
    cnt_generation = pickle.load(f)
spots_members = pd.read_csv(spots_members_file, index_col=0)

tot_spots = spots_members.shape[1]
uni_labels = lbl_generation[anno_col].unique()
labels = lbl_generation
cnt = cnt_generation

# ### GENERATE GENE-SPECIFIC SCALING FACTOR ###

# gene_level_alpha = np.random.gamma(5, 5)
# gene_level_beta = np.random.gamma(1, 5)
# gene_level = np.random.gamma(gene_level_alpha, gene_level_beta, size=cnt.shape[1])

# # scale from 0 to 1 (to coincide to fractions)
# gene_level_scaled = (gene_level - min(gene_level)) / (max(gene_level) - min(gene_level))


st_cnt_df, st_umis_df = assemble_st_2(cnt, labels, spots_members)

### SAVE OUTPUTS ###

synthetic_st = {"counts": st_cnt_df, "umis": st_umis_df}
# out_name = spots_members_file.rstrip(".csv") + "_counts.csv"
# st_cnt_df.to_csv(out_name, sep=",", index=True, header=True)

for k, v in synthetic_st.items():
    out_name = out_dir + "synthetic_ST_seed" + str(seed) + "_" + str(
        assemble_id) + "_" + k + ".csv"
    v.to_csv(out_name, sep=",", index=True, header=True)

--------------------------------------------------------------------------------
/merge_synthetic_ST.py:
--------------------------------------------------------------------------------
### Merge synthetic ST data ###
import argparse
import os

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument('outdir', type=str,
                    help='path to synthetic ST directory')
parser.add_argument('seed', type=int,
                    help='random seed for generation')
args = parser.parse_args()

outdir = args.outdir
seed = args.seed


def read_synthetic_ST(outdir, seed, assemble_id):
    # File names match what assemble_st.py and assemble_composition.py write
    cnt_file = 'synthetic_ST_seed{0}_{1}_counts.csv'.format(seed, assemble_id)
    comp_file = 'synthetic_ST_seed{0}_{1}_composition.csv'.format(seed, assemble_id)
    cnt_df = pd.read_csv(os.path.join(outdir, cnt_file), index_col=0)
    comp_df = pd.read_csv(os.path.join(outdir, comp_file), index_col=0)
    return (cnt_df, comp_df)


design_file = [x for x in os.listdir(outdir) if 'design.csv' in x and str(seed) in x][0]
seed_files = [x for x in os.listdir(outdir) if str(seed) in x and x.endswith("counts.csv")]
ids = [x.split("_")[3] for x in seed_files]

cnt_df, comp_df = read_synthetic_ST(outdir, seed, ids[0])
for assemble_id in ids[1:]:
    cnt_df1, comp_df1 = read_synthetic_ST(outdir, seed, assemble_id)
    # stack spots only when the gene columns match
    if cnt_df1.shape[1] == cnt_df.shape[1]:
        cnt_df = pd.concat([cnt_df, cnt_df1])
    # append spot columns only when the cell type rows match
    if comp_df1.shape[0] == comp_df.shape[0]:
        comp_df = pd.concat([comp_df, comp_df1], axis=1)

cnt_df.reset_index(drop=True, inplace=True)
cnt_df.index = ["Spotx" + str(x) for x in cnt_df.index]
comp_df.columns = ["Spotx" + str(x) for x in range(comp_df.shape[1])]

cnt_df.to_csv(os.path.join(outdir, design_file.split("design")[0] + "counts.csv"))
comp_df.to_csv(os.path.join(outdir, design_file.split("design")[0] + "composition.csv"))
--------------------------------------------------------------------------------
/run_simulation2.sh:
--------------------------------------------------------------------------------
#!/bin/bash

seed=$1
n_spots=$2
id=$3

# Script names follow the files present in this repository
python assemble_composition.py ${seed} --tot_spots $n_spots --assemble_id $id
python assemble_st.py ${seed} --assemble_id $id
--------------------------------------------------------------------------------
/split_sc.py:
--------------------------------------------------------------------------------
### SPLIT SINGLE-CELL DATASET IN GENERATION AND VALIDATION SET ###

import argparse
import pickle
import random
import anndata
import scanpy as sc
import numpy as np
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument('h5ad_file', type=str,
                    help='path to h5ad file with raw single-cell data')
parser.add_argument('annotation_file', type=str,
                    help='path to csv file with cell annotations')
parser.add_argument('--annotation_col', dest='anno_col', type=str,
                    default="annotation_1",
                    help='Name of column to use in annotation file (default: annotation_1)')
parser.add_argument('--out_dir', dest='out_dir', type=str,
                    default="/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/",
                    help='Output directory')
args = parser.parse_args()

# adata_file = "/nfs/team205/vk7/sanger_projects/cell2location/notebooks/data/mouse_viseum_snrna/rawdata/all_cells_20200625.h5ad"
# annotation_file = "/nfs/team205/vk7/sanger_projects/cell2location/notebooks/results/mouse_viseum_snrna/snRNA_annotation_20200229.csv"
# anno_col = "annotation_1"
# out_dir = "/nfs/team283/ed6/simulation/lowdens_synthetic_ST_fewcells/"

adata_file = args.h5ad_file
annotation_file = args.annotation_file
anno_col = args.anno_col
out_dir = args.out_dir

### Load input single-cell data and annotations ###

adata_raw = sc.read_h5ad(adata_file)

## Cell type annotations
labels = pd.read_csv(annotation_file, index_col=0)

# # Select genes used for clustering
# adata_raw = adata_raw[:, [x for x in adata_raw.var_names if x in adata_snrna.var_names]]

# Dense cells x genes count table (.toarray() requires adata_raw.X to be a sparse matrix)
adata_df = pd.DataFrame(adata_raw.X.toarray(), index=adata_raw.obs_names, columns=adata_raw.var_names)
adata_df.index.name = "cell"

### Subset to cells with label ###
adata_df = adata_df.loc[labels.index, :]

### Split generation and validation set ###

sc_cnt = adata_df
sc_lbl = pd.DataFrame(labels[anno_col])

# match count and label data
inter = sc_cnt.index.intersection(sc_lbl.index)

sc_lbl = sc_lbl.loc[inter, :]
sc_cnt = sc_cnt.loc[inter, :]

labels = sc_lbl.iloc[:, 0].values

# get unique labels
uni_labs, uni_counts = np.unique(labels, return_counts=True)

# only keep types with more than 40 cells
keep_types = uni_counts > 40
keep_cells = np.isin(labels, uni_labs[keep_types])

labels = labels[keep_cells]
sc_cnt = sc_cnt.iloc[keep_cells, :]
sc_lbl = sc_lbl.iloc[keep_cells, :]

uni_labs, uni_counts = np.unique(labels, return_counts=True)
n_types = uni_labs.shape[0]

# draw three random seeds; each one identifies an independent split
seeds = random.sample(range(1000), 3)

for seed in seeds:
    random.seed(seed)
    print("Seed " + str(seed))
    # get member indices for each set
    idx_generation = []
    idx_validation = []
    for z in range(n_types):
        tmp_idx = np.where(labels == uni_labs[z])[0]
        # half of the cells of each type go to the generation set
        n_generation = int(round(tmp_idx.shape[0] / 2))
        smp_gen = random.sample(list(tmp_idx), k=n_generation)
        smp_val = tmp_idx[np.isin(tmp_idx, smp_gen, invert=True)]
        idx_generation += smp_gen
        idx_validation += smp_val.tolist()
    idx_generation.sort()
    idx_validation.sort()
    # make sure no members overlap between sets
    assert len(set(idx_generation).intersection(set(idx_validation))) == 0, \
        "validation and generation sets are not disjoint"
    # assemble sets from indices
    cnt_validation = sc_cnt.iloc[idx_validation, :]
    cnt_generation = sc_cnt.iloc[idx_generation, :]
    lbl_validation = sc_lbl.iloc[idx_validation, :]
    lbl_generation = sc_lbl.iloc[idx_generation, :]
    with open(out_dir + "labels_generation_" + str(seed) + ".p", "wb") as f:
        pickle.dump(lbl_generation, f)
    with open(out_dir + "counts_generation_" + str(seed) + ".p", "wb") as f:
        pickle.dump(cnt_generation, f)
    with open(out_dir + "labels_validation_" + str(seed) + ".p", "wb") as f:
        pickle.dump(lbl_validation, f)
    with open(out_dir + "counts_validation_" + str(seed) + ".p", "wb") as f:
        pickle.dump(cnt_validation, f)

--------------------------------------------------------------------------------
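The per-cell-type split performed in `split_sc.py` can be sketched with the standard library alone. This is a minimal illustration under assumptions: the helper name `split_by_label` and the toy label list are hypothetical and not part of the repository. It mirrors the script's rule of sampling `int(round(n / 2))` cells of each type for the generation set and assigning the remaining cells to the validation set.

```python
import random
from collections import defaultdict


def split_by_label(labels, seed):
    """Split positions of `labels` into two disjoint sets, label by label.

    Hypothetical helper for illustration only; not part of the repository.
    """
    rng = random.Random(seed)  # local generator, so the split is reproducible

    # group the positions of the cells by their label
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    generation, validation = [], []
    for label in sorted(by_label):
        positions = by_label[label]
        # same rule as split_sc.py: about half of each type goes to generation
        n_generation = int(round(len(positions) / 2))
        chosen = set(rng.sample(positions, k=n_generation))
        generation += [p for p in positions if p in chosen]
        validation += [p for p in positions if p not in chosen]
    return sorted(generation), sorted(validation)


# toy input: 6 cells of type "A" and 4 cells of type "B"
labels = ["A"] * 6 + ["B"] * 4
generation, validation = split_by_label(labels, seed=42)
```

Because Python's `round` rounds halves to the nearest even integer, a cell type with 5 cells contributes 2 cells to the generation set and 3 to the validation set, while a type with 7 cells contributes 4 and 3.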