├── .github
    ├── .gitignore
    └── workflows
    │   └── R-CMD-check.yaml
├── .gitignore
├── DESCRIPTION
├── NAMESPACE
├── NEWS.md
├── R
    ├── balanced_dataset.R
    ├── building_networks.R
    ├── neighborhood_analysis.R
    ├── post_RCTD.R
    ├── purification.R
    ├── split_dataset.R
    ├── utils.R
    └── visualization.R
├── README.md
├── doc
    ├── Run_RCTD_and_SPLIT_on_Xenium.R
    ├── Run_RCTD_and_SPLIT_on_Xenium.Rmd
    └── Run_RCTD_and_SPLIT_on_Xenium.html
├── inst
    └── CITATION
└── vignettes
    ├── Run_RCTD_and_SPLIT_on_Xenium.Rmd
    └── plots
        └── SPLIT_schema.png


/.github/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | 


--------------------------------------------------------------------------------
/.github/workflows/R-CMD-check.yaml:
--------------------------------------------------------------------------------
 1 | # Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
 2 | # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
 3 | on:
 4 |   push:
 5 |     branches: [main, master]
 6 |     paths-ignore:
 7 |       - 'README.md'
 8 |       - 'README.Rmd'
 9 |       - 'docs/**'
10 |       - '*.md'
11 |       - '*.Rproj'
12 |       - '.gitignore'
13 |       - '.github/ISSUE_TEMPLATE/**'
14 |   pull_request:
15 | 
16 | name: R-CMD-check.yaml
17 | 
18 | permissions: read-all
19 | 
20 | jobs:
21 |   R-CMD-check:
22 |     runs-on: ubuntu-latest
23 |     env:
24 |       GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
25 |       R_KEEP_PKG_SOURCE: yes
26 |     steps:
27 |       - uses: actions/checkout@v4
28 | 
29 |       - uses: r-lib/actions/setup-r@v2
30 |         with:
31 |           use-public-rspm: true
32 | 
33 |       - uses: r-lib/actions/setup-r-dependencies@v2
34 |         with:
35 |           extra-packages: any::rcmdcheck
36 |           needs: check
37 | 
38 |       - uses: r-lib/actions/check-r-package@v2
39 |         with:
40 |           upload-snapshots: true
41 |           error-on: '"error"'
42 |           args: 'c("--no-build-vignettes", "--no-manual", "--no-stop-on-warning")'
43 |           build_args: 'c("--no-build-vignettes", "--no-manual","--compact-vignettes=gs+qpdf")'
44 | 
45 | 
46 | 
47 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # R-specific files
 2 | .Rhistory
 3 | .Rapp.history
 4 | .RData
 5 | .Ruserdata
 6 | *.Rproj.user
 7 | 
 8 | # RStudio files
 9 | .Rproj
10 | *.Rproj.user/
11 | 
12 | # Compiled files
13 | *.o
14 | *.a
15 | *.dll
16 | *.so
17 | *.dylib
18 | 
19 | # Package binary files
20 | *.tar.gz
21 | *.zip
22 | 
23 | # R package build artifacts
24 | /inst/doc/
25 | /inst/examples/
26 | /inst/test/
27 | /man/*.Rd
28 | 
29 | # Temporary files
30 | *.log
31 | *.tmp
32 | *.bak
33 | *.swp
34 | 
35 | # .Rbuildignore file should also be included
36 | /.Rbuildignore
37 | 
38 | # Other artifacts generated by devtools and roxygen
39 | .Rbuildignore
40 | dev/
41 | 
42 | # Test files
43 | /tests/testthat/*.Rout
44 | 
45 | # RStudio project files
46 | *.Rproj
47 | *.Rproj.user/
48 | 
49 | # VSCode settings
50 | .vscode/
51 | 
52 | # Other directories and files
53 | .DS_Store
54 | Thumbs.db
55 | .Rproj.user
56 | #/doc/
57 | /Meta/
58 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: SPLIT
 2 | Type: Package
 3 | Title: Spatial Purification of Layered Intracellular Transcripts
 4 | Version: 0.1.0
 5 | Authors@R: c(
 6 |     person(
 7 |       "Mariia", "Bilous",
 8 |       email = "Mariia.Bilous@chuv.ch",
 9 |       role = c("aut", "cre")
10 |     )
11 |   )
12 | Description: 
13 |   The SPLIT (Spatial Purification of Layered Intracellular Transcripts) package provides 
14 |   tools to purify spatial transcriptomics data using decomposition results from the RCTD 
15 |   (Robust Cell Type Decomposition) algorithm. It offers functionality to adjust cell-type 
16 |   profiles by removing contributions attributed to background contamination. SPLIT also 
17 |   computes contamination scores and estimates local diffusion of contaminating profiles 
18 |   to help distinguish between contamination and phenotypic variation. Additionally, 
19 |   the package includes methods to refine cell-type annotations based on transcriptomic 
20 |   neighborhood homogeneity. The resulting output contains profiles with reduced 
21 |   contamination and increased cell-type specificity.
22 | License: MIT
23 | Encoding: UTF-8
24 | LazyData: true
25 | Depends: R (>= 4.0.0)
26 | Imports:
27 |     Matrix,
28 |     dplyr,
29 |     forcats,
30 |     entropy,
31 |     spacexr,
32 |     ggplot2,
33 |     scatterpie,
34 |     Seurat,
35 |     BiocParallel,
36 |     igraph
37 | Suggests:
38 |     testthat,
39 |     knitr,
40 |     rmarkdown
41 | VignetteBuilder: knitr
42 | RoxygenNote: 7.3.2
43 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
 1 | # Generated by roxygen2: do not edit by hand
 2 | 
 3 | export(add_annotation_from_neighbors)
 4 | export(add_cell_types_neighborhood_weights)
 5 | export(add_individual_w1_of_second_type_in_neighborhood)
 6 | export(add_individual_w2_of_second_type_in_neighborhood)
 7 | export(add_infiltration_metrics_to_neighborhood)
 8 | export(add_max_weight_of_spilling_type_in_neighborhood)
 9 | export(add_neigborhood_nCount_of_spilling_cell_type)
10 | export(add_neigborhood_weight_composition_of_spilling_cell_type)
11 | export(add_neighborhood_annotation_certainty)
12 | export(add_neighborhood_weight_composition)
13 | export(add_rctd_to_neighborhood)
14 | export(add_spatial_metric)
15 | export(add_transcriptomics_metric)
16 | export(balance_raw_and_purified_data_by_score)
17 | export(balance_raw_and_purified_data_by_spot_class)
18 | export(balance_split)
19 | export(build_spatial_network)
20 | export(build_transcriptomics_network)
21 | export(compute_neighborhood)
22 | export(crop_pie_df_to_cell)
23 | export(decompose_doublet)
24 | export(get_pieplot_df)
25 | export(get_singlet_data)
26 | export(neighborhood_analysis_to_metadata)
27 | export(plot_pie)
28 | export(plot_pie_around_cell)
29 | export(plot_pie_by_coordinates)
30 | export(purify)
31 | export(purify_counts_with_rctd)
32 | export(run_post_process_RCTD)
33 | export(save_pieplot)
34 | export(split)
35 | import(BiocParallel)
36 | import(Seurat)
37 | import(dplyr)
38 | import(spacexr)
39 | importFrom(dplyr,"%>%")
40 | importFrom(dplyr,filter)
41 | importFrom(dplyr,select)
42 | importFrom(entropy,entropy)
43 | importFrom(ggplot2,aes)
44 | importFrom(ggplot2,annotate)
45 | importFrom(ggplot2,coord_fixed)
46 | importFrom(ggplot2,ggplot)
47 | importFrom(ggplot2,guides)
48 | importFrom(ggplot2,scale_color_manual)
49 | importFrom(ggplot2,scale_fill_manual)
50 | importFrom(ggplot2,theme_void)
51 | importFrom(scatterpie,geom_scatterpie)
52 | 


--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
 1 | # SPLIT 0.1.0
 2 | 
 3 | Initial release of the SPLIT (Spatial Purification of Layered Intracellular Transcripts) package.
 4 | 
 5 | ## Features
 6 | 
 7 | -   Purifies spatial transcriptomics data by removing background contamination using RCTD deconvolution results.
 8 | -   Computes local contamination scores and models background diffusion.
 9 | -   Refines cell-type annotations based on neighborhood transcriptomic homogeneity (SPLIT-shift).
10 | -   Outputs a purified count matrix with improved cell-type specificity.
11 | -   Scores computed by SPLIT can be used to identify highly contaminating cell types and assess cell co-localization patterns.
12 | 
13 | ⚠️ **IMPORTANT:**\
14 | SPLIT currently requires **doublet-mode** RCTD results from the original [spacexr GitHub repository](https://github.com/dmcable/spacexr) or its faster [HD fork](https://github.com/jpromeror/spacexr/tree/HD), **not** from the newly released [Bioconductor version](https://www.bioconductor.org/packages/release/bioc/html/spacexr.html).\
15 | **Compatibility with Bioconductor's spacexr is coming soon.**
16 | 


--------------------------------------------------------------------------------
/R/balanced_dataset.R:
--------------------------------------------------------------------------------
  1 | #' @keywords internal
  2 | swap_expr <- rlang::expr({
  3 |   cells_to_swap_label <- meta_data %>%
  4 |     filter(!first_type_neighborhood_agreement & !first_type_class_neighborhood_agreement) %>%
  5 |     filter(second_type_class == first_type_class_neighborhood) %>%
  6 |     rownames()
  7 | 
  8 |   meta_data <- meta_data %>%
  9 |     mutate(across(c(first_type, second_type, weight_first_type, weight_second_type),
 10 |                   .fns = list(before_swap = identity))) %>%  # Create backup in one step
 11 |     mutate(
 12 |       swap = rownames(.) %in% cells_to_swap_label,  # Logical column for swapping
 13 |       first_type = if_else(swap, second_type_before_swap, first_type),
 14 |       second_type = if_else(swap, first_type_before_swap, second_type),
 15 |       weight_first_type = if_else(swap, weight_second_type_before_swap, weight_first_type),
 16 |       weight_second_type = if_else(swap, weight_first_type_before_swap, weight_second_type)
 17 |     )
 18 | 
 19 |   # Replace corrected counts for swapped cells
 20 |   cells_to_swap_correcred_profile <- cells_to_swap_label[cells_to_swap_label %in% colnames(xe_purified)] #intersect(cells_to_swap_label, cells_to_replace_with_purified)
 21 |   count_matrix[, cells_to_swap_correcred_profile] <-
 22 |     GetAssayData(xe_raw, assay = default_assay, layer = "counts")[common_genes, cells_to_swap_correcred_profile] -
 23 |     GetAssayData(xe_purified, assay = default_assay, layer = "counts")[common_genes, cells_to_swap_correcred_profile]
 24 |   count_matrix[count_matrix < 0] <- 0
 25 | })
 26 | 
 27 | #' Balance raw and purified data by merging high quality raw data and otherwise purified data
 28 | #'
 29 | #' Merges raw and purified data into one dataset by keeping raw counts
 30 | #' for high quality cells and replacing contaminated cells with their purified profiles. Reject cells are removed
 31 | #'
 32 | #'
 33 | #' @param xe_raw raw seurat object
 34 | #' @param xe_purified purified seurat object (post \code{purify_counts_with_rctd})
 35 | #' @param threshold value below which cell is considered as contaminated and is replaced by purified profile.  For the moment, it's a single value, but should accept cell-type-specific vector later on #TODO
 36 | #' @param score_name name of the param to threshold on
 37 | #' @param DO_swap_lables A logical indicating whether to swap first and second cell types for cells which label does not agree with its transcriptomic neighborhod lable
 38 | #'
 39 | #' @export
 40 | 
 41 | balance_raw_and_purified_data_by_score <- function(
 42 |     xe_raw,
 43 |     xe_purified,
 44 |     threshold = .15,
 45 |     score_name = c("neighborhood_weights_second_type", "second_type_neighbors_N",  "second_type_neighbors_no_reject_N"),
 46 |     spot_class_key = "spot_class",
 47 |     DO_swap_lables = FALSE,
 48 |     default_assay = "Xenium"
 49 | ){
 50 |   score_name <- score_name[1]
 51 |   if(!score_name %in% colnames(xe_raw@meta.data)){
 52 |     stop("score", score_name, "is not available in `xe_raw`, please compute it first!")
 53 |   }
 54 | 
 55 |   if(!"first_type" %in% colnames(xe_raw@meta.data)){
 56 |     stop("`first_type` is not available in `xe_raw`, please compute it first!")
 57 |   }
 58 | 
 59 |   if(!spot_class_key %in% colnames(xe_raw@meta.data)){
 60 |     stop("`spot_class` is not available in `xe_raw`, please compute it first!")
 61 |   }
 62 | 
 63 |   cells_to_replace_with_purified <-
 64 |     xe_raw@meta.data %>%
 65 |     filter((.data[[spot_class_key]] != "reject" & .data[[score_name]] > threshold) |
 66 |              .data[[spot_class_key]] == "doublet_uncertain") %>%
 67 |     rownames()
 68 | 
 69 |   cells_to_remove <-
 70 |     xe_raw@meta.data %>%
 71 |     filter(.data[[spot_class_key]] == "reject") %>%
 72 |     rownames()
 73 | 
 74 |   cells_to_keep_raw <- setdiff(colnames(xe_raw), c(cells_to_replace_with_purified, cells_to_remove))
 75 |   cells_to_replace_with_purified <- setdiff(cells_to_replace_with_purified, cells_to_remove)
 76 |   # restrict to cells present in purified data
 77 |   cells_to_replace_with_purified <- cells_to_replace_with_purified[cells_to_replace_with_purified %in% colnames(xe_purified)]
 78 | 
 79 |   xe_raw$purification_status <- "raw"
 80 |   xe_raw@meta.data[cells_to_replace_with_purified,"purification_status"] <- xe_purified@meta.data[cells_to_replace_with_purified,"purification_status"]
 81 |   xe_raw@meta.data[cells_to_remove,"purification_status"] <- "removed"
 82 | 
 83 |   common_genes <- intersect(rownames(xe_raw), rownames(xe_purified))
 84 |   count_matrix <- cbind(GetAssayData(xe_raw, assay = default_assay, layer = "counts")[common_genes, cells_to_keep_raw],
 85 |                         GetAssayData(xe_purified, assay = default_assay, layer = "counts")[common_genes, cells_to_replace_with_purified])
 86 | 
 87 |   meta_data = xe_raw@meta.data[colnames(count_matrix),]
 88 | 
 89 |   if(DO_swap_lables){
 90 |     eval(swap_expr)  # Expands the swap logic in-place
 91 |   } else {
 92 |     meta_data$swap <- FALSE
 93 |   }
 94 | 
 95 |   xe_balanced <- CreateSeuratObject(counts = count_matrix, assay = default_assay, meta.data = meta_data)
 96 | 
 97 |   return(xe_balanced)
 98 | }
 99 | 
100 | #' Balance raw and purified data using spot class categories
101 | #'
102 | #' Merges raw and purified data into one dataset by retaining raw counts
103 | #' for singlet cells and replacing other cell categories with their purified profiles.
104 | #' Reject cells are removed from the dataset.
105 | #'
106 | #' @param xe_raw A raw Seurat object containing the original spatial transcriptomics data.
107 | #' @param xe_purified A purified Seurat object, typically post \code{purify_counts_with_rctd}, containing corrected cell profiles.
108 | #' @param spot_class_key A character string specifying the metadata column name in \code{xe_raw} that indicates the spot classification (default: `"spot_class"`).
109 | #' @param DO_swap_lables A logical indicating whether to swap first and second cell types for cells which label does not agree with its transcriptomic neighborhod lable
110 | #'
111 | #' @return A balanced Seurat object with a merged count matrix where:
112 | #' \itemize{
113 | #'   \item Singlet cells retain their raw counts.
114 | #'   \item Cells classified as non-singlet (but not rejected) are replaced with purified counts.
115 | #'   \item Rejected cells are removed.
116 | #' }
117 | #'
118 | #' @details The function categorizes cells into three groups:
119 | #' \itemize{
120 | #'   \item Cells classified as `"singlet"` retain their raw data.
121 | #'   \item Cells classified as anything other than `"singlet"` (excluding `"reject"`) are replaced with their purified profiles.
122 | #'   \item Cells classified as `"reject"` are completely removed from the dataset.
123 | #' }
124 | #'
125 | #' @export
126 | 
127 | balance_raw_and_purified_data_by_spot_class <- function(
128 |     xe_raw,
129 |     xe_purified,
130 |     spot_class_key = "spot_class",
131 |     DO_swap_lables = FALSE,
132 |     default_assay = "Xenium"
133 | ){
134 | 
135 |   if(!spot_class_key %in% colnames(xe_raw@meta.data)){
136 |     stop("spot_class_key", spot_class_key, "is not available in `xe_raw`, please compute it first!")
137 |   }
138 | 
139 |   if(!"first_type" %in% colnames(xe_raw@meta.data)){
140 |     stop("`first_type` is not available in `xe_raw`, please compute it first!")
141 |   }
142 | 
143 |   cells_to_remove <-
144 |     xe_raw@meta.data %>%
145 |     filter(.data[[spot_class_key]] == "reject") %>%
146 |     rownames()
147 | 
148 |   cells_to_replace_with_purified <-
149 |     xe_raw@meta.data %>%
150 |     filter(.data[[spot_class_key]] != "singlet") %>%
151 |     rownames()
152 |   cells_to_replace_with_purified <- setdiff(cells_to_replace_with_purified, cells_to_remove)
153 | 
154 |   cells_to_keep_raw <-
155 |     xe_raw@meta.data %>%
156 |     filter(.data[[spot_class_key]] == "singlet") %>%
157 |     rownames()
158 | 
159 | 
160 |   xe_raw$purification_status <- "raw"
161 |   xe_raw@meta.data[cells_to_replace_with_purified,"purification_status"] <- xe_purified@meta.data[cells_to_replace_with_purified,"purification_status"]
162 |   xe_raw@meta.data[cells_to_remove,"purification_status"] <- "removed"
163 | 
164 |   common_genes <- intersect(rownames(xe_raw), rownames(xe_purified))
165 |   count_matrix <- cbind(GetAssayData(xe_raw, assay = default_assay, layer = "counts")[common_genes, cells_to_keep_raw],
166 |                         GetAssayData(xe_purified, assay = default_assay, layer = "counts")[common_genes, cells_to_replace_with_purified])
167 | 
168 |   meta_data = xe_raw@meta.data[colnames(count_matrix),]
169 | 
170 |   if(DO_swap_lables){
171 |     eval(swap_expr)  # Expands the swap logic in-place
172 |   } else {
173 |     meta_data$swap <- FALSE
174 |   }
175 | 
176 |   xe_balanced <- CreateSeuratObject(counts = count_matrix, assay = default_assay, meta.data = meta_data)
177 |   return(xe_balanced)
178 | }
179 | 
180 | 


--------------------------------------------------------------------------------
/R/building_networks.R:
--------------------------------------------------------------------------------
  1 | #' Compute Neighborhood Graph with Optional Pruning
  2 | #'
  3 | #' This function computes or retrieves a k-nearest neighbors (KNN) graph from a given object,
  4 | #' using a specified dimensional reduction or features. Optionally, it prunes edges based on
  5 | #' a specified distance threshold.
  6 | #'
  7 | #' @param obj An object (e.g., Seurat object) containing data to compute the KNN graph.
  8 | #' @param neighbors_name A character string specifying the name of an existing neighbors object
  9 | #'   in `obj`. If `NULL`, neighbors are computed.
 10 | #' @param k_knn An integer specifying the number of nearest neighbors to use. Default is 20.
 11 | #' @param reduction A character string specifying the dimensional reduction to use
 12 | #'   (e.g., "pca", "spatial"). Default is "pca".
 13 | #' @param dims A numeric vector specifying the dimensions to use for the reduction. Default is `1:50`.
 14 | #' @param features A character vector of feature names to use for neighbor computation. Default is `NULL`.
 15 | #' @param graph_name A character string specifying the name of the graph to store in `obj`. Default is `"transcriptomics_knn"`.
 16 | #' @param DO_prune A logical value indicating whether to prune edges in the graph based on a distance threshold. Default is `FALSE`.
 17 | #' @param rad_pruning A numeric value specifying the maximum distance for retaining edges. Edges with distances greater than this value will be pruned. Default is `Inf`.
 18 | #' @param ... Additional parameters passed to `Seurat::FindNeighbors`.
 19 | #'
 20 | #' @return A list containing the following elements:
 21 | #' \describe{
 22 | #'   \item{`graph`}{An igraph object representing the (possibly pruned) KNN graph.}
 23 | #'   \item{`nn_idx`}{A matrix of nearest-neighbor indices.}
 24 | #'   \item{`nn_dist`}{A matrix of distances to the nearest neighbors.}
 25 | #'   \item{`cell_id`}{A vector of cell IDs corresponding to the rows of `nn_idx` and `nn_dist`.}
 26 | #' }
 27 | #'
 28 | #' @details
 29 | #' If `DO_prune` is `TRUE`, the function removes edges from the graph where the
 30 | #' distance exceeds the `rad_pruning` threshold. A warning is issued if `rad_pruning` is
 31 | #' larger than the maximum distance in the graph, as no edges will be pruned in that case.
 32 | #'
 33 | #' @examples
 34 | #' \dontrun{
 35 | #' # Example usage with a Seurat object
 36 | #' result <- compute_neighborhood(obj = seurat_obj, reduction = "pca", k_knn = 15)
 37 | #' plot(result$graph)
 38 | #'
 39 | #' # Example with pruning
 40 | #' result <- compute_neighborhood(
 41 | #'   obj = seurat_obj,
 42 | #'   reduction = "pca",
 43 | #'   k_knn = 15,
 44 | #'   DO_prune = TRUE,
 45 | #'   rad_pruning = 0.2
 46 | #' )
 47 | #' plot(result$graph)
 48 | #' }
 49 | #'
 50 | #' @export
 51 | #'
 52 | 
 53 | compute_neighborhood <- function(
 54 |     obj,
 55 |     neighbors_name = NULL,
 56 |     k_knn = 20,
 57 |     reduction = "pca",
 58 |     dims = 1:50,
 59 |     features = NULL,
 60 |     graph_name = "transcriptomics_knn",
 61 |     DO_prune = FALSE,
 62 |     rad_pruning = Inf,
 63 |     ...
 64 | ) {
 65 |   # Check if neighbors exist or compute them
 66 |   knn_neighbors <- if (!is.null(neighbors_name)) {
 67 |     if (!(neighbors_name %in% obj@neighbors)) {
 68 |       stop(paste("Neighbors", neighbors_name, "do not exist. Compute one or set `neighbors_name` to `NULL` to re-compute neighbors."))
 69 |     }
 70 |     obj@neighbors[[neighbors_name]]
 71 |   } else {
 72 |     obj <- Seurat::FindNeighbors(
 73 |       obj,
 74 |       reduction = reduction,
 75 |       dims = dims,
 76 |       features = features,
 77 |       k = k_knn,
 78 |       return.neighbor = TRUE,
 79 |       graph.name = graph_name,
 80 |       ...
 81 |     )
 82 |     obj@neighbors[[graph_name]]
 83 |   }
 84 | 
 85 |   # Extract adjacency list and clean up NAs
 86 |   adjacency_knn <- apply(knn_neighbors@nn.idx[, -1, drop = FALSE], 1, function(x) x[!is.na(x)])
 87 | 
 88 |   # Create graph from adjacency list
 89 |   graph  <- igraph::graph_from_adj_list(adjacency_knn)
 90 |   n_edge <- igraph::ecount(graph)
 91 | 
 92 |   max_dist <- max(knn_neighbors@nn.dist)
 93 | 
 94 |   # Pruning
 95 |   if(DO_prune){
 96 |     if(rad_pruning > max_dist){
 97 |       warning("No puning as `rad_pruning` is larger than any distance in the graph ")
 98 |     } else {
 99 | 
100 |       # prune edges with large distance (remove distant neighbors)
101 |       knn_neighbors@nn.idx[knn_neighbors@nn.dist > rad_pruning]  <- NA
102 |       knn_neighbors@nn.dist[knn_neighbors@nn.dist > rad_pruning] <- NA
103 | 
104 |       adjacency_knn <- apply(knn_neighbors@nn.idx[, -1, drop = FALSE], 1, function(x) x[!is.na(x)])
105 | 
106 |       graph         <- igraph::graph_from_adj_list(adjacency_knn)
107 |       n_edge_prune  <- igraph::ecount(graph)
108 |       delta_n_edge  <- n_edge-n_edge_prune
109 |       message(paste("N =", delta_n_edge, "(", round(100*delta_n_edge/n_edge), "%) edges were pruned"))
110 |     }
111 |   }
112 | 
113 |   # Prepare results
114 |   result <- list(
115 |     graph = graph,
116 |     nn_idx = knn_neighbors@nn.idx,
117 |     nn_dist = knn_neighbors@nn.dist,
118 |     cell_id = knn_neighbors@cell.names
119 |   )
120 | 
121 |   return(result)
122 | }
123 | 
124 | #' Build Spatial Network
125 | #'
126 | #' A wrapper function around `compute_neighborhood` to create a spatial KNN graph
127 | #' using spatial reduction data.
128 | #'
129 | #' @param obj An object (e.g., Seurat object) containing spatial data.
130 | #' @param ... Additional parameters passed to `compute_neighborhood`.
131 | #'
132 | #' @return A list containing the following elements:
133 | #' \describe{
134 | #'   \item{`graph`}{An igraph object representing the KNN graph.}
135 | #'   \item{`nn_idx`}{A matrix of nearest-neighbor indices.}
136 | #'   \item{`nn_dist`}{A matrix of distances to the nearest neighbors.}
137 | #'   \item{`cell_id`}{A vector of cell IDs corresponding to the rows of `nn_idx` and `nn_dist`.}
138 | #' }
139 | #'
140 | #' @examples
141 | #' \dontrun{
142 | #' # Example usage with a Seurat object containing spatial data
143 | #' spatial_result <- build_spatial_network(obj = seurat_obj)
144 | #' plot(spatial_result$graph)
145 | #' }
146 | #'
147 | #' @export
148 | 
149 | build_spatial_network <- function(obj, reduction = "spatial", dims = 1:2, graph_name = "spatial_knn", DO_prune = T, rad_pruning = 30, ...) {
150 |   compute_neighborhood(obj = obj, reduction = reduction, dims = dims, DO_prune = DO_prune, rad_pruning = rad_pruning, ...)
151 | }
152 | 
153 | #' Build Transcriptomics Network
154 | #'
155 | #' A wrapper function around `compute_neighborhood` to create a transcriptomics KNN graph
156 | #' using transcriptomics reduction data.
157 | #'
158 | #' @param obj An object (e.g., Seurat object) containing transcriptomics data.
159 | #' @param ... Additional parameters passed to `compute_neighborhood`.
160 | #'
161 | #' @return A list containing the following elements:
162 | #' \describe{
163 | #'   \item{`graph`}{An igraph object representing the KNN graph.}
164 | #'   \item{`nn_idx`}{A matrix of nearest-neighbor indices.}
165 | #'   \item{`nn_dist`}{A matrix of distances to the nearest neighbors.}
166 | #'   \item{`cell_id`}{A vector of cell IDs corresponding to the rows of `nn_idx` and `nn_dist`.}
167 | #' }
168 | #'
169 | #' @examples
170 | #' \dontrun{
171 | #' # Example usage with a Seurat object containing transcriptomics data
172 | #' transcriptomics_result <- build_transcriptomics_network(obj = seurat_obj, k_knn = 15)
173 | #' plot(spatial_result$graph)
174 | #' }
175 | #' @export
176 | build_transcriptomics_network <- function(obj, reduction = "pca", graph_name = "transcriptomics_knn", ...){
177 |   compute_neighborhood(obj = obj, reduction = reduction, graph_name = graph_name, ...)
178 | }
179 | 
180 | 


--------------------------------------------------------------------------------
/R/neighborhood_analysis.R:
--------------------------------------------------------------------------------
  1 | #' Add RCTD Results to Neighborhood Graph
  2 | #'
  3 | #' This function integrates RCTD (Robust Cell Type Decomposition) results into a neighborhood graph.
  4 | #' It ensures that the RCTD results align with the graph's cells and formats them for downstream use.
  5 | #'
  6 | #' @param graph A list representing a neighborhood graph, typically created by \code{compute_neighborhood}.
  7 | #'   Must contain \code{nn_idx} (nearest neighbor indices) and \code{cell_id} (cell identifiers).
  8 | #' @param rctd An RCTD object containing cell decomposition results. The results are expected in
  9 | #'   \code{rctd@results$results_df}.
 10 | #'
 11 | #' @return A list containing the original neighborhood graph and a matrix for each column in
 12 | #'   \code{rctd@results$results_df}, where rows correspond to cells and columns correspond to neighbors.
 13 | #'
 14 | #' @details
 15 | #' The function extracts the decomposition results from the RCTD object and maps them to the
 16 | #' neighborhood graph structure, ensuring consistency in cell alignment. It then organizes
 17 | #' the results into matrices where each matrix corresponds to a specific result column from the RCTD results.
 18 | #'
 19 | #' @examples
 20 | #' \dontrun{
 21 | #' # Assuming `graph` is a neighborhood graph and `rctd` is an RCTD object
 22 | #' updated_graph <- add_rctd_res_to_neighborhood(graph = graph, rctd = rctd)
 23 | #' }
 24 | #'
 25 | #' @export
 26 | 
 27 | add_rctd_to_neighborhood <- function(
 28 |     graph,
 29 |     rctd
 30 | ){
 31 |   # Ensure RCTD results align with graph cells
 32 |   results_df <- rctd@results$results_df[graph$cell_id, ] # Match cells
 33 |   rownames(results_df) <- graph$cell_id
 34 | 
 35 |   # Reshape results to align with nearest neighbor indices
 36 |   results_df <- results_df[as.vector(graph$nn_idx), ]
 37 |   num_neighbors <- ncol(graph$nn_idx)
 38 | 
 39 |   # Prepare results as a list of matrices
 40 |   result_list <- lapply(colnames(results_df), function(column_name) {
 41 |     matrix(results_df[[column_name]], ncol = num_neighbors)
 42 |   })
 43 | 
 44 |   # Combine graph and results
 45 |   names(result_list) <- colnames(results_df)
 46 |   return(c(graph, result_list))
 47 | }
 48 | 
 49 | 
 50 | #' Add Neighborhood Infiltration Metrics to Spatial Neighborhood
 51 | #'
 52 | #' This function calculates various infiltration metrics for cell neighborhoods
 53 | #' within a spatial context. It computes metrics related to the number and type
 54 | #' of neighbors annotated to a cell's first or second type, as well as infiltration
 55 | #' metrics based on cell classes.
 56 | #'
 57 | #' @param neighborhood A list containing spatial neighborhood data. This should
 58 | #'   include matrices for `first_type`, `second_type`, `nn.idx`, and `spot_class`.
 59 | #'
 60 | #' @return A modified version of the input `neighborhood` list with additional
 61 | #'   metrics. The added fields include:
 62 | #'   \item{`total_neighbors_N`}{Total number of neighbors for each cell.}
 63 | #'   \item{`total_singlets_neighbors_N`}{Number of singlet neighbors for each cell.}
 64 | #'   \item{`second_type_neighbors`}{Indices of neighbors annotated to the second cell type.}
 65 | #'   \item{`second_type_neighbors_N`}{Count of neighbors annotated to the second cell type.}
 66 | #'   \item{`second_type_singlets_neighbors`}{Indices of singlet neighbors annotated to the second cell type.}
 67 | #'   \item{`second_type_singlets_neighbors_N`}{Count of singlet neighbors annotated to the second cell type.}
 68 | #'   \item{`first_type_neighbors`}{Indices of neighbors annotated to the first cell type.}
 69 | #'   \item{`first_type_neighbors_N`}{Count of neighbors annotated to the first cell type.}
 70 | #'   \item{`first_type_singlets_neighbors`}{Indices of singlet neighbors annotated to the first cell type.}
 71 | #'   \item{`first_type_singlets_neighbors_N`}{Count of singlet neighbors annotated to the first cell type.}
 72 | #'   \item{`second_type_class_neighbors`}{Indices of neighbors annotated to the second cell type class.}
 73 | #'   \item{`second_type_class_neighbors_N`}{Count of neighbors annotated to the second cell type class.}
 74 | #'   \item{`first_type_class_neighbors`}{Indices of neighbors annotated to the first cell type class.}
 75 | #'   \item{`first_type_class_neighbors_N`}{Count of neighbors annotated to the first cell type class.}
 76 | #'   \item{`same_second_type_neighbors`}{Indices of neighbors annotated to the same second cell type.}
 77 | #'   \item{`same_second_type_neighbors_N`}{Count of neighbors annotated to the same second cell type.}
 78 | #'
 79 | #'
 80 | #' @export
 81 | #'
 82 | add_infiltration_metrics_to_neighborhood <- function(neighborhood) {
 83 | 
 84 | 
 85 |   # Compute the total number of neighbors
 86 |   neighborhood$total_neighbors_N <- rowSums(!is.na(neighborhood$nn_idx)) - 1
 87 |   neighborhood$annotated_neighbors_N <- rowSums(!is.na(neighborhood$first_type)) - 1
 88 | 
 89 |   # Compute total number of singlet neighbors
 90 |   neighborhood$total_singlets_neighbors_N <- rowSums(neighborhood$spot_class == "singlet", na.rm = TRUE)
 91 | 
 92 |   ## Generate vector of annotation of cell's neighbors ignoring reject cells
 93 |   neighborhood$first_type_no_reject <- neighborhood$first_type
 94 |   neighborhood$first_type_no_reject[neighborhood$spot_class == "reject"] <- "NA"
 95 | 
 96 |   # Create first_type_fist_element_second_type (with first element as second_type)
 97 |   # to facilitate identification of the number of the second cell type in the neighborhood
 98 |   neighborhood$first_type_fist_element_second_type <- cbind(neighborhood$second_type[, 1], neighborhood$first_type[, -1])
 99 |   neighborhood$first_type_fist_element_second_type_no_reject <- cbind(neighborhood$second_type[, 1], neighborhood$first_type_no_reject[, -1])
100 | 
101 | 
102 |   # Helper function to calculate neighbors for a given type
103 |   get_neighbors <- function(neighborhood_col) {
104 |     apply(neighborhood_col,
105 |           1,
106 |           FUN = function(x) {
107 |             which(x[-1] == x[1]) + 1
108 |           })
109 |   }
110 | 
111 |   # SECOND type neighbors: indices where neighbors are annotated to the residual second type
112 |   neighborhood$second_type_neighbors <- get_neighbors(neighborhood$first_type_fist_element_second_type)
113 |   neighborhood$second_type_neighbors_N <- sapply(neighborhood$second_type_neighbors, length)
114 | 
115 |   neighborhood$second_type_neighbors_no_reject <- get_neighbors(neighborhood$first_type_fist_element_second_type_no_reject)
116 |   neighborhood$second_type_neighbors_no_reject_N <- sapply(neighborhood$second_type_neighbors_no_reject, length)
117 | 
118 |   # SECOND type singlet neighbors: neighbors annotated to the residual cell type and are singlets
119 |   neighborhood$second_type_singlets_neighbors <- sapply(
120 |     seq(nrow(neighborhood$first_type_fist_element_second_type)),
121 |     FUN = function(i){
122 |       x = neighborhood$first_type_fist_element_second_type[i,]
123 |       y = neighborhood$spot_class[i,]
124 |       which((x[-1] == x[1]) & (y[-1] == "singlet")) + 1
125 |     })
126 | 
127 |   neighborhood$second_type_singlets_neighbors_N <- sapply(neighborhood$second_type_singlets_neighbors, length)
128 | 
129 |   # Compute infiltration of second type CLASS in the neighborhood
130 |   neighborhood$first_type_class_fist_element_second_type_class <- cbind(neighborhood$second_type_class[, 1], neighborhood$first_type_class[, -1])
131 |   neighborhood$second_type_class_neighbors <- get_neighbors(neighborhood$first_type_class_fist_element_second_type_class)
132 |   neighborhood$second_type_class_neighbors_N <- sapply(neighborhood$second_type_class_neighbors, length)
133 | 
134 |   # FIRST type neighbors: indices where neighbors are annotated to the main first type
135 |   neighborhood$first_type_neighbors <- get_neighbors(neighborhood$first_type)
136 |   neighborhood$first_type_neighbors_N <- sapply(neighborhood$first_type_neighbors, length)
137 | 
138 |   # FIRST type singlet neighbors: neighbors annotated to the main cell type and are singlets
139 |   neighborhood$first_type_singlets_neighbors <- sapply(
140 |     seq(nrow(neighborhood$first_type)),
141 |     FUN = function(i){
142 |       x = neighborhood$first_type[i,]
143 |       y = neighborhood$spot_class[i,]
144 |       which((x[-1] == x[1]) & (y[-1] == "singlet")) + 1
145 |     })
146 |   neighborhood$first_type_singlets_neighbors_N <- sapply(neighborhood$first_type_singlets_neighbors, length)
147 | 
148 |   # Compute infiltration of first type CLASS in the neighborhood
149 |   neighborhood$first_type_class_neighbors <- get_neighbors(neighborhood$first_type_class)
150 |   neighborhood$first_type_class_neighbors_N <- sapply(neighborhood$first_type_class_neighbors, length)
151 | 
152 |   # SAME SECOND type neighbors: indices where neighbors are annotated to the same second type
153 |   neighborhood$same_second_type_neighbors   <- get_neighbors(neighborhood$second_type)
154 |   neighborhood$same_second_type_neighbors_N <- sapply(neighborhood$same_second_type_neighbors, length)
155 | 
156 |   return(neighborhood)
157 | }
158 | 
159 | 
160 | #' Add Neighborhood Weight Composition
161 | #'
162 | #' This function computes the spatial composition of a neighborhood by calculating the weights of different cell types.
163 | #' It adds the computed composition as a new column (`neighborhood_weight_composition`) to the input `neighborhood` object.
164 | #'
165 | #' @param neighborhood A list containing information about the neighborhood, including `nn_idx` (indices of neighbors),
166 | #' `first_type` and `second_type` (cell types of the neighbors), and `weight_first_type` and `weight_second_type` (weights of the corresponding cell types).
167 | #'
168 | #' @return A modified `neighborhood` object with an added column `neighborhood_weight_composition`, which is a matrix where each row corresponds
169 | #' to the vector of cell-type weights for a neighborhood.
170 | #'
171 | #' @importFrom dplyr %>%
172 | #' @export
173 | #'
174 | add_neighborhood_weight_composition <- function(
175 |     neighborhood
176 | ){
177 | 
178 |   # Neighborhood composition
179 |   compute_neighborhood_weight_composition <- function(
180 |     nbhd
181 |   ){
182 |     composition <- rep(0, length(cell_types))
183 |     names(composition) <- cell_types
184 |     nbhd <- nbhd[!is.na(neighborhood$first_type[nbhd, 1])] # exclude cells not annotated by RCTD
185 |     nbhd <- nbhd[-1]
186 | 
187 |     if(length(nbhd) == 0)
188 |       return(composition)
189 | 
190 |     for(i in nbhd){
191 |       ft <- neighborhood$first_type[i, 1]
192 |       st <- neighborhood$second_type[i, 1]
193 |       w1 <- neighborhood$weight_first_type[i, 1]
194 |       w2 <- neighborhood$weight_second_type[i, 1]
195 |       if(!is.na(st)){
196 |         composition[ft] <- composition[ft] + w1
197 |         composition[st] <- composition[st] + w2
198 |       } else {
199 |         composition[ft] <- composition[ft] + 1
200 |       }
201 |     }
202 |     #composition <- unname(composition)
203 |     return(composition/sum(composition))
204 |   }
205 | 
206 |   # Extract unique cell types
207 |   cell_types <- neighborhood$first_type %>% as.vector() %>% unique() %>% sort()
208 |   # Store `neighborhood_weight_composition` -- the global weight composition of the neighborhood
209 |   neighborhood$neighborhood_weight_composition <- apply(neighborhood$nn_idx, 1, compute_neighborhood_weight_composition, simplify = T) %>% t()
210 | 
211 |   return(neighborhood)
212 | }
213 | 
214 | 
215 | #' Add Neighborhood Weights for Cell Types
216 | #'
217 | #' This function calculates the weights of the first and second cell types for each neighborhood based on the neighborhood composition.
218 | #' It assigns a weight of 0 to the second cell type if it is `NA`.
219 | #'
220 | #' @param neighborhood A list containing the neighborhood data, including:
221 | #'   - `neighborhood_weight_composition`: A matrix where each row corresponds to a neighborhood, and columns represent cell types with their respective weights.
222 | #'   - `first_type`: A matrix or vector indicating the first cell type for each neighborhood.
223 | #'   - `second_type`: A matrix or vector indicating the second cell type for each neighborhood (can contain `NA` values).
224 | #'
225 | #' @return The input `neighborhood` object with an additional element `neighborhood_weights`,
226 | #'   a matrix of dimensions `nrow(neighborhood$neighborhood_weight_composition) x 2`.
227 | #'   Each row contains the weights of the `first_type` and `second_type` for the corresponding neighborhood.
228 | #'
229 | #' @export
230 | add_cell_types_neighborhood_weights <- function(
231 |     neighborhood
232 | ){
233 | 
234 |   neighborhood_weights <- sapply(
235 |     1:nrow(neighborhood$neighborhood_weight_composition),
236 |     FUN = function(i){
237 |       ft <- neighborhood$first_type[i,1]
238 |       st <- neighborhood$second_type[i,1]
239 | 
240 |       if(!is.na(st)){
241 |         neighborhood_weights <- c(neighborhood$neighborhood_weight_composition[i,ft], neighborhood$neighborhood_weight_composition[i, st])
242 |       } else{
243 |         if(!is.na(ft)){
244 |           neighborhood_weights <- c(neighborhood$neighborhood_weight_composition[i,ft], 0)
245 |         } else {
246 |           neighborhood_weights <- c(NA, NA)
247 |         }
248 |       }
249 |     }
250 |   )
251 |   neighborhood_weights <- neighborhood_weights %>% t()
252 |   colnames(neighborhood_weights) <- c("first_type", "second_type")
253 | 
254 |   neighborhood$neighborhood_weights_first_type  <- neighborhood_weights[, "first_type"]
255 |   neighborhood$neighborhood_weights_second_type <- neighborhood_weights[, "second_type"]
256 |   return(neighborhood)
257 | }
258 | 
259 | 
260 | #' Add Weights for First Type of Second Type Neighbors
261 | #'
262 | #' This function calculates the weight of the first cell type (`w1`) for each second-type neighbor in a neighborhood.
263 | #' It adds this information to the `neighborhood` object as a new element.
264 | #'
265 | #' @param neighborhood A list containing the neighborhood data, including:
266 | #'   - `second_type_neighbors`: A list where each element contains indices of neighbors corresponding to the second type.
267 | #'   - `weight_first_type`: A matrix where rows correspond to neighborhoods, and columns represent weights of the first type for each neighbor.
268 | #'
269 | #' @return The input `neighborhood` object with an additional element,
270 | #'   - `w1_second_type_in_neighborhood`, which is a vector of weights for the first cell type
271 | #'   corresponding to each second-type neighbor.
272 | #'   - `sum_w1_second_type_in_neighborhood` : A vector with sum of `w1_second_type_in_neighborhood`
273 | #'
274 | #' @export
275 | 
276 | add_individual_w1_of_second_type_in_neighborhood <- function(
277 |     neighborhood
278 | ){
279 |   neighborhood$w1_second_type_in_neighborhood <- sapply(1:length(neighborhood$second_type_neighbors), function(i){
280 |     neighborhood$weight_first_type[i, neighborhood$second_type_neighbors[[i]]]
281 |   })
282 | 
283 |   neighborhood$sum_w1_second_type_in_neighborhood <- lapply(neighborhood$w1_second_type_in_neighborhood, sum) %>% unlist()
284 | 
285 |   return(neighborhood)
286 | }
287 | 
288 | # Functions to retrieve from `xenium_analysis_pipeline` project
289 | 
290 | # add scores: max w1 as second type
291 | #' Add Weights of Second Cell Type's Second Annotation in the Neighborhood
292 | #'
293 | #' This function calculates the weights associated with the second type's second annotation
294 | #' (w2) for neighbors of the same second cell type within a spatial neighborhood.
295 | #'
296 | #' @param neighborhood A list containing spatial neighborhood data. This list must include:
297 | #'   - `same_second_type_neighbors`: Indices of neighbors annotated to the same second cell type.
298 | #'   - `weight_second_type`: A matrix of weights for the second cell type annotations.
299 | #'
300 | #' @return The modified `neighborhood` list with an added field:
301 | #'   - `w2_second_type_in_neighborhood`: A list where each entry contains the `w2` weights
302 | #'     of neighbors annotated to the same second cell type for each cell.
303 | #'   - `sum_w2_second_type_in_neighborhood` : A vector with sum of `w2_second_type_in_neighborhood`
304 | #'
305 | #' @export
306 | add_individual_w2_of_second_type_in_neighborhood <- function(
307 |     neighborhood
308 | ){
309 |   neighborhood$w2_second_type_in_neighborhood <- sapply(1:length(neighborhood$same_second_type_neighbors), function(i){
310 |     neighborhood$weight_second_type[i, neighborhood$same_second_type_neighbors[[i]]]
311 |   })
312 | 
313 |   neighborhood$sum_w2_second_type_in_neighborhood <- lapply(neighborhood$w2_second_type_in_neighborhood, sum) %>% unlist()
314 | 
315 |   return(neighborhood)
316 | }
317 | 
318 | #' Add Maximum Weight of Spilling Cell Type in the Neighborhood
319 | #'
320 | #' This function computes the maximum weight of a spilling cell type
321 | #' (i.e., the largest value among `w1` and `w2` weights for neighbors of the same second cell type)
322 | #' within a spatial neighborhood.
323 | #'
324 | #' @param neighborhood A list containing spatial neighborhood data. This list must include:
325 | #'   - `same_second_type_neighbors`: Indices of neighbors annotated to the same second cell type.
326 | #'   - `w1_second_type_in_neighborhood`: A list of `w1` weights for the same second cell type neighbors.
327 | #'   - `w2_second_type_in_neighborhood`: A list of `w2` weights for the same second cell type neighbors.
328 | #'
329 | #' @return The modified `neighborhood` list with an added field:
330 | #'   - `max_weight_of_spilling_type_in_neighborhood`: A numeric vector containing the maximum
331 | #'     weight of the spilling cell type for each cell in the neighborhood.
332 | #'
333 | #' @export
334 | add_max_weight_of_spilling_type_in_neighborhood <- function(
335 |     neighborhood
336 | ){
337 |   N_cells <- length(neighborhood$same_second_type_neighbors)
338 |   neighborhood$max_weight_of_spilling_type_in_neighborhood <- sapply(1:N_cells, function(i){
339 |     max(0, neighborhood$w1_second_type_in_neighborhood[[i]], neighborhood$w2_second_type_in_neighborhood[[i]])
340 |   })
341 |   return(neighborhood)
342 | }
343 | 
344 | 
345 | #' Add Neighborhood Weight Composition of Spilling Cell Type
346 | #'
347 | #' This function computes and adds several metrics related to the weight composition of spilling
348 | #' cell types within a spatial neighborhood. It integrates multiple helper functions to calculate
349 | #' neighborhood composition, cell type weights, and the maximum weight of spilling cell types.
350 | #'
351 | #' @param neighborhood A list containing spatial neighborhood data. The `neighborhood` list
352 | #'   should include the required fields for the following computations:
353 | #'   - `add_neighborhood_weight_composition`: Calculates neighborhood cell-type weight composition.
354 | #'   - `add_cell_types_neighborhood_weights`: Extracts weights for the first and second cell types.
355 | #'   - `add_individual_w1_of_second_type_in_neighborhood`: Extracts individual `w1` weights of
356 | #'     the second type neighbors.
357 | #'   - `add_individual_w2_of_second_type_in_neighborhood`: Extracts individual `w2` weights of
358 | #'     the second type neighbors.
359 | #'   - `add_max_weight_of_spilling_type_in_neighborhood`: Computes the maximum weight of spilling
360 | #'     cell types in the neighborhood.
361 | #'
362 | #' @return The modified `neighborhood` list, augmented with additional metrics:
363 | #'   - `neighborhood_weight_composition`: Cell-type weight composition in the neighborhood.
364 | #'   - `neighborhood_weights`: Weights of the first and second cell types.
365 | #'   - `w1_second_type_in_neighborhood`: List of `w1` weights for the second cell type neighbors.
366 | #'   - `w2_second_type_in_neighborhood`: List of `w2` weights for the second cell type neighbors.
367 | #'   - `max_weight_of_spilling_type_in_neighborhood`: Maximum weight of spilling cell types in
368 | #'     the neighborhood.
369 | #'
370 | #' @seealso
371 | #' - \code{\link{add_neighborhood_weight_composition}}
372 | #' - \code{\link{add_cell_types_neighborhood_weights}}
373 | #' - \code{\link{add_individual_w1_of_second_type_in_neighborhood}}
374 | #' - \code{\link{add_individual_w2_of_second_type_in_neighborhood}}
375 | #' - \code{\link{add_max_weight_of_spilling_type_in_neighborhood}}
376 | #'
377 | #' @export
378 | 
379 | add_neigborhood_weight_composition_of_spilling_cell_type <- function(
380 |     neighborhood
381 | ){
382 |   neighborhood <- add_neighborhood_weight_composition(neighborhood = neighborhood)
383 |   neighborhood <- add_cell_types_neighborhood_weights(neighborhood = neighborhood)
384 |   neighborhood <- add_individual_w1_of_second_type_in_neighborhood(neighborhood = neighborhood)
385 |   neighborhood <- add_individual_w2_of_second_type_in_neighborhood(neighborhood = neighborhood)
386 |   neighborhood <- add_max_weight_of_spilling_type_in_neighborhood(neighborhood = neighborhood)
387 |   return(neighborhood)
388 | }
389 | 
390 | #' Add Neighborhood nCount Metrics for Spilling Cell Types
391 | #'
392 | #' This function calculates the sum of `nCount` values for cells in a spatial neighborhood,
393 | #' as well as the sum of `nCount` values specifically for cells of the spilling cell type
394 | #' within each neighborhood.
395 | #'
396 | #' @param neighborhood A list containing spatial neighborhood data. The list should include:
397 | #'   \itemize{
398 | #'     \item{\code{nn_idx}}{A matrix of indices representing neighborhood relationships. The first column represents the focal cell, and subsequent columns represent its neighbors.}
399 | #'     \item{\code{second_type_neighbors}}{A list where each element contains indices of neighbors annotated to the second cell type for the corresponding focal cell.}
400 | #'     \item{\code{nCount}}{A list of numeric vectors where each vector contains `nCount` values for cells in the dataset.}
401 | #'   }
402 | #'
403 | #' @return A modified version of the input \code{neighborhood} list with the following additional fields:
404 | #'   \itemize{
405 | #'     \item{\code{sum_nCount_neighborhood}}{A numeric vector containing the sum of `nCount` values for all neighbors of each focal cell.}
406 | #'     \item{\code{sum_nCount_neighborhood_spilling_type}}{A numeric vector containing the sum of `nCount` values for neighbors of each focal cell that belong to the spilling cell type.}
407 | #'   }
408 | #'
409 | #' @export
410 | 
411 | add_neigborhood_nCount_of_spilling_cell_type <- function(
412 |     neighborhood
413 | ){
414 |   N_cells <- length(neighborhood$second_type_neighbors)
415 |   neighborhood$sum_nCount_neighborhood <- sapply(1:N_cells, function(i){
416 |     sum(neighborhood$nCount[i, -1])
417 |   })
418 | 
419 |   neighborhood$sum_nCount_neighborhood_spilling_type <- sapply(1:N_cells, function(i){
420 |     sum(neighborhood$nCount[i, neighborhood$second_type_neighbors[[i]]])
421 |   })
422 | 
423 |   return(neighborhood)
424 | }
425 | 
426 | 
427 | #' Add Spatial Metrics to Neighborhood
428 | #'
429 | #' This function enhances a spatial neighborhood structure by integrating RCTD results and adding various spatial metrics.
430 | #' It incorporates infiltration metrics, weight compositions, and neighborhood count metrics, facilitating in-depth spatial analysis.
431 | #'
432 | #' @param spatial_neighborhood A list representing the spatial neighborhood data. It should include fields required for neighborhood and infiltration analysis.
433 | #' @param rctd An RCTD object containing cell type decomposition results.
434 | #'
435 | #' @return A modified spatial neighborhood list enriched with additional metrics, including:
436 | #' \itemize{
437 | #'   \item Infiltration metrics.
438 | #'   \item Weight composition of spilling cell types.
439 | #'   \item Sum of \code{nCount} values for neighborhood and spilling cell types.
440 | #' }
441 | #'
442 | #' @details This function sequentially:
443 | #' \enumerate{
444 | #'   \item Integrates RCTD results into the spatial neighborhood.
445 | #'   \item Calculates infiltration metrics based on cell type and class.
446 | #'   \item Computes the weight composition for spilling cell types.
447 | #'   \item Adds neighborhood \code{nCount} metrics for spilling cell types.
448 | #' }
449 | #'
450 | #' @seealso
451 | #' \itemize{
452 | #'   \item \code{\link{add_rctd_to_neighborhood}}: Integrates RCTD results into the neighborhood.
453 | #'   \item \code{\link{add_infiltration_metrics_to_neighborhood}}: Computes infiltration metrics.
454 | #'   \item \code{\link{add_neigborhood_weight_composition_of_spilling_cell_type}}: Adds weight composition for spilling cell types.
455 | #'   \item \code{\link{add_neigborhood_nCount_of_spilling_cell_type}}: Computes neighborhood \code{nCount} metrics.
456 | #' }
457 | #'
458 | #' @export
459 | #'
460 | 
461 | 
462 | add_spatial_metric <- function(spatial_neighborhood, rctd){
463 | 
464 |   spatial_neighborhood <- add_rctd_to_neighborhood(graph = spatial_neighborhood, rctd = rctd)
465 |   spatial_neighborhood <- add_infiltration_metrics_to_neighborhood(neighborhood = spatial_neighborhood)
466 |   spatial_neighborhood <- add_neigborhood_weight_composition_of_spilling_cell_type(neighborhood = spatial_neighborhood)
467 |   spatial_neighborhood <- add_neigborhood_nCount_of_spilling_cell_type(neighborhood = spatial_neighborhood)
468 |   return(spatial_neighborhood)
469 | }
470 | 
471 | 
472 | #' Annotate Transcriptomics Neighborhood with Dominant Cell Types
473 | #'
474 | #' This function annotates each neighborhood in a transcriptomics dataset by determining 
475 | #' the most frequent cell type (or class) for each neighborhood category (`first_type` and `second_type`).
476 | #'
477 | #' @param neighborhood A data frame or list containing neighborhood-level transcriptomics data. 
478 | #'        It must contain `first_type`, `second_type`, and optionally `first_type_class` and `second_type_class`.
479 | #'
480 | #' @return The updated `neighborhood` object with the following added columns:
481 | #'   \itemize{
482 | #'     \item `first_type_neighborhood`: The most frequent first type in the neighborhood.
483 | #'     \item `first_type_class_neighborhood`: The most frequent first type class in the neighborhood (if available).
484 | #'     \item `second_type_neighborhood`: The most frequent second type in the neighborhood.
485 | #'     \item `second_type_class_neighborhood`: The most frequent second type class in the neighborhood (if available).
486 | #'   }
487 | #'
488 | #' @details The function applies a helper function `find_most_frequent()` that determines 
489 | #' the most frequently occurring value in a given set of neighborhood annotations. If the `first_type_class`
490 | #' or `second_type_class` column is missing, a warning is issued, and the respective computation is skipped.
491 | #'
492 | #' @export
493 | #'
494 | add_annotation_from_neighbors <- function(neighborhood) {
495 |   
496 |   # Helper function to determine the most frequent annotation in a neighborhood
497 |   find_most_frequent <- function(values) {
498 |     value_counts <- values[-1]
499 |     most_frequent <- table(value_counts, useNA = "ifany") %>% which.max %>% names %>% .[1]
500 |     return(most_frequent)
501 |   }
502 |   
503 |   # Annotate first type neighborhood
504 |   neighborhood$first_type_neighborhood <- apply(
505 |     neighborhood$first_type,
506 |     1, 
507 |     find_most_frequent
508 |   )
509 |   
510 |   neighborhood$first_type_neighborhood_agreement <- neighborhood$first_type[,1] == neighborhood$first_type_neighborhood
511 |   
512 |   # Annotate first type class neighborhood if available
513 |   if ("first_type_class" %in% names(neighborhood)) {
514 |     neighborhood$first_type_class_neighborhood <- apply(
515 |       neighborhood$first_type_class,
516 |       1, 
517 |       find_most_frequent
518 |     )
519 |     
520 |     neighborhood$first_type_class_neighborhood_agreement <- neighborhood$first_type_class[,1] == neighborhood$first_type_class_neighborhood
521 |     
522 |   } else {
523 |     warning("`first_type_class` does not exist in the neighborhood object, `first_type_class_neighborhood` was not computed")
524 |   }
525 |   
526 |   # Annotate second type neighborhood
527 |   neighborhood$second_type_neighborhood <- apply(
528 |     neighborhood$second_type,
529 |     1, 
530 |     find_most_frequent
531 |   )
532 |   
533 |   neighborhood$second_type_neighborhood_agreement <- neighborhood$second_type[,1] == neighborhood$second_type_neighborhood
534 |   
535 |   
536 |   # Annotate second type class neighborhood if available
537 |   if ("second_type_class" %in% names(neighborhood)) {
538 |     neighborhood$second_type_class_neighborhood <- apply(
539 |       neighborhood$second_type_class,
540 |       1, 
541 |       find_most_frequent
542 |     )
543 |     
544 |     neighborhood$second_type_class_neighborhood_agreement <- neighborhood$second_type_class[,1] == neighborhood$second_type_class_neighborhood
545 |     
546 |   } else {
547 |     warning("`second_type_class` does not exist in the neighborhood object, `second_type_class_neighborhood` was not computed")
548 |   }
549 |   
550 |   return(neighborhood)
551 | }
552 | 
553 | 
554 | #' Annotate Neighborhood with Certainty Scores
555 | #'
556 | #' This function calculates a certainty score for each neighborhood based on 
557 | #' the entropy of its cell type composition. The certainty score ranges from 0 
558 | #' (high entropy, diverse cell types) to 1 (low entropy, dominated by a single cell type).
559 | #'
560 | #' @param neighborhood 
561 | #'   A data frame or list containing neighborhood-level cell type data. It must 
562 | #'   contain `first_type` and `second_type`, and optionally 
563 | #'   `first_type_class` and `second_type_class`.
564 | #'
565 | #' @return 
566 | #'   The updated `neighborhood` object with the following additional columns:
567 | #'   \itemize{
568 | #'     \item `first_type_neighborhood_certainty`: Certainty score for the first type.
569 | #'     \item `first_type_class_neighborhood_certainty`: Certainty score for the first type class (if available).
570 | #'     \item `second_type_neighborhood_certainty`: Certainty score for the second type.
571 | #'     \item `second_type_class_neighborhood_certainty`: Certainty score for the second type class (if available).
572 | #'   }
573 | #'
574 | #' @details 
575 | #'   Certainty is computed as:
576 | #'   \deqn{1 - \frac{H}{\log(N)}}{
577 | #'   1 - entropy / log(N)
578 | #'   }
579 | #'   where \eqn{H} is the entropy of the neighborhood's cell type distribution, 
580 | #'   and \eqn{N} is the total number of unique cell types. A higher score 
581 | #'   indicates a more homogenous neighborhood.
582 | #'
583 | #'   If `first_type_class` or `second_type_class` is missing, a warning is issued 
584 | #'   and the respective certainty score is not computed.
585 | #'
586 | #' @importFrom entropy entropy
587 | #' @export
588 | #'
589 | 
590 | add_neighborhood_annotation_certainty <- function(
591 |     neighborhood
592 | ) {
593 |   
594 |   # Helper function to compute normalized certainty score based on entropy
595 |   neighborhood_normalized_certainty <- function(x) {
596 |     y <- x[-1]
597 |     fr <- table(y) %>% as.numeric()
598 |     res <- 1 - entropy::entropy(fr) / log(length(y)) 
599 |     # More accurate denominator: log(min(length(y), N_cell_types))
600 |     return(res)
601 |   }
602 |   
603 |   # Compute certainty scores for first type neighborhoods
604 |   neighborhood$first_type_neighborhood_certainty <- apply(
605 |     neighborhood$first_type,
606 |     1, 
607 |     neighborhood_normalized_certainty
608 |   )
609 |   
610 |   # Compute certainty scores for first type class neighborhoods if available
611 |   if ("first_type_class" %in% names(neighborhood)) {
612 |     neighborhood$first_type_class_neighborhood_certainty <- apply(
613 |       neighborhood$first_type_class,
614 |       1, 
615 |       neighborhood_normalized_certainty
616 |     )
617 |   } else {
618 |     warning("`first_type_class` does not exist in the neighborhood object, `first_type_class_neighborhood_certainty` was not computed")
619 |   }
620 |   
621 |   # Compute certainty scores for second type neighborhoods
622 |   neighborhood$second_type_neighborhood_certainty <- apply(
623 |     neighborhood$second_type,
624 |     1, 
625 |     FUN = neighborhood_normalized_certainty
626 |   )
627 |   
628 |   # Compute certainty scores for second type class neighborhoods if available
629 |   if ("second_type_class" %in% names(neighborhood)) {
630 |     neighborhood$second_type_class_neighborhood_certainty <- apply(
631 |       neighborhood$second_type_class,
632 |       1, 
633 |       FUN = neighborhood_normalized_certainty
634 |     )
635 |   } else {
636 |     warning("`second_type_class` does not exist in the neighborhood object, `second_type_class_neighborhood_certainty` was not computed")
637 |   }
638 |   
639 |   return(neighborhood)
640 | }
641 | 
642 | 
643 | 
644 | #' Add Transcriptomics-Based Metrics to Neighborhood Data
645 | #'
646 | #' This function enriches a transcriptomics neighborhood dataset by incorporating 
647 | #' additional metrics from an `rctd` object, propagating annotations from neighboring cells, 
648 | #' and computing neighborhood annotation certainty scores.
649 | #'
650 | #' @param transcriptomics_neighborhood 
651 | #'   A data frame or list representing the transcriptomics neighborhood, 
652 | #'   containing spatial relationships between cells.
653 | #' @param rctd 
654 | #'   An object containing **RCTD** (Robust Cell Type Decomposition) results, 
655 | #'   which provide cell type compositions inferred from spatial transcriptomics data.
656 | #'
657 | #' @return 
658 | #'   The updated `transcriptomics_neighborhood` object with additional transcriptomics-based metrics:
659 | #'   \itemize{
660 | #'     \item **RCTD-derived cell type information** integrated into the neighborhood.
661 | #'     \item **Propagated annotations from neighboring cells.**
662 | #'     \item **Certainty scores** quantifying the confidence of neighborhood-level annotations.
663 | #'   }
664 | #'
665 | #' @details 
666 | #'   The function performs the following steps:
667 | #'   \enumerate{
668 | #'     \item **Integrates RCTD data** into the transcriptomics neighborhood using `add_rctd_to_neighborhood()`.
669 | #'     \item **Propagates annotations from neighboring cells** using `add_annotation_from_neighbors()`.
670 | #'     \item **Computes neighborhood annotation certainty** using `add_neighborhood_annotation_certainty()`.
671 | #'   }
672 | #'
673 | #' @seealso 
674 | #'   \code{\link{add_rctd_to_neighborhood}}, 
675 | #'   \code{\link{add_annotation_from_neighbors}}, 
676 | #'   \code{\link{add_neighborhood_annotation_certainty}}
677 | #'
678 | #' @export
679 | #'
680 | 
681 | add_transcriptomics_metric <- function(transcriptomics_neighborhood, rctd) {
682 |   
683 |   # Add RCTD data to the transcriptomics neighborhood
684 |   transcriptomics_neighborhood <- add_rctd_to_neighborhood(
685 |     graph = transcriptomics_neighborhood, 
686 |     rctd = rctd
687 |   )
688 |   
689 |   # Propagate annotations from neighboring cells
690 |   transcriptomics_neighborhood <- add_annotation_from_neighbors(
691 |     neighborhood = transcriptomics_neighborhood
692 |   )
693 |   
694 |   # Compute neighborhood annotation certainty
695 |   transcriptomics_neighborhood <- add_neighborhood_annotation_certainty(
696 |     neighborhood = transcriptomics_neighborhood
697 |   )
698 |   
699 |   return(transcriptomics_neighborhood)
700 | }
701 | 
702 | 
703 | 
704 | #' Convert Neighborhood Analysis to Metadata
705 | #'
706 | #' This function extracts specific elements from a neighborhood list and
707 | #' converts them into a data frame format suitable for further analysis or
708 | #' integration with metadata. It processes variables that are either vectors
709 | #' or matrices, retaining relevant columns and applying necessary transformations.
710 | #'
711 | #' @param neighborhood A list containing neighborhood data, with various
712 | #'   variables, including vectors and matrices.
713 | #'
714 | #' @return A data frame with the following columns:
715 | #'   - Variables from the input `neighborhood` list that are either vectors
716 | #'     or matrices, with matrices reduced to their first column.
717 | #'   - The data frame is indexed by the `cell_id` column from the input list.
718 | #'
719 | #' @details
720 | #'   - The function checks for the class of each element in the `neighborhood`
721 | #'     list and processes vectors and matrices.
722 | #'   - For matrices, only the first column is kept.
723 | #'   - A data frame is created from the selected variables, using `cell_id`
724 | #'     as the row names.
725 | #'
726 | #' @export
727 | #'
728 | 
729 | neighborhood_analysis_to_metadata <- function(
730 |     neighborhood
731 | ){
732 |   var_names <- names(neighborhood)
733 |   var_class <- lapply(neighborhood, function(x){class(x)[1]}) %>% unlist()
734 |   is_vct    <- var_class[which(!var_class %in% c("matrix", "list", "igraph"))] %>% names()
735 | 
736 |   variables_to_keep <- c(is_vct)
737 |   neighborhood_df <- neighborhood[variables_to_keep]
738 |   neighborhood_df <- as.data.frame(neighborhood_df, row.names = neighborhood_df$cell_id)
739 | 
740 |   return(neighborhood_df)
741 | 
742 | }
743 | 


--------------------------------------------------------------------------------
/R/post_RCTD.R:
--------------------------------------------------------------------------------
  1 | #' Updates `score_mat` of the `Run.RCTD` output
  2 | #'
  3 | #' Adds `singlet_scores` as a diagonal and removes cell types with a low weight in full cell-type decomposition.
  4 | #'
  5 | #' This function modifies the `score_mat` by adding `singlet_scores` as a diagonal matrix and removes cell types
  6 | #' from the candidate list if their corresponding weight is below a user-defined threshold. The updated
  7 | #' `score_mat` and `singlet_scores` are saved in the `rctd` object.
  8 | #'
  9 | #' @param rctd An object resulting from \link[spacerx]{Run.RCTD}.
 10 | #' @param min_weight A threshold (numeric) to keep cell types as candidates. Cell types with a weight below this
 11 | #'        threshold are removed from the `score_mat`. Default is 0.01, which is the same as in `Run.RCTD`.
 12 | #' @param verbose Logical. If `TRUE`, the function will print messages about removed low-weight cell types.
 13 | #'        Default is `FALSE`.
 14 | #'
 15 | #' @return An updated `rctd` object with modified `score_mat` and `singlet_scores`.
 16 | #'
 17 | 
 18 | 
 19 | update_score_mat_RCTD <- function(
 20 |     rctd,
 21 |     min_weight = .01, # same as in run.RCTD, use higher values to remove non-relevant candidates
 22 |     verbose = FALSE,
 23 |     BPPARAM = bpparam(),
 24 |     n_workers = NULL
 25 | ){
 26 |   score_mat        <- rctd@results$score_mat
 27 |   weights          <- rctd@results$weights
 28 |   singlet_scores   <- rctd@results$singlet_scores
 29 | 
 30 |   # update score_mat by adding `singlet_scores` as a diagonal and then removing
 31 |   # candidate cell types that do not have sufficient weight
 32 | 
 33 |   cell_types <- colnames(weights)
 34 | 
 35 |   if (is.null(n_workers)) {
 36 |     n_workers <- multicoreWorkers() - 1
 37 |   }
 38 |   param <- MulticoreParam(workers = n_workers)
 39 | 
 40 |   result_list <- bplapply(seq_along(score_mat), function(i) {
 41 |     smat <- as.matrix(score_mat[[i]])
 42 |     diag(smat) <- singlet_scores[[i]]
 43 | 
 44 |     new_candidates <- cell_types[weights[i, ] > min_weight]
 45 |     cur_ct <- colnames(smat)
 46 |     keep_ct <- intersect(cur_ct, new_candidates)
 47 | 
 48 |     smat_sub <- smat[keep_ct, keep_ct, drop = FALSE]
 49 |     sscore_sub <- singlet_scores[[i]][keep_ct]
 50 | 
 51 |     list(score_mat = smat_sub, singlet_scores = sscore_sub)
 52 |   }, BPPARAM = BPPARAM)
 53 | 
 54 |   score_mat_xe <- lapply(result_list, function(x) x$score_mat)
 55 |   singlet_scores_xe <- lapply(result_list, function(x) x$singlet_scores)
 56 | 
 57 |   rctd@results$score_mat_xe <- score_mat_xe
 58 |   rctd@results$singlet_scores_xe <- singlet_scores_xe
 59 |   return(rctd)
 60 | }
 61 | 
 62 | #' Correct First Type in Very Confident Singlets
 63 | #'
 64 | #' This function updates the `first_type` and `spot_class` fields in the results of an RCTD object for cells classified as highly confident singlets. It ensures accurate cell-type annotations for singlets with only one candidate cell type and updates related fields accordingly.
 65 | #'
 66 | #' @param rctd An `RCTD` object containing spatial transcriptomics annotation results. The object must have the `singlet_scores_xe` field in `rctd@results`. Run `update_score_mat()` prior to using this function if the field is missing.
 67 | #'
 68 | #' @return An updated `RCTD` object with the following fields modified in `rctd@results$results_df_xe`:
 69 | #' \itemize{
 70 | #'   \item \code{first_type}: Updated for confident singlets to reflect correct annotation.
 71 | #'   \item \code{second_type}: Set to \code{NA} for confident singlets.
 72 | #'   \item \code{spot_class}: Updated to "singlet" for highly confident cells.
 73 | #'   \item \code{max_doublet_weight}: Added to capture the maximum weight for doublets.
 74 | #'   \item \code{n_candidates}: Number of candidate cell types for each spot.
 75 | #'   \item \code{rctd_weights_entropy}: Entropy of rctd weights in full cell-type decomposition.
 76 | #'
 77 | #' }
 78 | #'
 79 | #' @details
 80 | #' The function identifies cells with only one candidate cell type (i.e., highly confident singlets). It corrects the `first_type` assignment, removes the `second_type` (assignes to NA), and updates the `spot_class` to "singlet." This ensures proper classification of confident cells while leaving other cells unchanged.
 81 | #'
 82 | #' @examples
 83 | #' \dontrun{
 84 | #' rctd <- update_score_mat(rctd)
 85 | #' rctd <- correct_singlets(rctd)
 86 | #' }
 87 | #'
 88 | #' @import dplyr
 89 | 
 90 | 
 91 | correct_singlets <- function(
 92 |     rctd
 93 | ){
 94 |   if(!("singlet_scores_xe" %in% names(rctd@results)))
 95 |     stop("No `singlet_scores_xe` field in rctd@results, run `update_score_mat()` first!")
 96 | 
 97 |   if("scond_type" %in% colnames(rctd@results$results_df)){
 98 |     rctd@results$results_df$second_type <- rctd@results$results_df$scond_type
 99 |   }
100 |   df <- rctd@results$results_df
101 | 
102 |   # select cells with only one candidate cell type
103 |   len_mat <- lapply(rctd@results$singlet_scores_xe, FUN = length) %>% as.numeric() # number of cell type candidates (dont use singlet score, as that one has not been updated)
104 |   confident_singlet_idx <- which(len_mat == 1)
105 |   no_cell_type_idx <- which(len_mat == 0)
106 |   rejects_idx <- which(rctd@results$results_df$spot_class == "reject")
107 | 
108 |   #names(rctd@results$singlet_scores_xe) <- rctd@results$results_df %>% rownames()
109 | 
110 |   #get their singlet scores
111 |   argmin_singlet_score <- sapply(rctd@results$singlet_scores_xe,
112 |                                  function(x){
113 |                                    r <- x %>% which.min() %>% names()
114 |                                    if(is.null(r))
115 |                                      r <- NA
116 |                                    return(r)
117 |                                    }, simplify = T) %>% unlist()
118 | 
119 |   # update first type to make sure highly confined cells have correct annotation
120 |   first_type_updated <- rctd@results$results_df$first_type
121 |   first_type_updated[confident_singlet_idx] <- argmin_singlet_score[confident_singlet_idx]
122 |   first_type_updated[no_cell_type_idx] <- NA_character_
123 | 
124 |   # replace second type with NA as now RCTD assigns it to a random cell type (usually the first of the available cell types)
125 |   second_type_updated <- rctd@results$results_df$second_type
126 |   second_type_updated[confident_singlet_idx] <- NA_character_
127 |   second_type_updated[no_cell_type_idx] <- NA_character_
128 | 
129 |   # update spot class of highly confident cells to singlets
130 |   spot_class_upd <- rctd@results$results_df$spot_class
131 |   spot_class_upd[setdiff(confident_singlet_idx, rejects_idx)] <- "singlet"
132 | 
133 |   spot_class_upd[no_cell_type_idx] <- "reject"
134 |   spot_class_upd <- ordered(spot_class_upd, levels = c("reject", "doublet_uncertain", "doublet_certain", "singlet"))
135 |   max_doublet_weight <- apply(rctd@results$weights_doublet, 1, max) %>% unname()
136 | 
137 |   # compute entropy of rctd weight score
138 |   weights <- rctd@results$weights
139 |   weights[weights<0] <- 0
140 |   weights <- spacexr::normalize_weights(weights)
141 |   weights_entr <- apply(weights, 1, entropy::entropy) %>% unname()
142 | 
143 |   df <- rctd@results$results_df %>%
144 |     mutate(
145 |       first_type = first_type_updated,
146 |       second_type = second_type_updated,
147 |       max_doublet_weight = max_doublet_weight,
148 |       spot_class = spot_class_upd,
149 |       n_candidates = len_mat,
150 |       rctd_weights_entropy = weights_entr
151 |     )
152 | 
153 |   rctd@results$results_df_xe <- df
154 |   return(rctd)
155 | }
156 | 
157 | 
158 | #' Update scores in RCTD results data frame
159 | #'
160 | #' This function updates existing scores in `rctd@results$results_df_xe`, computes new scores, and adds flags to facilitate further analysis. The function integrates additional metadata, refines singlet scores for cell type annotations, computes score differences, and checks for relationships between cell types.
161 | #'
162 | #' @param rctd An `RCTD` object containing spatial transcriptomics data and results.
163 | #'
164 | #' @return An updated `RCTD` object with modified and new fields in `rctd@results$results_df_xe`, including:
165 | #' \itemize{
166 | #'   \item \code{singlet_score_first}: Singlet score for the cell type assigned to `first_type`.
167 | #'   \item \code{singlet_score_second}: Singlet score for the cell type assigned to `second_type` (if applicable).
168 | #'   \item \code{delta_singlet_score_first_second}: Difference between \code{singlet_score_first} and \code{singlet_score_second}.
169 | #'   \item \code{score_diff}: Difference between \code{singlet_score_first} and \code{min_score}.
170 | #'   \item \code{delta_singlet_score}: Difference between the smallest and second smallest singlet scores.
171 | #'   \item \code{delta_singlet_score_class}: Similar to \code{delta_singlet_score}, but excludes scores from the same class as \code{first_type}.
172 | #'   \item \code{weight_first_type}: Weight for the cell type assigned to `first_type`.
173 | #'   \item \code{weight_second_type}: Weight for the cell type assigned to `second_type` (if applicable).
174 | #'   \item \code{same_class}: Logical flag indicating whether `first_type` and `second_type` belong to the same class.
175 | #' }
176 | #'
177 | #' @details
178 | #' The function refines singlet scores for `first_type` and `second_type`, computes score differences, and introduces new metrics such as `delta_singlet_score` and `delta_singlet_score_class`. These metrics help distinguish between highly confident cell types and ambiguous cases. Additionally, the function evaluates whether `first_type` and `second_type` belong to the same class, using an internal class mapping.
179 | #'
180 | 
181 | update_scores_RCTD <- function(rctd, lite = TRUE){
182 | 
183 |   df <- rctd@results$results_df_xe
184 | 
185 |   if(!lite){
186 |     # Ensure first_type and second_type are available in the data frame
187 |     df <- df %>% mutate(
188 |       singlet_score_first = sapply(1:nrow(df), function(i) {
189 |         ft <- df$first_type[i] %>% as.vector()
190 |         return(rctd@results$singlet_scores[[i]][ft] %>% unname())
191 |       }),
192 |       singlet_score_second = sapply(1:nrow(df), function(i) {
193 |         st <- df$second_type[i] %>% as.vector()
194 |         if (is.na(st)) return(NA)
195 |         return(rctd@results$singlet_scores[[i]][st] %>% unname())
196 |       }),
197 |       delta_singlet_score_first_second = singlet_score_second - singlet_score_first
198 |     )
199 | 
200 |     # Update score_diff to be consistent with the new singlet_score
201 |     df <- df %>% mutate(
202 |       score_diff = df$singlet_score_first - df$min_score,  # Use this one
203 |       score_diff_old = df$singlet_score - df$min_score,  # Keep this to track
204 |       delta_singlet_score_original_first_class = singlet_score_first - singlet_score  # Difference between singlet_score and singlet_score_first
205 |     )
206 |   }
207 | 
208 |   # Add weight_first_type, weight_second_type
209 |   df <- df %>% mutate(
210 |     weight_first_type = rctd@results$weights_doublet[,"first_type"] %>% unname(),
211 |     weight_second_type = rctd@results$weights_doublet[,"second_type"] %>% unname()
212 |   )
213 | 
214 |   if(!lite){
215 |     # Calculate delta_singlet_score: difference between first and second smallest singlet scores
216 |     sorted_singlet_scores <- sapply(rctd@results$singlet_scores, FUN = function(x) sort(x, decreasing = F))
217 |     delta_singlet_score <- sapply(sorted_singlet_scores, FUN = function(x) {
218 |       if (length(x) == 1) return(Inf)
219 |       return(x[2] - x[1])
220 |     })
221 |     df$delta_singlet_score <- delta_singlet_score
222 | 
223 |     # Calculate delta_singlet_score_class (ignore same class elements when computing delta score)
224 |     sorted_singlet_scores_class <- sapply(sorted_singlet_scores, function(x) {
225 |       class_vec <- rctd@internal_vars$class_df[names(x), "class"]
226 |       mask <- class_vec != class_vec[1]  # Keep elements where their class != class[1]
227 |       mask[1] <- TRUE  # Keep the first element
228 |       return(x[mask])
229 |     })
230 | 
231 |     delta_singlet_score_class <- sapply(sorted_singlet_scores_class, FUN = function(x) {
232 |       if (length(x) == 1) return(0)
233 |       return(x[2] - x[1])
234 |     })
235 | 
236 |     df$delta_singlet_score_class <- delta_singlet_score_class
237 |   }
238 | 
239 |   # Check whether first_type and second_type come from the same class
240 |   df <- df %>%
241 |     mutate(
242 |       first_type_class = factor(
243 |         rctd@internal_vars$class_df[df$first_type %>% as.vector(), "class"],
244 |         levels = unique(rctd@internal_vars$class_df$class)
245 |       ),
246 |       second_type_class = factor(
247 |         rctd@internal_vars$class_df[df$second_type %>% as.vector(), "class"],
248 |         levels = unique(rctd@internal_vars$class_df$class)
249 |       ),
250 |       same_class =  first_type_class == second_type_class
251 |     )
252 | 
253 |   rctd@results$results_df_xe <- df
254 |   return(rctd)
255 | }
256 | 
257 | 
258 | #' Normalize `score_diff` to Compensate for Feature Count and Update `spot_class`
259 | #'
260 | #' This function normalizes `score_diff` in the RCTD results to account for the positive correlation between the number of features (`nFeature`) and the goodness-of-fit metrics (`singlet_score` and `min_score`). The normalization helps to reduce the bias introduced by feature count and adjusts the `spot_class` accordingly.
261 | #'
262 | #' @param rctd An `RCTD` object containing spatial transcriptomics data and results.
263 | #' @param nFeature_doublet_threshold Numeric. Threshold for `score_diff_normalized` to classify a spot as a singlet. Default is 0.5.
264 | #' @param nFeature a vector of nFeatures for `rctd@results$results_df_xe` rows, if not provided, will be computed from `rctd@spatialRNA@counts`
265 | #' @param nCount a vector of nCount for `rctd@results$results_df_xe` rows, if not provided, will be computed from `rctd@spatialRNA@counts`
266 | #'
267 | #' @return An updated `RCTD` object with the following fields added or updated in `rctd@results$results_df_xe`:
268 | #' \itemize{
269 | #'   \item \code{nCount}: The count of reads or molecules for each spot, retrieved from \code{xe}.
270 | #'   \item \code{nFeature}: The number of features (e.g., genes) detected for each spot, retrieved from \code{xe}.
271 | #'   \item \code{score_diff_normalized}: The normalized \code{score_diff}, computed as \code{score_diff / nFeature}.
272 | #'   \item \code{is_singlet_in_normalized_thresh}: Logical flag indicating whether a spot's normalized score falls below the doublet threshold.
273 | #'   \item \code{spot_class_normalized}: Updated `spot_class` that classifies spots as "singlet" if they pass the normalized threshold and were not previously rejected.
274 | #' }
275 | #'
276 | #' @details
277 | #' The function compensates for the effect of feature count (`nFeature`) on `score_diff` by normalizing it. Spots are reclassified as singlets if their normalized score (`score_diff_normalized`) is below the threshold specified by \code{nFeature_doublet_threshold}, and they are not rejected.
278 | 
279 | 
280 | normalize_score_diff_by_nFeature <- function(
281 |     rctd,
282 |     nFeature_doublet_threshold = 0.5,
283 |     nFeature = NULL,
284 |     nCount = NULL
285 | ) {
286 | 
287 |   if(is.null(nFeature) || is.null(nCount)){
288 |     # nCount, nFeature
289 |     rctd@results$results_df_xe <- rctd@results$results_df_xe %>% mutate(
290 |       nCount = Matrix::colSums(rctd@spatialRNA@counts) %>% unname(),
291 |       nFeature = Matrix::colSums(rctd@spatialRNA@counts > 0) %>% unname()
292 |     )
293 |   } else {
294 |     rctd@results$results_df_xe <- rctd@results$results_df_xe %>% mutate(
295 |       nCount = nCount,
296 |       nFeature = nFeature
297 |     )
298 |   }
299 | 
300 |   # Normalization
301 |   rctd@results$results_df_xe <- rctd@results$results_df_xe %>%
302 |     mutate(
303 |       score_diff_normalized = score_diff / nFeature,
304 |       is_singlet_in_normalized_thresh = score_diff_normalized < nFeature_doublet_threshold,
305 |       spot_class_normalized = if_else(is_singlet_in_normalized_thresh & spot_class != "reject",
306 |                                       "singlet", spot_class) %>% ordered(levels = levels(spot_class))
307 |     )
308 |   return(rctd)
309 | }
310 | 
311 | #' Compute Alternative Annotations
312 | #'
313 | #' This function computes three alternative annotations for each cell based on the results in the `rctd` object:
314 | #'
315 | #' - `annot_min_singlet_score`: The cell type with the minimum singlet score.
316 | #' - `annot_max_weight`: The cell type with the maximum weight.
317 | #' - `annot_max_doublet_weight`: The cell type with the maximum doublet weight, choosing the first type if the first weight is greater than the second, or the second type otherwise.
318 | #'
319 | #' These annotations are added to the `results_df_xe` slot of the `rctd` object.
320 | #'
321 | #' @param rctd An object of class `RCTD` containing the results from the spatial transcriptomics analysis.
322 | #'
323 | #' @return The input `rctd` object with the updated `results_df_xe` slot containing the three alternative annotations.
324 | #'
325 | 
326 | compute_alternative_annotations <- function(rctd){
327 | 
328 |   # Compute alternative annotations
329 |   annot_min_singlet_score <- sapply(rctd@results$singlet_scores, function(x) {
330 |     names(x)[which.min(x)]  # directly find the name with min singlet score
331 |   }) %>% unname()
332 | 
333 |   annot_max_weight <- apply(rctd@results$weights, 1, function(x) {
334 |     names(x)[which.max(x)] # find the name with max weight
335 |   }) %>% unname()
336 | 
337 |   annot_max_doublet_weight <- ifelse(
338 |     rctd@results$results_df_xe$weight_first_type > rctd@results$results_df_xe$weight_second_type |
339 |       is.na(rctd@results$results_df_xe$weight_second_type),
340 |     rctd@results$results_df_xe$first_type %>% as.vector(),
341 |     rctd@results$results_df_xe$second_type %>% as.vector()
342 |   ) %>% unname()
343 | 
344 |   rctd@results$results_df_xe <- rctd@results$results_df_xe %>%
345 |     mutate(
346 |       annot_min_singlet_score = annot_min_singlet_score,
347 |       annot_max_weight = annot_max_weight,
348 |       annot_max_doublet_weight = annot_max_doublet_weight,
349 |       w1_larger_w2 = first_type == annot_max_doublet_weight
350 |     )
351 |   return(rctd)
352 | }
353 | 
354 | #' Computes Normalized Entropy of Annotations and Stores in `rctd@results$results_df_xe`
355 | #'
356 | #' This function computes the normalized entropy of the annotations for each cell type in the `rctd@results$results_df_xe` dataframe.
357 | #' The entropy is computed based on the relative frequencies of each annotation field and normalized by the maximum possible entropy.
358 | #'
359 | #' The following entropy values are computed:
360 | #' - `entropy_first_type`: Normalized entropy for the `first_type` annotation, considering annotations such as `annot_min_singlet_score`, `annot_max_weight`, and `annot_max_doublet_weight`.
361 | #' - `entropy_second_type`: Normalized entropy for the `second_type` annotation, computed in the same way as `entropy_first_type`.
362 | #'
363 | #' Entropy values indicate the diversity of the annotations for each cell type. A high entropy value suggests more diverse annotations, while a low value indicates more certainty in the annotation.
364 | #'
365 | #' @param rctd An object of class `RCTD` containing the results of the spatial transcriptomics analysis.
366 | #'
367 | #' @return The input `rctd` object with two additional columns in `results_df_xe`:
368 | #' - `entropy_first_type`: The normalized entropy for the `first_type` annotation.
369 | #' - `entropy_second_type`: The normalized entropy for the `second_type` annotation.
370 | 
371 | 
372 | compute_annotation_entropy <- function(rctd){
373 |   annotation_fields <- grep("annot_", colnames(rctd@results$results_df_xe), value = T)
374 | 
375 |   normalized_entropy <- function(x){
376 |     vec <- x %>% factor() %>% forcats::fct_infreq() %>% as.numeric()
377 |     freqs <- table(vec) / length(vec)
378 |     entropy_value <- entropy::entropy(freqs, unit = "log2")
379 |     # Calculate maximum entropy
380 |     max_entropy <- log2(length(vec))
381 | 
382 |     if(is.na(entropy_value)) # missing argmax doublet weight because the first type has low doublet weight,but very high overall weight such that ther is only one candidate cell type (instability in RCTD decomposition)
383 |       entropy_value <- 0
384 | 
385 |     return(entropy_value / max_entropy)
386 |   }
387 | 
388 |   entropy_first_type <- rctd@results$results_df_xe %>% select(all_of(c("first_type", annotation_fields))) %>%
389 |     apply(1, FUN = normalized_entropy)
390 |   entropy_second_type <- rctd@results$results_df_xe %>% select(all_of(c("second_type", annotation_fields))) %>%
391 |     apply(1, FUN = normalized_entropy)
392 | 
393 |   rctd@results$results_df_xe$entropy_first_type <- entropy_first_type
394 |   rctd@results$results_df_xe$entropy_second_type <- entropy_second_type
395 |   return(rctd)
396 | }
397 | 
398 | #' Computes Annotation Confidence Based on Normalized Entropy of Original and Alternative Annotations
399 | #'
400 | #' This function computes the annotation confidence for each cell based on the normalized entropy of the original and alternative annotations (`entropy_first_type` and `entropy_second_type`), as well as the cell's weight (`weight_first_type`).
401 | #' The confidence for each cell type is computed as follows:
402 | #' - `confidence_first_type = (1 - entropy_first_type) * weight_first_type`
403 | #' - `confidence_second_type = (1 - entropy_second_type) * (1 - weight_first_type)`
404 | #'
405 | #' These confidence values reflect the reliability of the annotations, with higher confidence indicating greater certainty in the cell's type assignment. The entropy values capture the diversity of the annotations, and the weights are used to adjust the confidence based on the strength of the cell-type assignment.
406 | #'
407 | #' @param rctd An object of class `RCTD` containing the results of the spatial transcriptomics analysis.
408 | #'
409 | #' @return The input `rctd` object with two additional columns in `results_df_xe`:
410 | #' - `confidence_first_type`: The computed confidence for the `first_type` annotation.
411 | #' - `confidence_second_type`: The computed confidence for the `second_type` annotation.
412 | 
413 | 
414 | compute_annotation_confidence <- function(rctd){
415 |   rctd@results$results_df_xe <- rctd@results$results_df_xe %>%
416 |     mutate(
417 |       confidence_first_type = (1 - entropy_first_type) * weight_first_type,
418 |       confidence_second_type = (1 - entropy_second_type) * (1-weight_first_type)
419 |     )
420 |   return(rctd)
421 | }
422 | 
423 | 
424 | #' Runs the post-processing pipeline for RCTD.
425 | #'
426 | #' This function sequentially applies a series of processing steps to the RCTD results,
427 | #' including score matrix updates, singlet correction, score updates, normalization,
428 | #' alternative annotation computation, entropy calculation, and annotation confidence computation.
429 | #'
430 | #' The following processing steps are performed in order:
431 | #' - Updates the score matrix using `update_score_mat_RCTD()`
432 | #' - Corrects singlets using `correct_singlets()`
433 | #' - Updates scores with `update_scores_RCTD()`
434 | #' - Normalizes the score difference by the number of features using `normalize_score_diff_by_nFeature()`
435 | #' - Computes alternative annotations using `compute_alternative_annotations()`
436 | #' - Computes normalized entropy for each annotation using `compute_annotation_entropy()`
437 | #' - Computes annotation confidence using `compute_annotation_confidence()`
438 | #'
439 | #' The final results are stored in the `rctd@results$results_df` slot, and the previous results are saved in
440 | #' `rctd@results$results_df_old` for tracking purposes. The intermediate results in `results_df_xe` are cleared.
441 | #'
442 | #' @param rctd RCTD object containing the results.
443 | #' @param min_weight Minimum weight threshold for the score matrix update (default: 0.05).
444 | #' @param nFeature_doublet_threshold Threshold for doublet classification when normalizing the score difference
445 | #'                                  by the number of features (default: 0.5).
446 | #' @param nFeature Optional numeric vector providing the number of features for each cell,
447 | #'                 overrides calculation based on the data if provided (default: NULL).
448 | #' @param nCount Optional numeric vector providing the count of features for each cell,
449 | #'                overrides calculation based on the data if provided (default: NULL).
450 | #'
451 | #' @return Updated RCTD object with processed results.
452 | #' @export
453 | 
454 | run_post_process_RCTD <- function(
455 |     rctd,
456 |     min_weight = 0.05,
457 |     nFeature_doublet_threshold = 0.5,
458 |     nFeature = NULL,
459 |     nCount = NULL,
460 |     n_workers = NULL,
461 |     lite = TRUE
462 | ){
463 |   message("Updating score_mat ...")
464 |   rctd <- update_score_mat_RCTD(
465 |     rctd = rctd,
466 |     min_weight = min_weight,
467 |     n_workers = n_workers
468 |   )
469 | 
470 |   message("Correcting singlets ...")
471 |   rctd <- correct_singlets(rctd = rctd)
472 | 
473 |   message("Updating scores ...")
474 |   rctd <- update_scores_RCTD(rctd = rctd, lite = lite)
475 | 
476 |   message("Add coordinates to results ...")
477 |   rctd@results$results_df_xe$x <- rctd@spatialRNA@coords[rownames(rctd@results$results_df_xe), "x"]
478 |   rctd@results$results_df_xe$y <- rctd@spatialRNA@coords[rownames(rctd@results$results_df_xe), "y"]
479 | 
480 |   if(!lite){
481 |     message("Normalizing score_diff by nFeature ...")
482 |     rctd <- normalize_score_diff_by_nFeature(
483 |       rctd = rctd,
484 |       nFeature_doublet_threshold = nFeature_doublet_threshold,
485 |       nFeature = nFeature,
486 |       nCount = nCount
487 |     )
488 |   }
489 | 
490 |   message("Computing alternative annotations ...")
491 |   rctd <- compute_alternative_annotations(rctd)
492 | 
493 |   #message("Computing annotation entropy ...")
494 |   #rctd <- compute_annotation_entropy(rctd)
495 | 
496 |   #message("Computing annotation confidence ...")
497 |   #rtrd <- compute_annotation_confidence(rctd)
498 | 
499 |   message("Replacing results_df ...")
500 |   # store the old results to keep track and replace them with the new ones
501 |   rctd@results$results_df_old <- rctd@results$results_df
502 |   rctd@results$results_df <- rctd@results$results_df_xe
503 |   rctd@results$results_df_xe <- NULL
504 | 
505 |   return(rctd)
506 | }
507 | 
508 | 


--------------------------------------------------------------------------------
/R/purification.R:
--------------------------------------------------------------------------------
  1 | # Functions below were adapted and modified from:
  2 | 
  3 | #get_decomposed_data() # https://github.com/dmcable/spacexr/blob/0a0861e3d1e16014a20e9b743d0e19d3b42231f3/R/postProcessing.R#L81C1-L81C20
  4 | #get_decomposed_data_full_doublet() # https://github.com/dmcable/spacexr/blob/0a0861e3d1e16014a20e9b743d0e19d3b42231f3/R/postProcessing.R#L44
  5 | #decompose_doublet_fast() # https://github.com/dmcable/spacexr/blob/0a0861e3d1e16014a20e9b743d0e19d3b42231f3/R/RCTD_helper.R#L219C1-L219C23
  6 | 
  7 | #' Decompose doublet into 2 profiles
  8 | #'
  9 | #' @param bead (counts) profile to decompose
 10 | #' @param weights cell-type weights (expected to sum up to 1)
 11 | #' @param gene_list vector of genes (should be (a subset of) common genes between reference data and query data)
 12 | #' @param cell_type_info slot of \link[spacexr]{run.RCDT} output containing reference profile matrix
 13 | #' @param type1 name of the main cell-type (expected `first_type`)
 14 | #' @param cell_types name(s) of all cell types present in the cell (expects `second_type` in case of decomposition of `doublet_certain` and all present cell types in case of `doublet_uncertain`)
 15 | #'
 16 | #' @return description list of the profile of the first and the second profiles,
 17 | #'in case of decomposition of `doublet_uncertain` (i.e., when `length(cell_types)>1`), the second profile corresponds to the complement (bead - first_profile)
 18 | #' @export
 19 | 
 20 | decompose_doublet <- function(
 21 |     bead, weights, gene_list, cell_type_info, type1, cell_types
 22 | ){
 23 |   bead     <- bead[gene_list]
 24 | 
 25 |   N_genes  <- length(gene_list)
 26 |   epsilon  <- 1e-10
 27 | 
 28 |   if(!(type1 %in% cell_types)){
 29 |     cell_types <- c(type1, cell_types)
 30 |   }
 31 | 
 32 |   cell_types   <- unique(cell_types)
 33 |   N_cell_types <- length(cell_types)
 34 | 
 35 |   denom       <- rowSums(sweep(as.matrix(cell_type_info[[1]][gene_list,cell_types]), 2, weights[cell_types], FUN = "*")) + epsilon
 36 |   posterior_1 <- (weights[type1] * cell_type_info[[1]][gene_list,type1] + epsilon/N_cell_types) / denom
 37 |   expect_1    <- posterior_1 * bead
 38 | #  expect_2    <- bead - expect_1
 39 |   variance    <- expect_1 * (1 - posterior_1)
 40 | 
 41 |   return(list(expect_1 = expect_1,
 42 |              # expect_2 = expect_2,
 43 |               variance = variance))
 44 | }
 45 | 
 46 | 
 47 | #' Purify Counts Using RCTD Output
 48 | #'
 49 | #' This function purifies a query count matrix using the RCTD output, handling both certain and uncertain doublets, as well as optionally processing singlets.
 50 | #'
 51 | #' @param counts A matrix of gene expression counts, where rows represent genes and columns represent cell IDs.
 52 | #' @param results_df A data frame from the `results_df` slot of \link[spacexr]{run.RCTD} output, containing cell-type assignments and classifications.
 53 | #' @param ct_weights A matrix of cell-type weights for each spot from the RCTD output.
 54 | #' @param cell_type_info A list containing cell-type information used for purification.
 55 | #' @param DO_purify_singlets Logical; if `TRUE`, singlets will also be purified.
 56 | #' @param n_workers Integer; the number of parallel workers to use. If `NULL`, it defaults to the number of available cores minus one.
 57 | #' @param chunk_size Integer; the number of barcodes processed in each batch for parallelization. Default is 10,000.
 58 | #'
 59 | #' @return A list containing:
 60 | #' \describe{
 61 | #'   \item{purified_counts}{A matrix of purified gene expression counts.}
 62 | #'   \item{cell_meta}{A data frame with metadata for each cell, including purification status.}
 63 | #' }
 64 | #'
 65 | #' @import BiocParallel
 66 | #' @export
 67 | 
 68 | purify_counts_with_rctd <- function(counts, results_df, ct_weights, cell_type_info, DO_purify_singlets, n_workers = NULL, chunk_size = 10000) {
 69 | 
 70 |   is.certain <- c("doublet_certain")
 71 |   if(DO_purify_singlets){
 72 |     is.certain <- c(is.certain, "singlet")
 73 |   }
 74 | 
 75 |   doublets_certain    <- results_df[results_df$spot_class %in% is.certain,] %>% rownames()
 76 |   doublets_uncertain  <- results_df[results_df$spot_class %in% c("doublet_uncertain"),] %>% rownames()
 77 | 
 78 |   if (is.null(n_workers)) {
 79 |     n_workers <- multicoreWorkers() - 1
 80 |   }
 81 |   param <- MulticoreParam(workers = n_workers)
 82 | 
 83 |   gene_list <- intersect(rownames(counts), rownames(cell_type_info[[1]]))
 84 |   print(length(gene_list))
 85 |   # Function to decompose certain doublets
 86 |   decompose_certain <- function(bead, results_df_bead, ct_weights_bead, gene_list, cell_type_info) {
 87 |     tryCatch({
 88 | 
 89 |       bead <- bead[gene_list, ]
 90 |       barcode <- rownames(results_df_bead)[1]
 91 |       type1 <- as.vector(results_df_bead[barcode, "first_type"])
 92 |       type2 <- as.vector(results_df_bead[barcode, "second_type"])
 93 | 
 94 |       w1 <- as.vector(results_df_bead[barcode, "weight_first_type"])
 95 |       w2 <- as.vector(results_df_bead[barcode, "weight_second_type"])
 96 | 
 97 |       if(is.na(type2)){ # highly confident singlet -> Do Not Purify!
 98 |         return(list(barcode = barcode, res = bead))
 99 |       }
100 | 
101 |       if(is.na(w2)){
102 |         w1 <- 1
103 |         w2 <- 0
104 |       }
105 | 
106 |       wgts <- c(w1, w2)
107 |       wgts <- wgts / sum(wgts)
108 | 
109 |       names(wgts) <- c(type1, type2)
110 | 
111 |       doub_res <- decompose_doublet(
112 |         bead,
113 |         wgts,
114 |         gene_list,
115 |         cell_type_info,
116 |         type1,
117 |         c(type1, type2)
118 |       )
119 |       return(list(barcode = barcode, res = doub_res$expect_1))
120 |     }, error = function(e) {
121 |       message(sprintf("Error processing barcode %s: %s", barcode, e$message))
122 |       return(NULL)
123 |     })
124 |   }
125 | 
126 |   # Function to decompose uncertain doublets
127 |   decompose_uncertain <- function(bead, results_df_bead, ct_weights_bead, gene_list, cell_type_info) {
128 |     tryCatch({
129 | 
130 |       bead <- bead[gene_list,]
131 |       barcode <- rownames(results_df_bead)[1]
132 |       type1 <- as.vector(results_df_bead[barcode, "first_type"])
133 |       type2 <- as.vector(results_df_bead[barcode, "second_type"])
134 | 
135 |       if(is.na(type2)){ # highly confident singlet -> Do Not Purify!
136 |         return(list(barcode = barcode, res = bead))
137 |       }
138 | 
139 |       wgts <- ct_weights_bead
140 |       wgts <- wgts / sum(wgts)
141 |       types <- names(wgts)
142 | 
143 |       doub_res <- decompose_doublet(
144 |         bead,
145 |         wgts,
146 |         gene_list,
147 |         cell_type_info,
148 |         type1,
149 |         types
150 |       )
151 |       return(list(barcode = barcode, res = doub_res$expect_1))
152 |     }, error = function(e) {
153 |       message(sprintf("Error processing barcode %s: %s", barcode, e$message))
154 |       return(NULL)
155 |     })
156 |   }
157 | 
158 |   # Helper function to process chunks
159 |   process_chunks <- function(barcodes, decompose_func) {
160 |     results_list <- list()
161 | 
162 |     for (i in seq(1, length(barcodes), by = chunk_size)) {
163 |       cat(round(i * 100 / length(barcodes)), "%\n")
164 | 
165 |       # Select only relevant barcodes for this chunk
166 |       chunk_barcodes <- barcodes[i:min(i + chunk_size - 1, length(barcodes))]
167 | 
168 | 
169 | 
170 |       # Process in parallel, passing only the subsetted counts
171 |       chunk_results <- bplapply(
172 |         chunk_barcodes,
173 |         function(barcode) {
174 |           decompose_func(counts[, barcode, drop = FALSE], results_df[barcode,], ct_weights[barcode,], gene_list, cell_type_info)
175 |         },
176 |         BPPARAM = param
177 |       )
178 | 
179 |       results_list <- c(results_list, chunk_results)
180 |     }
181 | 
182 |     results_list <- Filter(Negate(is.null), results_list)
183 |     return(results_list)
184 |   }
185 | 
186 | 
187 |   # Process certain doublets
188 |   cat("Processing certain doublets...\n")
189 |   certain_results <- process_chunks(doublets_certain, decompose_certain)
190 |   res_certain_mtrx <- matrix(NA, nrow = length(gene_list), ncol = length(doublets_certain), dimnames = list(gene_list, doublets_certain))
191 |   for (res in certain_results) {
192 |     res_certain_mtrx[, res$barcode] <- res$res
193 |   }
194 | 
195 |   # Process uncertain doublets
196 |   cat("Processing uncertain doublets...\n")
197 |   uncertain_results <- process_chunks(doublets_uncertain, decompose_uncertain)
198 |   res_uncertain_mtrx <- matrix(NA, nrow = length(gene_list), ncol = length(doublets_uncertain), dimnames = list(gene_list, doublets_uncertain))
199 |   for (res in uncertain_results) {
200 |     res_uncertain_mtrx[, res$barcode] <- res$res
201 |   }
202 | 
203 |   # Combine results
204 |   cat("Combaning doublets results ...\n")
205 |   purified <- cbind(res_certain_mtrx, res_uncertain_mtrx)
206 |   cell_ids <- c(colnames(res_certain_mtrx), colnames(res_uncertain_mtrx))
207 | 
208 |   # Process singlets
209 |   if(!DO_purify_singlets){
210 |     cat("Processing singlets...\n")
211 |     singlets <- results_df[results_df$spot_class == "singlet",]
212 |     res_singlet_mtrx <- counts[gene_list, rownames(singlets)]
213 |     purified <- cbind(purified, res_singlet_mtrx)
214 |     cell_ids <- c(cell_ids, colnames(res_singlet_mtrx))
215 |   }
216 | 
217 |   colnames(purified) <- cell_ids
218 |   rownames(purified) <- gene_list
219 | 
220 |   #purified <- purified[,colnames(counts)]
221 |   cell_meta <- results_df[colnames(purified),]
222 | 
223 |   cell_meta$purification_status <- "purified"
224 |   if(!DO_purify_singlets){
225 |     cell_meta$purification_status[cell_meta$spot_class == "singlet"] <- "raw"
226 |   }
227 |   cell_meta$purification_status[is.na(cell_meta$second_type)] <- "raw" # Very confident singlets are not purified
228 | 
229 |   cell_meta$cell_id <- rownames(cell_meta)
230 | 
231 |   return(list(purified_counts = purified, cell_meta = cell_meta))
232 | }
233 | 
234 | 
235 | #' Purify Data Using RCTD Output
236 | #'
237 | #' This function purifies a query count matrix using the RCTD output, handling both certain and uncertain doublets, as well as optionally processing singlets.
238 | #'
239 | #' @param counts A matrix of gene expression counts, where rows represent genes and columns represent cell IDs.
240 | #' @param rctd RCTD output
241 | #' @param DO_purify_singlets Logical; if `TRUE`, singlets will also be purified.
242 | #' @param n_workers Integer; the number of parallel workers to use. If `NULL`, it defaults to the number of available cores minus one.
243 | #' @param chunk_size Integer; the number of barcodes processed in each batch for parallelization. Default is 10,000.
244 | #'
245 | #' @return A list containing:
246 | #' \describe{
247 | #'   \item{purified_counts}{A matrix of purified gene expression counts.}
248 | #'   \item{cell_meta}{A data frame with metadata for each cell, including purification status.}
249 | #' }
250 | #'
251 | #' @import BiocParallel
252 | #' @export
253 | 
254 | purify <- function(counts, rctd, DO_purify_singlets, n_workers = NULL, chunk_size = 10000) {
255 | 
256 |   results_df <- rctd@results$results_df
257 | 
258 |   common_cells <- intersect(colnames(counts), rownames(results_df))
259 |   results_df <- results_df[common_cells, ]
260 | 
261 |   cell_type_info <- rctd@cell_type_info[[1]]
262 |   ct_weights <- rctd@results$weights
263 |   ct_weights <- ct_weights[common_cells, colnames(cell_type_info[[1]])]
264 | 
265 |   counts <- counts[,common_cells]
266 | 
267 |   return(purify_counts_with_rctd(
268 |     counts = counts,
269 |     results_df = results_df,
270 |     ct_weights = ct_weights,
271 |     cell_type_info = cell_type_info,
272 |     DO_purify_singlets = DO_purify_singlets,
273 |     n_workers = n_workers,
274 |     chunk_size = chunk_size
275 |   )
276 |   )
277 | }
278 | 
279 | 
280 | 
281 | 
282 | 
283 | 
284 | 
285 | 
286 | 
287 | 
288 | 
289 | 
290 | 


--------------------------------------------------------------------------------
/R/split_dataset.R:
--------------------------------------------------------------------------------
  1 | #' Split Purified Count Data
  2 | #'
  3 | #' This function splits purified count data into two components based on the first and second cell type annotations. It ensures that negative values in the residual split profile are replaced with zeros.
  4 | #'
  5 | #' @param counts A matrix of raw count data where rows represent genes and columns represent cells.
  6 | #' @param rctd An object containing RCTD results, including spatial and cell type information.
  7 | #' @param DO_purify_singlets A logical value indicating whether to purify singlet cells.
  8 | #' @param n_workers An optional integer specifying the number of workers for parallel processing. Defaults to `NULL`.
  9 | #' @param chunk_size An integer defining the chunk size for processing. Defaults to `10000`.
 10 | #'
 11 | #' @return A list containing:
 12 | #'   \itemize{
 13 | #'     \item `purified_counts`: A matrix of purified count data, with two components labeled `_1` and `_2` for each cell.
 14 | #'     \item `cell_meta`: A data frame containing metadata for the purified cells, including cell type annotations and decomposition order.
 15 | #'   }
 16 | #'
 17 | #' @note The function modifies negative residual values in `purified_2`, replacing them with zeros.
 18 | #' @export
 19 | 
 20 | 
 21 | split <- function(counts, rctd, DO_purify_singlets, n_workers = NULL, chunk_size = 10000){
 22 | 
 23 |   purified <- purify(counts = counts, rctd = rctd, DO_purify_singlets = DO_purify_singlets, n_workers = n_workers, chunk_size = chunk_size)
 24 | 
 25 |   purified_1 <- purified$purified_counts
 26 |   cell_meta  <- purified$cell_meta
 27 |   cell_meta$cell_id <- rownames(cell_meta)
 28 | 
 29 |   purified_2 <- counts[rownames(purified_1), colnames(purified_1)] - purified_1
 30 | 
 31 |   if(sum(purified_2<0) > 0){
 32 |     warning("Residual split profile has ", sum(purified_2<0), "negative values that are replaced with 0s")
 33 |     purified_2[purified_2<0] <- 0
 34 |   }
 35 | 
 36 |   cell_meta_1 <- cell_meta
 37 |   cell_meta_1$cell_type <- cell_meta_1$first_type
 38 | 
 39 |   colnames(purified_1) <- paste0(colnames(purified_1), "_1")
 40 |   rownames(cell_meta_1) <- paste0(rownames(cell_meta_1), "_1")
 41 |   cell_meta_1$decomposition_order <- "first"
 42 | 
 43 |   cell_meta_2 <- cell_meta
 44 |   cell_meta_2$cell_type <- cell_meta_2$second_type
 45 |   cell_meta_2$purification_status[cell_meta_2$purification_status == "raw"] <- "null"
 46 |   colnames(purified_2) <- paste0(colnames(purified_2), "_2")
 47 |   rownames(cell_meta_2) <- paste0(rownames(cell_meta_2), "_2")
 48 |   cell_meta_2$decomposition_order <- "second"
 49 | 
 50 | 
 51 |   return(list(
 52 |     purified_counts = cbind(purified_1, purified_2),
 53 |     cell_meta = rbind(cell_meta_1, cell_meta_2)
 54 |   ))
 55 | }
 56 | 
 57 | 
 58 | #' Split Purified Count Data
 59 | #' @export
 60 | 
 61 | 
 62 | balance_split <- function(
 63 |     xe_raw,
 64 |     xe_purified,
 65 |     spot_class_key = "spot_class",
 66 |     DO_purify_singlets = TRUE,
 67 |     DO_split_singlets = TRUE,
 68 |     DO_split_doublets_uncertain = FALSE, # keep first only
 69 |     default_assay = "Xenium"
 70 | ){
 71 | 
 72 |   if(!spot_class_key %in% colnames(xe_raw@meta.data)){
 73 |     stop("spot_class_key", spot_class_key, "is not available in `xe_raw`, please compute it first!")
 74 |   }
 75 | 
 76 |   if(!"first_type" %in% colnames(xe_raw@meta.data)){
 77 |     stop("`first_type` is not available in `xe_raw`, please compute it first!")
 78 |   }
 79 | 
 80 |   cells_to_remove <-
 81 |     xe_raw@meta.data %>%
 82 |     filter(.data[[spot_class_key]] == "reject") %>%
 83 |     rownames()
 84 | 
 85 |   if(DO_purify_singlets){
 86 |     cells_to_keep_raw <-
 87 |       xe_raw@meta.data %>%
 88 |       filter(is.na(second_type)) %>%
 89 |       rownames()
 90 |     cat(cells_to_keep_raw[1:10], "\n")
 91 |     cat(length(cells_to_keep_raw))
 92 |   } else {
 93 |     cells_to_keep_raw <-
 94 |       xe_raw@meta.data %>%
 95 |       filter(.data[[spot_class_key]] == "singlet") %>%
 96 |       rownames()
 97 |     cat(cells_to_keep_raw[1:10], "\n")
 98 |     cat(length(cells_to_keep_raw))
 99 |   }
100 |   cells_to_keep_raw <- setdiff(cells_to_keep_raw, cells_to_remove)
101 | 
102 |   cells_to_replace_with_purified <- setdiff(colnames(xe_raw), c(cells_to_keep_raw, cells_to_remove))
103 |   cat(length(cells_to_replace_with_purified), " - all cells_to_replace_with_purified")
104 |   cells_to_replace_with_purified <- cells_to_replace_with_purified[cells_to_replace_with_purified %in% colnames(xe_purified)]
105 |   cat(length(cells_to_replace_with_purified), " - in colnames(xe_purified)")
106 | 
107 |   xe_raw$purification_status <- "raw"
108 |   xe_raw@meta.data[cells_to_replace_with_purified,"purification_status"] <- xe_purified@meta.data[cells_to_replace_with_purified,"purification_status"]
109 |   xe_raw@meta.data[cells_to_remove,"purification_status"] <- "removed"
110 | 
111 |   common_genes <- intersect(rownames(xe_raw), rownames(xe_purified))
112 | 
113 |   message("raw... \n")
114 |   raw <- GetAssayData(xe_raw, assay = default_assay, layer = "counts")[common_genes, cells_to_keep_raw]
115 |   cell_meta_raw <- xe_raw@meta.data[cells_to_keep_raw,]
116 |   cell_meta_raw$decomposition_order <- "raw"
117 |   cell_meta_raw$cell_id <- rownames(cell_meta_raw)
118 |   cell_meta_raw$cell_type <- cell_meta_raw$first_type
119 | 
120 |   message("pur_1 ... \n")
121 |   cat(cells_to_replace_with_purified[1:5], "\n")
122 |   cat(length(cells_to_replace_with_purified))
123 |   purified_1 <- GetAssayData(xe_purified, assay = default_assay, layer = "counts")[common_genes, cells_to_replace_with_purified]
124 |   cell_meta  <- xe_purified@meta.data[colnames(purified_1), ]
125 |   cell_meta$cell_id <- rownames(cell_meta)
126 | 
127 |   message("pur_2 ... \n")
128 |   purified_2 <- GetAssayData(xe_raw, assay = default_assay, layer = "counts")[rownames(purified_1), colnames(purified_1)] - purified_1
129 |   if(sum(purified_2<0) > 0){
130 |     warning(purified_2[purified_2<0] %>% as.vector() %>% summary())
131 |     warning("Residual split profile has ", sum(purified_2<0), "negative values that are replaced with 0s")
132 |     purified_2[purified_2<0] <- 0
133 |   }
134 | 
135 |   cell_meta_1 <- cell_meta
136 |   cell_meta_1$cell_type <- cell_meta_1$first_type
137 |   colnames(purified_1) <- paste0(colnames(purified_1), "_1")
138 |   rownames(cell_meta_1) <- paste0(rownames(cell_meta_1), "_1")
139 |   cell_meta_1$decomposition_order <- "first"
140 | 
141 |   cell_meta_2 <- cell_meta
142 |   cell_meta_2$cell_type <- cell_meta_2$second_type
143 |   cell_meta_2$purification_status[cell_meta_2$purification_status == "raw"] <- "null"
144 |   colnames(purified_2) <- paste0(colnames(purified_2), "_2")
145 |   rownames(cell_meta_2) <- paste0(rownames(cell_meta_2), "_2")
146 |   cell_meta_2$decomposition_order <- "second"
147 | 
148 | 
149 |   if(!DO_split_singlets){
150 |     if(DO_purify_singlets){
151 |       # remove second profile of singlets
152 |       cells_to_keep_second_profile <- cell_meta_2 %>% filter(spot_class != "singlet") %>% rownames()
153 | 
154 |       purified_2 <- purified_2[,cells_to_keep_second_profile]
155 |       cell_meta_2 <- cell_meta_2[cells_to_keep_second_profile,]
156 |     }
157 |   } else {
158 |     if(!DO_purify_singlets){
159 |       warning("Cannot split singlets as they are raw")
160 |     }
161 |   }
162 | 
163 |   if(!DO_split_doublets_uncertain){
164 |     cells_to_keep_second_profile <- cell_meta_2 %>% filter(spot_class != "doublet_uncertain") %>% rownames()
165 | 
166 |     purified_2 <- purified_2[,cells_to_keep_second_profile]
167 |     cell_meta_2 <- cell_meta_2[cells_to_keep_second_profile,]
168 |   }
169 | 
170 |   common_cols <- intersect(colnames(cell_meta_raw), colnames(cell_meta_1))
171 | 
172 |   count_matrix <- cbind(raw, cbind(purified_1, purified_2))
173 |   meta_data = rbind(cell_meta_raw[,common_cols], rbind(cell_meta_1[,common_cols], cell_meta_2[,common_cols]))
174 | 
175 |   xe_balanced <- CreateSeuratObject(counts = count_matrix, assay = default_assay, meta.data = meta_data)
176 |   return(xe_balanced)
177 | }
178 | 
179 | 
180 | 


--------------------------------------------------------------------------------
/R/utils.R:
--------------------------------------------------------------------------------
 1 | #' Extract Singlet Cell Data
 2 | #'
 3 | #' This function filters an `xe` object to retain only the cells classified as "singlet"
 4 | #' based on the classification stored in the RCTD results.
 5 | #'
 6 | #' @param xe A Seurat object containing spatial transcriptomics data.
 7 | #' @param rctd An object containing RCTD results.
 8 | #' @param spot_class_key A character string specifying the column name in `rctd@results$results_df`
 9 | #'   that contains spot classification information. Default is `"spot_class"`.
10 | #'
11 | #' @return A subset of `xe` containing only singlet-classified cells.
12 | #'
13 | #' @import spacexr
14 | #' @import Seurat
15 | #' @import dplyr
16 | #' @export
17 | #'
18 | get_singlet_data <- function(
19 |     xe,
20 |     rctd,
21 |     spot_class_key = "spot_class"
22 | ){
23 |   results_df <- rctd@results$results_df
24 |   singlets_ids <- results_df %>% filter(.data[[spot_class_key]] == "singlet") %>% rownames()
25 |   xe <- subset(xe, cells = singlets_ids)
26 |   return(xe)
27 | }
28 | 


--------------------------------------------------------------------------------
/R/visualization.R:
--------------------------------------------------------------------------------
  1 | #' Generate Pie Plot Data Frame
  2 | #'
  3 | #' This function generates a dsata frame for creating pie plots, where each row corresponds to a spatial cell with information about cell types, weights, and coordinates. The function computes weights for each cell type based on the first and second cell type annotations from RCTD results and supports different modes of analysis.
  4 | #'
  5 | #' @param rctd An object containing RCTD results, including spatial and cell type data.
  6 | #' @param mode A character string specifying the mode of computation. Currently, only `"doublet"` mode is implemented. Available options include:
  7 | #'   \itemize{
  8 | #'     \item `"doublet"`: Computes weights for first and second cell types for doublet cells.
  9 | #'     \item `"all"`: Not yet implemented.
 10 | #'     \item `"by_spot_class"`: Not yet implemented.
 11 | #'   }
 12 | #'
 13 | #' @return A data frame containing the following columns:
 14 | #'   \itemize{
 15 | #'     \item `cell_id`: Unique identifier for each spatial cell.
 16 | #'     \item `x`, `y`: Spatial coordinates of the cell.
 17 | #'     \item `weight_first_type`, `weight_second_type`: Weights for the first and second cell types.
 18 | #'     \item `first_type`, `second_type`: Annotations of the first and second cell types.
 19 | #'     \item `w1_larger_w2`: Logical indicator of whether the first type weight is larger than the second type weight.
 20 | #'     \item One column for each cell type, containing the computed weight of that cell type for each spatial cell.
 21 | #'   }
 22 | #'
 23 | #' @note Only the `"doublet"` mode is currently implemented. Other modes will result in an error.
 24 | #'
 25 | #' @importFrom dplyr %>% select
 26 | #'
 27 | #' @export
 28 | 
 29 | get_pieplot_df <- function(
 30 |     rctd,
 31 |     mode = c("doublet", "all", "by_spot_class")[1]
 32 | ){
 33 | 
 34 |   cell_types <- rctd@cell_type_info[[1]][[2]]
 35 |   cell_types_N <- length(cell_types)
 36 | 
 37 |   pie_df <- rctd@results$results_df %>%
 38 |     select(spot_class, weight_first_type, weight_second_type, first_type, second_type, w1_larger_w2, x, y)
 39 |   pie_df$cell_id <- rownames(pie_df)
 40 | 
 41 |   if(mode == "doublet"){
 42 |     pie_celltype_weigts <- apply(pie_df, 1, FUN = function(row){
 43 |       res <- rep(0, cell_types_N)
 44 |       names(res) <- cell_types
 45 |       ft <- row["first_type"] %>% unname()
 46 |       st <- row["second_type"] %>% unname()
 47 |       w1 <- row["weight_first_type"] %>% as.numeric()
 48 | 
 49 |       if(!is.na(ft)){
 50 |         if(is.na(st)){
 51 |           res[ft] <- 1
 52 |         } else {
 53 |           res[ft] <- w1
 54 |           res[st] <- 1 - w1
 55 |         }
 56 |       } else {
 57 |         return(res)
 58 |       }
 59 |       return(res)
 60 | 
 61 |     }) %>% t()
 62 |   } else {
 63 |     stop("Only doublet mode is implemented so far!")
 64 |   }
 65 | 
 66 |   pie_celltype_weigts <- as.data.frame(pie_celltype_weigts)
 67 |   pie_celltype_weigts$cell_id <- rownames(pie_celltype_weigts)
 68 | 
 69 | 
 70 |   # Merge the two data frames by `cell_id`
 71 |   merged_pie_df <- merge(
 72 |     pie_df,
 73 |     pie_celltype_weigts,
 74 |     by = "cell_id",
 75 |     all.x = TRUE  # Keep all rows from pie_df
 76 |   )
 77 |   rownames(merged_pie_df) <- merged_pie_df$cell_id
 78 |   return(merged_pie_df)
 79 | }
 80 | 
 81 | #' Crop pie_df Data to a Specific Cell
 82 | #'
 83 | #' This function filters a data frame to retain only the rows within a specified radius around a given cell.
 84 | #'
 85 | #' @param df A data frame containing spatial cell information.
 86 | #' @param cell_id The unique identifier of the central cell.
 87 | #' @param radius A numeric value specifying the radius for cropping (default: 100).
 88 | #'
 89 | #' @importFrom dplyr %>% filter
 90 | #' @return A filtered data frame containing only the cells within the specified radius.
 91 | #' @export
 92 | #'
 93 | crop_pie_df_to_cell <- function(
 94 |     df,
 95 |     cell_id,
 96 |     radius = 100
 97 | ){
 98 |   x_center <- df[cell_id, "x"]
 99 |   y_center <- df[cell_id, "y"]
100 |   print(x_center)
101 |   print(y_center)
102 |   result <- df %>% filter(
103 |     (x < (x_center + radius) & x > (x_center - radius) &
104 |        y < (y_center + radius) & y > (y_center - radius))
105 |   )
106 |   return(result)
107 | }
108 | 
109 | #' Generate a Pie Plot
110 | #'
111 | #' This function creates a pie plot from a given data frame containing spatial cell type proportions.
112 | #'
113 | #' @param pie_df A data frame containing spatial cell type proportions and coordinates.
114 | #' @param cols An optional named vector of colors for cell types.
115 | #' @param pie_scale A numeric value to scale pie sizes (default: 1).
116 | #'
117 | #' @return A ggplot object representing the spatial pie chart.
118 | #'
119 | #' @importFrom scatterpie geom_scatterpie
120 | #' @importFrom ggplot2 ggplot coord_fixed scale_color_manual scale_fill_manual aes theme_void guides
121 | #' @importFrom dplyr %>%
122 | 
123 | #' @export
124 | 
125 | plot_pie <- function(pie_df, cols = NULL, pie_scale = 1){
126 | 
127 |   cell_types <- unique(c(pie_df$first_type,pie_df$second_type)) %>% as.vector()
128 |   cell_types <- cell_types[!is.na(cell_types)]
129 | 
130 |   p <- ggplot() +
131 |     geom_scatterpie(
132 |       aes(x = x, y = -y, group = cell_id, color = first_type),  # Define the position and grouping
133 |       data = pie_df,
134 |       cols = cell_types ,               # Specify columns for the pie slices
135 |       pie_scale = pie_scale
136 |     )
137 | 
138 |   pie_df_cell_type <- pie_df %>% filter(spot_class != "singlet")
139 | 
140 |   if(!is.null(pie_df_cell_type)){
141 |     pie_df_cell_type <- pie_df_cell_type %>%
142 |       mutate(
143 |         second_type_spot_class = case_when(
144 |           spot_class %in% c("reject", "doublet_uncertain") ~ as.character(spot_class),
145 |           spot_class == "doublet_certain" ~ as.character(second_type),
146 |           TRUE ~ as.character(second_type)
147 |         )
148 |       )
149 | 
150 |     p <- p + geom_point(data = pie_df_cell_type, aes(x = x, y = -y, pch = spot_class, color = second_type_spot_class), size = 3*pie_scale)
151 |   }
152 | 
153 |   if(!is.null(cols)){
154 |     cols["doublet_uncertain"] <- "black"
155 |     cols["reject"] <- "black"
156 | 
157 |     p <- p +
158 |       scale_color_manual(values = cols) +
159 |       scale_fill_manual(values = cols)
160 |   }
161 | 
162 |   p <- p + scale_shape_manual(values = c("doublet_certain" = 16, "doublet_uncertain" = 16, "reject" = 4))
163 | 
164 |   p <- p +
165 |     coord_fixed() +
166 |     theme_void() +
167 |     guides(color = "none")  # Hide the color legend but keep the fill legend
168 | 
169 |   return(p)
170 | }
171 | 
172 | 
173 | #' Plot Pie Chart by Coordinate Range
174 | #'
175 | #' This function filters the pie data frame by x and y limits and generates a pie plot.
176 | #'
177 | #' @param pie_df A data frame containing spatial cell type proportions and coordinates.
178 | #' @param x_lims A numeric vector specifying x-axis limits.
179 | #' @param y_lims A numeric vector specifying y-axis limits.
180 | #' @param cols An optional named vector of colors for cell types.
181 | #' @param pie_scale A numeric value to scale pie sizes (default: 1).
182 | #'
183 | #' @importFrom dplyr %>% filter
184 | #' @return A ggplot object representing the cropped spatial pie chart.
185 | #' @export
186 | 
187 | plot_pie_by_coordinates <- function(
188 |     pie_df,
189 |     x_lims, y_lims,
190 |     cols = NULL,
191 |     pie_scale = 1
192 | ){
193 | 
194 |   pie_df_crop <- pie_df %>% filter(x > min(x_lims) & x < max(x_lims) & y > min(y_lims) & y < max(y_lims))
195 |   p <- plot_pie(pie_df_crop, cols = cols, pie_scale = pie_scale)
196 | 
197 |   return(p)
198 | }
199 | 
200 | #' Plot Pie Chart Around a Specific Cell
201 | #'
202 | #' This function crops the data around a specified cell within a given radius and generates a pie plot.
203 | #'
204 | #' @param pie_df A data frame containing spatial cell type proportions and coordinates.
205 | #' @param cell_id The unique identifier of the central cell.
206 | #' @param radius A numeric value specifying the radius for cropping (default: 100).
207 | #' @param pie_scale A numeric value to scale pie sizes (default: 1).
208 | #' @param cols An optional named vector of colors for cell types.
209 | #' @param DO_highlight_cell A logical indicating whether to highlight the central cell (default: TRUE).
210 | #'
211 | #' @importFrom ggplot2 annotate
212 | #' @return A ggplot object representing the spatial pie chart around the specified cell.
213 | #' @export
214 | 
215 | plot_pie_around_cell <- function(
216 |     pie_df,
217 |     cell_id,
218 |     radius = 100,
219 |     pie_scale = 1,
220 |     cols = NULL,
221 |     DO_highlight_cell = TRUE
222 | ){
223 | 
224 |   pie_df_crop <- crop_pie_df_to_cell(pie_df, cell_id = cell_id, radius = radius)
225 |   p <- plot_pie(pie_df_crop, cols = cols, pie_scale = pie_scale)
226 | 
227 |   if(DO_highlight_cell){
228 |     x_center <- pie_df_crop[cell_id, "x"]
229 |     y_center <- pie_df_crop[cell_id, "y"]
230 |     p <- p +
231 |         annotate("point", x = x_center, y = -y_center, color = "black", size = 1, shape = 8)
232 |   }
233 | 
234 |   return(p)
235 | }
236 | 
237 | #' Save Pie Plot to File
238 | #'
239 | #' This function saves a given ggplot object as an image file.
240 | #'
241 | #' @param p A ggplot object representing the pie chart.
242 | #' @param filename A character string specifying the output file name.
243 | #' @param ... Additional arguments passed to `ggsave()`, such as width, height, and dpi.
244 | #'
245 | #' @return Saves the plot as a file with a transparent background.
246 | #' @export
247 | 
248 | save_pieplot <- function(
249 |     p,
250 |     filename,
251 |     ...
252 | ){
253 |   ggsave(filename, plot = p, bg = "transparent", ... )
254 | }
255 | 
256 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | <!-- badges: start --> 
 2 | [![R-CMD-check](https://github.com/BDSC-tds/SPLIT/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/BDSC-tds/SPLIT/actions/workflows/R-CMD-check.yaml) 
 3 | [![DOI](https://img.shields.io/badge/DOI%3A-10.1101%2F2025.04.23.649965-brightgreen)](https://doi.org/10.1101/2025.04.23.649965)
 4 | [![Version](https://img.shields.io/badge/version-0.1.0-blue)](https://github.com/bdsc-tds/SPLIT/releases/tag/v0.1.0)
 5 | <!-- badges: end -->
 6 | 
 7 | # SPLIT: Spatial Purification of Layered Intracellular Transcripts
 8 | 
 9 | ![](vignettes/plots/SPLIT_schema.png)
10 | 
11 | 🚧 **This package is under active development.**\
12 | ⚡ Use the **Quick Start** guide below to get up and running quickly.\
13 | 📖 A **comprehensive tutorial** of running SPLIT on Xenium data is now available as [.Rmd](https://github.com/bdsc-tds/SPLIT/blob/main/vignettes/Run_RCTD_and_SPLIT_on_Xenium.Rmd) and [.html](https://github.com/bdsc-tds/SPLIT/blob/main/doc/Run_RCTD_and_SPLIT_on_Xenium.html).
14 | 
15 | ------------------------------------------------------------------------
16 | 
17 | ## 📦 Installation
18 | 
19 | To install SPLIT from GitHub:
20 | 
21 | ``` r
22 | # Install SPLIT
23 | remotes::install_github("bdsc-tds/SPLIT")
24 | ```
25 | 
26 | ## 🚀 Quick Start
27 | 
28 | ⚠️ **IMPORTANT:**\
29 | SPLIT currently requires **doublet-mode** RCTD results from the original [spacexr GitHub repository](https://github.com/dmcable/spacexr) or its faster [HD fork](https://github.com/jpromeror/spacexr/tree/HD), **not** from the newly released [Bioconductor version](https://www.bioconductor.org/packages/release/bioc/html/spacexr.html).\
30 | 🚧 **Compatibility with Bioconductor's spacexr is coming soon.**
31 | 
32 | If you already have your **Xenium** dataset as a Seurat object (`xe`) and **RCTD** results from **doublet-mode** decomposition in `RCTD`, you can run SPLIT purification like this:
33 | 
34 | ```{r}
35 | library(SPLIT)
36 | library(spacexr)
37 | library(dplyr)
38 | library(Seurat)
39 | library(ggplot2)
40 | 
41 | # Post-process RCTD output
42 | RCTD <- SPLIT::run_post_process_RCTD(RCTD)
43 | 
44 | # Run SPLIT purification
45 | res_split <- SPLIT::purify(
46 |   counts = GetAssayData(xe, assay = 'Xenium', layer = 'counts'), # or any gene x cells counts matrix
47 |   rctd = RCTD,
48 |   DO_purify_singlets = TRUE # optional
49 | )
50 | 
51 | # Create a purified Seurat object
52 | xe_purified <- CreateSeuratObject(
53 |   counts = res_split$purified_counts,
54 |   meta.data = res_split$cell_meta,
55 |   assay = "Xenium"
56 | )
57 | 
58 | # Optional: Filter, normalize and visualize
59 | xe_purified <- subset(xe_purified, subset = nCount_Xenium > 5)
60 | xe_purified <- xe_purified %>%
61 |   SCTransform(assay = "Xenium") %>%
62 |   RunPCA() %>%
63 |   RunUMAP(dims = 1:20)
64 | UMAPPlot(xe_purified, group.by = c("first_type"), label = T, repel = T) + theme(aspect.ratio = 1)
65 | ```
66 | 
67 | ## Citation
68 | 
69 | If you use **SPLIT** in your work, please cite:
70 | 
71 | > **From Transcripts to Cells: Dissecting Sensitivity, Signal Contamination, and Specificity in Xenium Spatial Transcriptomics**\
72 | > Mariia Bilous, Daria Buszta, Jonathan Bac, Senbai Kang, Yixing Dong, Stephanie Tissot, Sylvie Andre, Marina Alexandre-Gaveta, Christel Voize, Solange Peters, Krisztian Homicsko, Raphael Gottardo\
73 | > *bioRxiv* (2025). <https://doi.org/10.1101/2025.04.23.649965>
74 | 
75 | ## Contact
76 | 
77 | If you have any questions about the package, feel free to [open an issue](https://github.com/bdsc-tds/SPLIT/issues) or contact **Mariia Bilous** at [Mariia.Bilous\@chuv.ch](mailto:Mariia.Bilous@chuv.ch).
78 | 


--------------------------------------------------------------------------------
/doc/Run_RCTD_and_SPLIT_on_Xenium.R:
--------------------------------------------------------------------------------
  1 | ## ----libs, message=FALSE------------------------------------------------------
  2 | if(!requireNamespace("spacexr", quietly = TRUE)){
  3 |   remotes::install_github("dmcable/spacexr") ## or remotes::install_github("jpromeror/spacexr@HD") for implementation of the doublet mode.
  4 | }
  5 | library(spacexr)
  6 | 
  7 | if(!requireNamespace("SPLIT", quietly = TRUE)){
  8 |   remotes::install_github("bdsc-tds/SPLIT") 
  9 | }
 10 | library(SPLIT)
 11 | 
 12 | library(dplyr)
 13 | library(Seurat)
 14 | library(readxl)
 15 | library(SingleCellExperiment)
 16 | library(httr)
 17 | library(ggplot2)
 18 | 
 19 | ## ----load-chormium-metadata---------------------------------------------------
 20 | # read metadata
 21 | url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-023-43458-x/MediaObjects/41467_2023_43458_MOESM4_ESM.xlsx"
 22 | temp_file <- tempfile(fileext = ".xlsx")
 23 | GET(url, write_disk(temp_file, overwrite = TRUE))
 24 | 
 25 | chrom_metadata <- read_excel(temp_file, sheet = 1) %>% as.data.frame()
 26 | rownames(chrom_metadata) <- chrom_metadata$Barcode
 27 | 
 28 | ## ----load-chromium------------------------------------------------------------
 29 | # read Chromium 
 30 | url <- "https://cf.10xgenomics.com/samples/cell-exp/7.0.1/Chromium_FFPE_Human_Breast_Cancer_Chromium_FFPE_Human_Breast_Cancer/Chromium_FFPE_Human_Breast_Cancer_Chromium_FFPE_Human_Breast_Cancer_count_sample_filtered_feature_bc_matrix.h5"
 31 | temp_file <- tempfile(fileext = ".h5")
 32 | GET(url, write_disk(temp_file, overwrite = TRUE))
 33 | 
 34 | chrom_counts <- Read10X_h5(temp_file)
 35 | chrom <- CreateSeuratObject(counts = chrom_counts, assay = "RNA", meta.data = chrom_metadata)
 36 | 
 37 | chrom$QC <- !is.na(chrom$Annotation)
 38 | chrom$is_hybrid <- grepl("Hybrid", chrom$Annotation, ignore.case = TRUE)
 39 | chrom <- subset(chrom, subset = QC == TRUE & is_hybrid == FALSE) # remove not annotated cells that cells that have sign of doublets
 40 | 
 41 | 
 42 | ## ----class-df-----------------------------------------------------------------
 43 | cell_type_to_class <- c(
 44 |   "B_Cells" = "B cell",
 45 |   "CD4+_T_Cells" = "T cell",
 46 |   "CD8+_T_Cells" = "T cell",
 47 |   "IRF7+_DCs" = "Myeloid",
 48 |   "LAMP3+_DCs" = "Myeloid",
 49 |   "Macrophages_1" = "Myeloid",
 50 |   "Macrophages_2" = "Myeloid",
 51 |   "Mast_Cells" = "Myeloid",
 52 |   "DCIS 1" = "Epithelial",
 53 |   "DCIS 2" = "Epithelial",
 54 |   "Invasive_Tumor" = "Epithelial",
 55 |   "Prolif_Invasive_Tumor" = "Epithelial",
 56 |   "Myoepi_ACTA2+" = "Myoepithelial",
 57 |   "Myoepi_KRT15+" = "Myoepithelial",
 58 |   "Stromal" = "Stromal",
 59 |   "Perivascular-Like" = "Stromal",
 60 |   "Endothelial" = "Endothelial"
 61 | )
 62 | 
 63 | class_df <- data.frame(class = cell_type_to_class)
 64 | 
 65 | # and define colors for reproducibility 
 66 | library(RColorBrewer)
 67 | 
 68 | cell_types <- unique(chrom$Annotation)
 69 | colors <- brewer.pal(n = max(3, min(length(cell_types), 12)), name = "Set3")
 70 | # Recycle colors if not enough
 71 | colors <- rep(colors, length.out = length(cell_types))
 72 | pal <- setNames(colors, cell_types)
 73 | 
 74 | ## ----load-xenium--------------------------------------------------------------
 75 | if(!requireNamespace("STexampleData", quietly = TRUE))
 76 |   remotes::install_github("lmweber/STexampleData")
 77 | xe_full_seu <- STexampleData::Janesick_breastCancer_Xenium_rep1() 
 78 | 
 79 | ## Convert to Seurat to stay consistent with chromium object
 80 | sp_coords <- spatialCoords(xe_full_seu)
 81 | colnames(sp_coords) <- c("ST_1", "ST_2")
 82 | 
 83 | xe_full <- CreateSeuratObject(
 84 |   counts = counts(xe_full_seu),
 85 |   assay = "Xenium",
 86 |   meta.data = as.data.frame(colData(xe_full_seu))
 87 | )
 88 | 
 89 | xe_full[["spatial"]] <- CreateDimReducObject(sp_coords, assay = "Xenium", key = "ST_")
 90 | 
 91 | xe_full$x <- sp_coords[,1]
 92 | xe_full$y <- sp_coords[,2]
 93 | rm(xe_full_seu)
 94 | 
 95 | ## ----downsampling-------------------------------------------------------------
 96 | DO_subset_xe <- FALSE 
 97 | X_lim <- c(6000, Inf) # cropping area 
 98 | Y_lim <- c(4000, Inf) # cropping area 
 99 | 
100 | if(DO_subset_xe){
101 |   xe <- subset(xe_full, subset = x > min(X_lim) & x < max(X_lim) & y > min(Y_lim) & y < max(Y_lim))
102 | } else {
103 |   xe <- xe_full
104 | }
105 | 
106 | ## ----rctd---------------------------------------------------------------------
107 | DO_run_RCTC <- FALSE # FALSE to load pre-computed results
108 | 
109 | common_genes <- intersect(rownames(xe), rownames(chrom))
110 | ref_labels <- chrom$Annotation %>% as.factor()
111 | 
112 | ref.obj <- Reference(GetAssayData(chrom, "RNA", "counts")[common_genes, ],
113 |                      cell_types = ref_labels, min_UMI = 10, require_int = TRUE)
114 | 
115 | test.obj <- SpatialRNA(coords = xe@reductions$spatial@cell.embeddings %>% as.data.frame(),
116 |                        counts = GetAssayData(xe, assay = "Xenium", layer = "counts")[common_genes, ],
117 |                        require_int = TRUE)
118 | 
119 | if(!exists("class_df")) 
120 |   class_df <- NULL
121 | 
122 | RCTD <- create.RCTD(
123 |   test.obj, 
124 |   ref.obj, 
125 |   UMI_min = 10, 
126 |   counts_MIN = 10, 
127 |   UMI_min_sigma = 100,
128 |   max_cores = BiocParallel::multicoreWorkers() - 1, 
129 |   CELL_MIN_INSTANCE = 25,
130 |   class_df = class_df # highly recommended 
131 |   )
132 | 
133 | if(DO_run_RCTC){
134 |   RCTD <- run.RCTD(RCTD, doublet_mode = "doublet")
135 |   saveRDS(RCTD, "~/precomp_rctd_class_aware.rds")
136 | } else {
137 |   message("reading precomp RCTD results")
138 |   
139 |   # Install googledrive if you haven't already
140 |   if (!requireNamespace("googledrive", quietly = TRUE)) {
141 |     install.packages("googledrive")
142 |   }
143 |   library(googledrive)
144 |   drive_deauth()
145 |   # Define the file ID from the Google Drive link
146 |   file_id <- "1pTUKq49JbUFwVk7vttjZIFqkx-AKznRF" #"1DCalFIZJywOvrSGBSPqrHh9QINeQp8aq"
147 |   local_path <- tempfile(fileext = ".rds")
148 |   drive_download(as_id(file_id), path = local_path, overwrite = TRUE)
149 |   RCTD <- readRDS(local_path)
150 |   
151 | }
152 | 
153 | 
154 | ## ----post-rctd----------------------------------------------------------------
155 | RCTD <- SPLIT::run_post_process_RCTD(RCTD)
156 | xe <- AddMetaData(xe, RCTD@results$results_df)
157 | xe <- subset(xe, subset = nCount_Xenium >= 10)
158 | 
159 | cat("Proprtion of spot classes")
160 | (xe$spot_class %>% table())/ncol(xe)*100 
161 | 
162 | ## ----plot-raw-xenium, fig.width=16, message=FALSE-----------------------------
163 | 
164 | xe <- xe %>% SCTransform(assay = "Xenium", verbose = FALSE) %>% RunPCA(verbose = FALSE) %>% RunUMAP(dims = 1:50, verbose = FALSE)
165 | p1 <- UMAPPlot(xe, group.by = "first_type", label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none")
166 | p2 <- UMAPPlot(xe, group.by = "second_type", cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
167 | p3 <- UMAPPlot(xe, group.by = "spot_class") + theme_void() + theme(aspect.ratio = 1, legend.position = "right")
168 |  
169 | p1 | p2 | p3
170 | 
171 | 
172 | ## ----spatial-plot, fig.width=12, warning=FALSE--------------------------------
173 | DimPlot(xe, reduction = "spatial", group.by = "first_type", raster = TRUE, cols = pal) + coord_fixed()
174 | 
175 | ## ----SPLIT, message=FALSE-----------------------------------------------------
176 | # Run SPLIT purification
177 | res_split <- SPLIT::purify(
178 |   counts = GetAssayData(xe, assay = 'Xenium', layer = 'counts'), # or any gene x cells counts matrix
179 |   rctd = RCTD,
180 |   DO_purify_singlets = TRUE # Optional. If TRUE, singlets with an available secondary type are purified the same way as doublets_certain; otherwise, left unchanged.
181 |   )
182 | 
183 | # Create a purified Seurat object
184 | xe_purified <- CreateSeuratObject(
185 |   counts = res_split$purified_counts,
186 |   meta.data = res_split$cell_meta,
187 |   assay = "Xenium"
188 | )
189 | 
190 | # Optional: Filter, normalize and visualize
191 | xe_purified <- subset(xe_purified, subset = nCount_Xenium > 5)
192 | xe_purified <- xe_purified %>%
193 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
194 |   RunPCA(verbose = FALSE) %>%
195 |   RunUMAP(dims = 1:20, verbose = FALSE)
196 | 
197 | #UMAPPlot(xe_purified, group.by = c("first_type"), label = T, repel = T) + theme_void() + theme(aspect.ratio = 1, legend.position = "none")
198 | 
199 | ## ----plot-raw-split-purified, fig.width=12, message=FALSE---------------------
200 | p1 <- UMAPPlot(xe, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("Raw Xenium data")
201 | 
202 | p2 <- UMAPPlot(xe_purified, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "right") + ggtitle("SPLIT-purified Xenium data")
203 | 
204 | p3 <- UMAPPlot(xe_purified, group.by = c("spot_class")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-purified Xenium data colored by spot class")
205 | p4 <- UMAPPlot(xe_purified, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-purified Xenium data colored by purification status")
206 | 
207 | 
208 | (p1|p2) 
209 | (p3|p4)
210 | 
211 | 
212 | ## ----spatial-nw---------------------------------------------------------------
213 | sp_nw <- SPLIT::build_spatial_network(
214 |   xe, 
215 |   reduction = "spatial",
216 |   dims = 1:2, 
217 |   DO_prune = TRUE, 
218 |   rad_pruning = 15, # remove connections further than 15um
219 |   k_knn = 20
220 |   )
221 | 
222 | sp_nw <- SPLIT::add_spatial_metric(spatial_neighborhood = sp_nw, rctd = RCTD)
223 | sp_neigh_df <- SPLIT::neighborhood_analysis_to_metadata(sp_nw)
224 | 
225 | xe <- AddMetaData(xe, sp_neigh_df)
226 | 
227 | rm(sp_nw, sp_neigh_df)
228 | 
229 | ## ----plot-neigh-weight-second-type, message=FALSE-----------------------------
230 | # Plot magnitude of local diffusion on UMAP
231 | FeaturePlot(xe, features = c("neighborhood_weights_second_type")) + theme_void() + theme(aspect.ratio = 1)
232 | 
233 | # Plot distribution of local diffusion value
234 | hist(xe$neighborhood_weights_second_type)
235 | 
236 | # Plot distribution of local diffusion value per `spot_class`
237 | xe@meta.data %>% filter(!is.na(spot_class)) %>% 
238 |   ggplot(aes(x = spot_class, y = neighborhood_weights_second_type, color = spot_class)) + geom_boxplot() + labs(title = "Local neighbohood diffusion by spot class") + theme_minimal() 
239 | 
240 | ## ----spatially-aware-split, fig.width=12, message=FALSE-----------------------
241 | xe_purified_balanced_score <- SPLIT::balance_raw_and_purified_data_by_score(
242 |   xe_raw = xe,
243 |   xe_purified = xe_purified,
244 |   default_assay = "Xenium", # should be param, but can wait 
245 |   spot_class_key = "spot_class",
246 |   threshold = 0.05, # lower -> more cells will be purified
247 |   score_name = "neighborhood_weights_second_type"
248 | )
249 | 
250 | # Optional: Filter, normalize and visualize
251 | xe_purified_balanced_score <- subset(xe_purified_balanced_score, subset = nCount_Xenium > 5)
252 | xe_purified_balanced_score <- xe_purified_balanced_score %>%
253 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
254 |   RunPCA(verbose = FALSE) %>%
255 |   RunUMAP(dims = 1:20, verbose = FALSE)
256 | 
257 | p5 <- UMAPPlot(xe_purified_balanced_score, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("Spatially-aware SPLIT-purified Xenium data")
258 | p6 <- UMAPPlot(xe_purified_balanced_score, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
259 | 
260 | p5|p6
261 | 
262 | ## ----transcriptomics-nw-------------------------------------------------------
263 | tr_nw <- build_transcriptomics_network(
264 |   xe,
265 |   DO_prune = FALSE,
266 |   k_knn = 100
267 | )
268 | tr_nw <- add_transcriptomics_metric(transcriptomics_neighborhood = tr_nw, rctd = RCTD) 
269 | tr_neigh_df <- neighborhood_analysis_to_metadata(tr_nw)
270 | xe <- AddMetaData(xe, tr_neigh_df)
271 | 
272 | rm(tr_nw, tr_neigh_df)
273 | 
274 | ## ----run-split-shift----------------------------------------------------------
275 | xe_split_shift <- SPLIT::balance_raw_and_purified_data_by_score(
276 |   xe_raw = xe,
277 |   xe_purified = xe_purified,
278 |   default_assay = "Xenium",
279 |   spot_class_key = "spot_class",
280 |   threshold = 0.05, # to be consistent with spatially-aware SPLIT results
281 |   score_name = "neighborhood_weights_second_type",
282 |   DO_swap_lables = TRUE
283 | )
284 | 
285 | # Optional: Filter, normalize and visualize
286 | xe_split_shift <- subset(xe_split_shift, subset = nCount_Xenium > 5)
287 | xe_split_shift <- xe_split_shift %>%
288 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
289 |   RunPCA(verbose = FALSE) %>%
290 |   RunUMAP(dims = 1:20, verbose = FALSE)
291 | 
292 | ## ----vis-split-shift, fig.width=16, message=FALSE-----------------------------
293 | p7 <- UMAPPlot(xe_split_shift, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("SPLIT-shift-purified Xenium data")
294 | p8 <- UMAPPlot(xe_split_shift, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
295 | p9 <- UMAPPlot(xe_split_shift, group.by = c("swap")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
296 | 
297 | p7|p8|p9
298 | 
299 | ## ----vis-split-shift-swap-facet, fig.width=12, message=FALSE------------------
300 | # Visualize results faceted by swapping status
301 | p10 <- UMAPPlot(xe_split_shift, group.by = c("first_type"), split.by = "swap", raster = F, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-shift-purified Xenium data faceted by lable swapping status")
302 | p10
303 | 
304 | ## ----summary-plot, fig.width=18, fig.height=7, message=FALSE------------------
305 | p1 | p2+theme(legend.position = "bottom") | p5 | p7
306 | 
307 | ## ----fig.width=12-------------------------------------------------------------
308 | pie_df <- get_pieplot_df(rctd = RCTD)
309 | cois <- xe@meta.data %>% filter(neighborhood_weights_second_type > .2, first_type %in% c("CD4+_T_Cells", "CD8+_T_Cells"), spot_class != "reject") %>% rownames()
310 | 
311 | p <- plot_pie_around_cell(pie_df, cell_id = cois[1], radius = 50, cols = pal)
312 | plot(p)
313 | 
314 | 


--------------------------------------------------------------------------------
/doc/Run_RCTD_and_SPLIT_on_Xenium.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "RCTD Annotation and SPLIT Purification for Xenium Sample"
  3 | output: 
  4 |   rmarkdown::github_document:
  5 |     toc: true
  6 |     toc_depth: 2
  7 | vignette: >
  8 |   %\VignetteIndexEntry{Run RCTD and SPLIT on Xenium}
  9 |   %\VignetteEngine{knitr::rmarkdown}
 10 |   %\VignetteEncoding{UTF-8}
 11 | ---
 12 | 
 13 | # Introduction
 14 | 
 15 | This vignette demonstrates how to annotate Xenium spatial transcriptomics data using RCTD, followed by purification with SPLIT.
 16 | 
 17 | ⚠️ **Important Notice**
 18 | 
 19 | SPLIT currently requires **doublet-mode** RCTD results generated with the original [spacexr GitHub repository](https://github.com/dmcable/spacexr) or the faster [HD fork](https://github.com/jpromeror/spacexr/tree/HD).  
 20 | 🚧 **Compatibility with the newly released [Bioconductor version](https://www.bioconductor.org/packages/release/bioc/html/spacexr.html) of spacexr is under development.**
 21 | 
 22 | # Overview
 23 | 
 24 | In this vignette, we:
 25 | 
 26 | 1. **Run RCTD Annotation**
 27 |     - We begin by running RCTD annotation on a Xenium sample, using matched Chromium data from a public 10x Genomics dataset as the reference.
 28 |     
 29 | 2. **Apply Default SPLIT Purification**
 30 |     - The default SPLIT purification method is then applied to clean the annotated Xenium sample, refining the initial annotations.
 31 | 
 32 | 3. **Apply Spatially-Aware SPLIT**
 33 |     - We apply spatially-aware SPLIT, which purifies cells showing signs of contamination based on local spatial diffusion patterns.
 34 |     
 35 | 4. **Apply SPLIT-Shift**
 36 |     - Finally, SPLIT-shift is applied to swap primary and secondary labels based on transcriptomic neighborhood heterogeneity, improving the accuracy of cell type assignments.
 37 | 
 38 | This pipeline assumes that cell type assignments — originally derived from the Chromium reference — are refined and reliable for downstream analysis.
 39 | 
 40 | 
 41 | ```{r libs, message=FALSE}
 42 | if(!requireNamespace("spacexr", quietly = TRUE)){
 43 |   remotes::install_github("dmcable/spacexr") ## or remotes::install_github("jpromeror/spacexr@HD") for implementation of the doublet mode.
 44 | }
 45 | library(spacexr)
 46 | 
 47 | if(!requireNamespace("SPLIT", quietly = TRUE)){
 48 |   remotes::install_github("bdsc-tds/SPLIT") 
 49 | }
 50 | library(SPLIT)
 51 | 
 52 | library(dplyr)
 53 | library(Seurat)
 54 | library(readxl)
 55 | library(SingleCellExperiment)
 56 | library(httr)
 57 | library(ggplot2)
 58 | ```
 59 | 
 60 | ## Load Data
 61 | 
 62 | For this vignette, we use a publicly available Xenium dataset from the 10x Genomics database, originating from:
 63 | 
 64 | > **Janesick, A., Shelansky, R., Gottscho, A.D. et al.**  
 65 | > *High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis.*  
 66 | > *Nature Communications* 14, 8353 (2023).  
 67 | > [https://doi.org/10.1038/s41467-023-43458-x](https://doi.org/10.1038/s41467-023-43458-x)
 68 | 
 69 | This dataset provides high-resolution spatial transcriptomics data suitable for downstream analysis with RCTD and SPLIT.
 70 | 
 71 | ---
 72 | 
 73 | ### Load Chromium Dataset (Reference)
 74 | 
 75 | We manually load metadata for the Chromium single-cell dataset and metadata from the same study, which will serve as the **reference** for RCTD annotation.
 76 | 
 77 | ```{r load-chormium-metadata}
 78 | # read metadata
 79 | url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-023-43458-x/MediaObjects/41467_2023_43458_MOESM4_ESM.xlsx"
 80 | temp_file <- tempfile(fileext = ".xlsx")
 81 | GET(url, write_disk(temp_file, overwrite = TRUE))
 82 | 
 83 | chrom_metadata <- read_excel(temp_file, sheet = 1) %>% as.data.frame()
 84 | rownames(chrom_metadata) <- chrom_metadata$Barcode
 85 | ```
 86 | 
 87 | Manually load Chromium from 10x 
 88 | ```{r load-chromium}
 89 | # read Chromium 
 90 | url <- "https://cf.10xgenomics.com/samples/cell-exp/7.0.1/Chromium_FFPE_Human_Breast_Cancer_Chromium_FFPE_Human_Breast_Cancer/Chromium_FFPE_Human_Breast_Cancer_Chromium_FFPE_Human_Breast_Cancer_count_sample_filtered_feature_bc_matrix.h5"
 91 | temp_file <- tempfile(fileext = ".h5")
 92 | GET(url, write_disk(temp_file, overwrite = TRUE))
 93 | 
 94 | chrom_counts <- Read10X_h5(temp_file)
 95 | chrom <- CreateSeuratObject(counts = chrom_counts, assay = "RNA", meta.data = chrom_metadata)
 96 | 
 97 | chrom$QC <- !is.na(chrom$Annotation)
 98 | chrom$is_hybrid <- grepl("Hybrid", chrom$Annotation, ignore.case = TRUE)
 99 | chrom <- subset(chrom, subset = QC == TRUE & is_hybrid == FALSE) # remove not annotated cells that cells that have sign of doublets
100 | 
101 | ```
102 | 
103 | #### Generate broader level annoation (optional, but **highly recommended** for more robust RCTD annotation)
104 | Providing higher-level cell type classes improves RCTD accuracy and **significantly** reduces the number of rejected cells, preserving more cells (rejects are excluded from the downstream analyses and do not undergo SPLIT purification).
105 | ```{r class-df}
106 | cell_type_to_class <- c(
107 |   "B_Cells" = "B cell",
108 |   "CD4+_T_Cells" = "T cell",
109 |   "CD8+_T_Cells" = "T cell",
110 |   "IRF7+_DCs" = "Myeloid",
111 |   "LAMP3+_DCs" = "Myeloid",
112 |   "Macrophages_1" = "Myeloid",
113 |   "Macrophages_2" = "Myeloid",
114 |   "Mast_Cells" = "Myeloid",
115 |   "DCIS 1" = "Epithelial",
116 |   "DCIS 2" = "Epithelial",
117 |   "Invasive_Tumor" = "Epithelial",
118 |   "Prolif_Invasive_Tumor" = "Epithelial",
119 |   "Myoepi_ACTA2+" = "Myoepithelial",
120 |   "Myoepi_KRT15+" = "Myoepithelial",
121 |   "Stromal" = "Stromal",
122 |   "Perivascular-Like" = "Stromal",
123 |   "Endothelial" = "Endothelial"
124 | )
125 | 
126 | class_df <- data.frame(class = cell_type_to_class)
127 | 
128 | # and define colors for reproducibility 
129 | library(RColorBrewer)
130 | 
131 | cell_types <- unique(chrom$Annotation)
132 | colors <- brewer.pal(n = max(3, min(length(cell_types), 12)), name = "Set3")
133 | # Recycle colors if not enough
134 | colors <- rep(colors, length.out = length(cell_types))
135 | pal <- setNames(colors, cell_types)
136 | ```
137 | 
138 | 
139 | ### Load Xenium Dataset
140 | 
141 | We load the Xenium spatial transcriptomics data using the `STexampleData` package, which provides convenient access to example spatial datasets for analysis and demonstration.
142 | 
143 | ```{r load-xenium}
144 | if(!requireNamespace("STexampleData", quietly = TRUE))
145 |   remotes::install_github("lmweber/STexampleData")
146 | xe_full_seu <- STexampleData::Janesick_breastCancer_Xenium_rep1() 
147 | 
148 | ## Convert to Seurat to stay consistent with chromium object
149 | sp_coords <- spatialCoords(xe_full_seu)
150 | colnames(sp_coords) <- c("ST_1", "ST_2")
151 | 
152 | xe_full <- CreateSeuratObject(
153 |   counts = counts(xe_full_seu),
154 |   assay = "Xenium",
155 |   meta.data = as.data.frame(colData(xe_full_seu))
156 | )
157 | 
158 | xe_full[["spatial"]] <- CreateDimReducObject(sp_coords, assay = "Xenium", key = "ST_")
159 | 
160 | xe_full$x <- sp_coords[,1]
161 | xe_full$y <- sp_coords[,2]
162 | rm(xe_full_seu)
163 | ```
164 | 
165 | #### Downsample Xenium Dataset for Faster RCTD (Optional)
166 | 
167 | Running RCTD can be time-consuming, especially on large datasets like Xenium. To speed up computation during this tutorial, we optionally provide code to downsample the dataset. We recommend spatial cropping (rather than random sampling) to preserve neighborhood structure, which is important for downstream analysis.
168 | 
169 | That said, downsampling is **not required **, we provide precomputed RCTD results below so you can skip running RCTD altogether if desired.  
170 | 
171 | ```{r downsampling}
172 | DO_subset_xe <- FALSE 
173 | X_lim <- c(6000, Inf) # cropping area 
174 | Y_lim <- c(4000, Inf) # cropping area 
175 | 
176 | if(DO_subset_xe){
177 |   xe <- subset(xe_full, subset = x > min(X_lim) & x < max(X_lim) & y > min(Y_lim) & y < max(Y_lim))
178 | } else {
179 |   xe <- xe_full
180 | }
181 | ```
182 | 
183 | # RCDT annotation
184 | Run RCTD Annotation on Xenium
185 | Running RCTD on large datasets can be computationally intensive and may take several hours. To streamline the workflow, we provide the full code for reproducibility. However, we recommend loading a pre-computed RCTD object by setting `DO_run_RCTD <- FALSE`.
186 | ```{r rctd}
187 | DO_run_RCTC <- FALSE # FALSE to load pre-computed results
188 | 
189 | common_genes <- intersect(rownames(xe), rownames(chrom))
190 | ref_labels <- chrom$Annotation %>% as.factor()
191 | 
192 | ref.obj <- Reference(GetAssayData(chrom, "RNA", "counts")[common_genes, ],
193 |                      cell_types = ref_labels, min_UMI = 10, require_int = TRUE)
194 | 
195 | test.obj <- SpatialRNA(coords = xe@reductions$spatial@cell.embeddings %>% as.data.frame(),
196 |                        counts = GetAssayData(xe, assay = "Xenium", layer = "counts")[common_genes, ],
197 |                        require_int = TRUE)
198 | 
199 | if(!exists("class_df")) 
200 |   class_df <- NULL
201 | 
202 | RCTD <- create.RCTD(
203 |   test.obj, 
204 |   ref.obj, 
205 |   UMI_min = 10, 
206 |   counts_MIN = 10, 
207 |   UMI_min_sigma = 100,
208 |   max_cores = BiocParallel::multicoreWorkers() - 1, 
209 |   CELL_MIN_INSTANCE = 25,
210 |   class_df = class_df # highly recommended 
211 |   )
212 | 
213 | if(DO_run_RCTC){
214 |   RCTD <- run.RCTD(RCTD, doublet_mode = "doublet")
215 |   saveRDS(RCTD, "~/precomp_rctd_class_aware.rds")
216 | } else {
217 |   message("reading precomp RCTD results")
218 |   
219 |   # Install googledrive if you haven't already
220 |   if (!requireNamespace("googledrive", quietly = TRUE)) {
221 |     install.packages("googledrive")
222 |   }
223 |   library(googledrive)
224 |   drive_deauth()
225 |   # Define the file ID from the Google Drive link
226 |   file_id <- "1pTUKq49JbUFwVk7vttjZIFqkx-AKznRF" #"1DCalFIZJywOvrSGBSPqrHh9QINeQp8aq"
227 |   local_path <- tempfile(fileext = ".rds")
228 |   drive_download(as_id(file_id), path = local_path, overwrite = TRUE)
229 |   RCTD <- readRDS(local_path)
230 |   
231 | }
232 | 
233 | ```
234 | 
235 | Visualize RCTD Annotation
236 | Post-process RCDT output and add reselts into Xenium object
237 | ```{r post-rctd}
238 | RCTD <- SPLIT::run_post_process_RCTD(RCTD)
239 | xe <- AddMetaData(xe, RCTD@results$results_df)
240 | xe <- subset(xe, subset = nCount_Xenium >= 10)
241 | 
242 | cat("Proprtion of spot classes")
243 | (xe$spot_class %>% table())/ncol(xe)*100 
244 | ```
245 | 
246 | ```{r plot-raw-xenium, fig.width=16, message=FALSE}
247 | 
248 | xe <- xe %>% SCTransform(assay = "Xenium", verbose = FALSE) %>% RunPCA(verbose = FALSE) %>% RunUMAP(dims = 1:50, verbose = FALSE)
249 | p1 <- UMAPPlot(xe, group.by = "first_type", label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none")
250 | p2 <- UMAPPlot(xe, group.by = "second_type", cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
251 | p3 <- UMAPPlot(xe, group.by = "spot_class") + theme_void() + theme(aspect.ratio = 1, legend.position = "right")
252 |  
253 | p1 | p2 | p3
254 | 
255 | ```
256 | Spatial Visualization
257 | ```{r spatial-plot, fig.width=12, warning=FALSE}
258 | DimPlot(xe, reduction = "spatial", group.by = "first_type", raster = TRUE, cols = pal) + coord_fixed()
259 | ```
260 | 
261 | # Purification
262 | ## SPLIT (default)
263 | This section runs the default SPLIT purification and visualizes purified data.
264 | 
265 | ```{r SPLIT, message=FALSE}
266 | # Run SPLIT purification
267 | res_split <- SPLIT::purify(
268 |   counts = GetAssayData(xe, assay = 'Xenium', layer = 'counts'), # or any gene x cells counts matrix
269 |   rctd = RCTD,
270 |   DO_purify_singlets = TRUE # Optional. If TRUE, singlets with an available secondary type are purified the same way as doublets_certain; otherwise, left unchanged.
271 |   )
272 | 
273 | # Create a purified Seurat object
274 | xe_purified <- CreateSeuratObject(
275 |   counts = res_split$purified_counts,
276 |   meta.data = res_split$cell_meta,
277 |   assay = "Xenium"
278 | )
279 | 
280 | # Optional: Filter, normalize and visualize
281 | xe_purified <- subset(xe_purified, subset = nCount_Xenium > 5)
282 | xe_purified <- xe_purified %>%
283 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
284 |   RunPCA(verbose = FALSE) %>%
285 |   RunUMAP(dims = 1:20, verbose = FALSE)
286 | 
287 | #UMAPPlot(xe_purified, group.by = c("first_type"), label = T, repel = T) + theme_void() + theme(aspect.ratio = 1, legend.position = "none")
288 | ```
289 | ### Visually compare results of Raw and SPLIT-Purified data
290 | ```{r plot-raw-split-purified, fig.width=12, message=FALSE}
291 | p1 <- UMAPPlot(xe, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("Raw Xenium data")
292 | 
293 | p2 <- UMAPPlot(xe_purified, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "right") + ggtitle("SPLIT-purified Xenium data")
294 | 
295 | p3 <- UMAPPlot(xe_purified, group.by = c("spot_class")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-purified Xenium data colored by spot class")
296 | p4 <- UMAPPlot(xe_purified, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-purified Xenium data colored by purification status")
297 | 
298 | 
299 | (p1|p2) 
300 | (p3|p4)
301 | 
302 | ```
303 | 
304 | ## Spatially-aware SPLIT
305 | SPLIT can leverage spatial information to assess the abundance of secondary signals in the local neighborhood (i.e., local diffusion potential), enabling selective decomposition only when contamination is likely. This spatially informed strategy helps prevent overcorrection of phenotypes that may be underrepresented or absent in the reference. 
306 | Specifically, we first compute the spatial neighborhood for each cell, then identify and purify cells that have sign of local diffusion of the secondary cell type.
307 | ```{r spatial-nw}
308 | sp_nw <- SPLIT::build_spatial_network(
309 |   xe, 
310 |   reduction = "spatial",
311 |   dims = 1:2, 
312 |   DO_prune = TRUE, 
313 |   rad_pruning = 15, # remove connections further than 15um
314 |   k_knn = 20
315 |   )
316 | 
317 | sp_nw <- SPLIT::add_spatial_metric(spatial_neighborhood = sp_nw, rctd = RCTD)
318 | sp_neigh_df <- SPLIT::neighborhood_analysis_to_metadata(sp_nw)
319 | 
320 | xe <- AddMetaData(xe, sp_neigh_df)
321 | 
322 | rm(sp_nw, sp_neigh_df)
323 | ```
324 | 
325 | ### Visualize local diffusion of secondary cell type 
326 | The score `neighborhood_weights_second_type` corresponds to the average weight of the secondary cell type in cell's spatial neighborhood.
327 | ```{r plot-neigh-weight-second-type, message=FALSE}
328 | # Plot magnitude of local diffusion on UMAP
329 | FeaturePlot(xe, features = c("neighborhood_weights_second_type")) + theme_void() + theme(aspect.ratio = 1)
330 | 
331 | # Plot distribution of local diffusion value
332 | hist(xe$neighborhood_weights_second_type)
333 | 
334 | # Plot distribution of local diffusion value per `spot_class`
335 | xe@meta.data %>% filter(!is.na(spot_class)) %>% 
336 |   ggplot(aes(x = spot_class, y = neighborhood_weights_second_type, color = spot_class)) + geom_boxplot() + labs(title = "Local neighbohood diffusion by spot class") + theme_minimal() 
337 | ``` 
338 | 
339 | 
340 | We now purify cells that have secondary signal in their spatial neighborhood (e.g., `neighborhood_weights_second_type`) and keep other cells unchanged
341 | ```{r spatially-aware-split, fig.width=12, message=FALSE}
342 | xe_purified_balanced_score <- SPLIT::balance_raw_and_purified_data_by_score(
343 |   xe_raw = xe,
344 |   xe_purified = xe_purified,
345 |   default_assay = "Xenium", # should be param, but can wait 
346 |   spot_class_key = "spot_class",
347 |   threshold = 0.05, # lower -> more cells will be purified
348 |   score_name = "neighborhood_weights_second_type"
349 | )
350 | 
351 | # Optional: Filter, normalize and visualize
352 | xe_purified_balanced_score <- subset(xe_purified_balanced_score, subset = nCount_Xenium > 5)
353 | xe_purified_balanced_score <- xe_purified_balanced_score %>%
354 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
355 |   RunPCA(verbose = FALSE) %>%
356 |   RunUMAP(dims = 1:20, verbose = FALSE)
357 | 
358 | p5 <- UMAPPlot(xe_purified_balanced_score, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("Spatially-aware SPLIT-purified Xenium data")
359 | p6 <- UMAPPlot(xe_purified_balanced_score, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
360 | 
361 | p5|p6
362 | ```
363 | 
364 | ## SPLIT-shift
365 | In some cases, the contamination signal is so strong that RCTD assigns the cell to its secondary cell type. To address this, we introduce SPLIT-shift—an approach that refines phenotype assignments by swapping the primary and secondary cell type labels based on transcriptional neighborhood homogeneity.
366 | 
367 | For this, we need to compute transcriptomics neighborhood 
368 | ```{r transcriptomics-nw}
369 | tr_nw <- build_transcriptomics_network(
370 |   xe,
371 |   DO_prune = FALSE,
372 |   k_knn = 100
373 | )
374 | tr_nw <- add_transcriptomics_metric(transcriptomics_neighborhood = tr_nw, rctd = RCTD) 
375 | tr_neigh_df <- neighborhood_analysis_to_metadata(tr_nw)
376 | xe <- AddMetaData(xe, tr_neigh_df)
377 | 
378 | rm(tr_nw, tr_neigh_df)
379 | ```
380 | 
381 | And then, we set `DO_swap_lables = TRUE` to allow SPLIT-shift
382 | ```{r run-split-shift}
383 | xe_split_shift <- SPLIT::balance_raw_and_purified_data_by_score(
384 |   xe_raw = xe,
385 |   xe_purified = xe_purified,
386 |   default_assay = "Xenium",
387 |   spot_class_key = "spot_class",
388 |   threshold = 0.05, # to be consistent with spatially-aware SPLIT results
389 |   score_name = "neighborhood_weights_second_type",
390 |   DO_swap_lables = TRUE
391 | )
392 | 
393 | # Optional: Filter, normalize and visualize
394 | xe_split_shift <- subset(xe_split_shift, subset = nCount_Xenium > 5)
395 | xe_split_shift <- xe_split_shift %>%
396 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
397 |   RunPCA(verbose = FALSE) %>%
398 |   RunUMAP(dims = 1:20, verbose = FALSE)
399 | ```
400 | 
401 | ### Visualize SPLIT-shift
402 | ```{r vis-split-shift, fig.width=16, message=FALSE}
403 | p7 <- UMAPPlot(xe_split_shift, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("SPLIT-shift-purified Xenium data")
404 | p8 <- UMAPPlot(xe_split_shift, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
405 | p9 <- UMAPPlot(xe_split_shift, group.by = c("swap")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
406 | 
407 | p7|p8|p9
408 | ```
409 | 
410 | ```{r vis-split-shift-swap-facet, fig.width=12, message=FALSE}
411 | # Visualize results faceted by swapping status
412 | p10 <- UMAPPlot(xe_split_shift, group.by = c("first_type"), split.by = "swap", raster = F, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-shift-purified Xenium data faceted by lable swapping status")
413 | p10
414 | ```
415 | 
416 | # Summary
417 | 
418 | To run SPLIT, you need a single-cell reference with reliable cell type labels, which is used to annotate Xenium data using RCTD in doublet mode.  
419 | We strongly recommend providing a broader-level mapping of the reference cell types to higher-level classes. This helps RCTD produce more robust results and reduces the number of rejected cells, which are excluded from downstream analysis.
420 | 
421 | After annotation, SPLIT can be run in several **combinable** modes to purify the data:
422 | 
423 | 1. **Default SPLIT** purifies all `doublets_certain`, all `doublets_uncertain`, and—if `DO_purify_singlets = TRUE` — singlets that show signs of a secondary cell type.  
424 |    Rejected cells are always removed as unreliable.
425 | 
426 | 2. **Spatially-aware SPLIT** purifies any cells that show signs of contamination based on local spatial diffusion — i.e., having secondary signal in their spatial neighborhood.
427 | 
428 | 3. **SPLIT-shift** allows swapping the primary and secondary cell type labels based on transcriptional neighborhood homogeneity.
429 | 
430 | ```{r summary-plot, fig.width=18, fig.height=7, message=FALSE}
431 | p1 | p2+theme(legend.position = "bottom") | p5 | p7
432 | ```
433 | 
434 | # Bonus: Pie visualization
435 | 
436 | We can visualize individual cells as pie charts, where colors represent the primary and secondary cell types, and sector areas are proportional to their respective weights. 
437 | 
438 | In this example, we display a randomly selected T cell that exhibits strong local diffusion of its secondary cell type (ie., there is high signal of secondary cell type in its spatial neighborhood). The plot includes this cell and all neighboring cells within a 50μm radius.
439 | 
440 | ```{r, fig.width=12}
441 | pie_df <- get_pieplot_df(rctd = RCTD)
442 | cois <- xe@meta.data %>% filter(neighborhood_weights_second_type > .2, first_type %in% c("CD4+_T_Cells", "CD8+_T_Cells"), spot_class != "reject") %>% rownames()
443 | 
444 | p <- plot_pie_around_cell(pie_df, cell_id = cois[1], radius = 50, cols = pal)
445 | plot(p)
446 | ```
447 | 
448 | The **border color** of each pie indicates the **primary cell type**. Additional markers within the pies reflect RCTD spot class (ie., annotation confidence):
449 | 
450 | - A **black dot** indicates a `doublet_uncertain` cell.  
451 | - A **colored dot** (matching the secondary cell type) indicates a `doublet_certain`.  
452 | - A **cross (×)** marks a `reject` cell.  
453 | - Pies with no internal symbol represent confidently assigned `singlet` cells.
454 | 


--------------------------------------------------------------------------------
/inst/CITATION:
--------------------------------------------------------------------------------
 1 | citHeader("To cite SPLIT in publications, please use:")
 2 | 
 3 | bibentry(
 4 |   bibtype = "Article",
 5 |   title = "From Transcripts to Cells: Dissecting Sensitivity, Signal Contamination, and Specificity in Xenium Spatial Transcriptomics",
 6 |   author = c(
 7 |     person("Mariia", "Bilous"),
 8 |     person("Daria", "Buszta"),
 9 |     person("Jonathan", "Bac"),
10 |     person("Senbai", "Kang"),
11 |     person("Yixing", "Dong"),
12 |     person("Stephanie", "Tissot"),
13 |     person("Sylvie", "Andre"),
14 |     person("Marina", "Alexandre-Gaveta"),
15 |     person("Christel", "Voize"),
16 |     person("Solange", "Peters"),
17 |     person("Krisztian", "Homicsko"),
18 |     person("Raphael", "Gottardo")
19 |   ),
20 |   year = "2025",
21 |   journal = "bioRxiv",
22 |   note = "doi:10.1101/2025.04.23.649965",
23 |   url = "https://doi.org/10.1101/2025.04.23.649965"
24 | )
25 | 


--------------------------------------------------------------------------------
/vignettes/Run_RCTD_and_SPLIT_on_Xenium.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "RCTD Annotation and SPLIT Purification for Xenium Sample"
  3 | output: 
  4 |   rmarkdown::github_document:
  5 |     toc: true
  6 |     toc_depth: 2
  7 | vignette: >
  8 |   %\VignetteIndexEntry{Run RCTD and SPLIT on Xenium}
  9 |   %\VignetteEngine{knitr::rmarkdown}
 10 |   %\VignetteEncoding{UTF-8}
 11 | ---
 12 | 
 13 | # Introduction
 14 | 
 15 | This vignette demonstrates how to annotate Xenium spatial transcriptomics data using RCTD, followed by purification with SPLIT.
 16 | 
 17 | ⚠️ **Important Notice**
 18 | 
 19 | SPLIT currently requires **doublet-mode** RCTD results generated with the original [spacexr GitHub repository](https://github.com/dmcable/spacexr) or the faster [HD fork](https://github.com/jpromeror/spacexr/tree/HD).  
 20 | 🚧 **Compatibility with the newly released [Bioconductor version](https://www.bioconductor.org/packages/release/bioc/html/spacexr.html) of spacexr is under development.**
 21 | 
 22 | # Overview
 23 | 
 24 | In this vignette, we:
 25 | 
 26 | 1. **Run RCTD Annotation**
 27 |     - We begin by running RCTD annotation on a Xenium sample, using matched Chromium data from a public 10x Genomics dataset as the reference.
 28 |     
 29 | 2. **Apply Default SPLIT Purification**
 30 |     - The default SPLIT purification method is then applied to clean the annotated Xenium sample, refining the initial annotations.
 31 | 
 32 | 3. **Apply Spatially-Aware SPLIT**
 33 |     - We apply spatially-aware SPLIT, which purifies cells showing signs of contamination based on local spatial diffusion patterns.
 34 |     
 35 | 4. **Apply SPLIT-Shift**
 36 |     - Finally, SPLIT-shift is applied to swap primary and secondary labels based on transcriptomic neighborhood heterogeneity, improving the accuracy of cell type assignments.
 37 | 
 38 | This pipeline assumes that cell type assignments — originally derived from the Chromium reference — are refined and reliable for downstream analysis.
 39 | 
 40 | 
 41 | ```{r libs, message=FALSE}
 42 | if(!requireNamespace("spacexr", quietly = TRUE)){
 43 |   remotes::install_github("dmcable/spacexr") ## or remotes::install_github("jpromeror/spacexr@HD") for implementation of the doublet mode.
 44 | }
 45 | library(spacexr)
 46 | 
 47 | if(!requireNamespace("SPLIT", quietly = TRUE)){
 48 |   remotes::install_github("bdsc-tds/SPLIT") 
 49 | }
 50 | library(SPLIT)
 51 | 
 52 | library(dplyr)
 53 | library(Seurat)
 54 | library(readxl)
 55 | library(SingleCellExperiment)
 56 | library(httr)
 57 | library(ggplot2)
 58 | ```
 59 | 
 60 | ## Load Data
 61 | 
 62 | For this vignette, we use a publicly available Xenium dataset from the 10x Genomics database, originating from:
 63 | 
 64 | > **Janesick, A., Shelansky, R., Gottscho, A.D. et al.**  
 65 | > *High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis.*  
 66 | > *Nature Communications* 14, 8353 (2023).  
 67 | > [https://doi.org/10.1038/s41467-023-43458-x](https://doi.org/10.1038/s41467-023-43458-x)
 68 | 
 69 | This dataset provides high-resolution spatial transcriptomics data suitable for downstream analysis with RCTD and SPLIT.
 70 | 
 71 | ---
 72 | 
 73 | ### Load Chromium Dataset (Reference)
 74 | 
 75 | We manually load metadata for the Chromium single-cell dataset and metadata from the same study, which will serve as the **reference** for RCTD annotation.
 76 | 
 77 | ```{r load-chormium-metadata}
 78 | # read metadata
 79 | url <- "https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-023-43458-x/MediaObjects/41467_2023_43458_MOESM4_ESM.xlsx"
 80 | temp_file <- tempfile(fileext = ".xlsx")
 81 | GET(url, write_disk(temp_file, overwrite = TRUE))
 82 | 
 83 | chrom_metadata <- read_excel(temp_file, sheet = 1) %>% as.data.frame()
 84 | rownames(chrom_metadata) <- chrom_metadata$Barcode
 85 | ```
 86 | 
 87 | Manually load Chromium from 10x 
 88 | ```{r load-chromium}
 89 | # read Chromium 
 90 | url <- "https://cf.10xgenomics.com/samples/cell-exp/7.0.1/Chromium_FFPE_Human_Breast_Cancer_Chromium_FFPE_Human_Breast_Cancer/Chromium_FFPE_Human_Breast_Cancer_Chromium_FFPE_Human_Breast_Cancer_count_sample_filtered_feature_bc_matrix.h5"
 91 | temp_file <- tempfile(fileext = ".h5")
 92 | GET(url, write_disk(temp_file, overwrite = TRUE))
 93 | 
 94 | chrom_counts <- Read10X_h5(temp_file)
 95 | chrom <- CreateSeuratObject(counts = chrom_counts, assay = "RNA", meta.data = chrom_metadata)
 96 | 
 97 | chrom$QC <- !is.na(chrom$Annotation)
 98 | chrom$is_hybrid <- grepl("Hybrid", chrom$Annotation, ignore.case = TRUE)
 99 | chrom <- subset(chrom, subset = QC == TRUE & is_hybrid == FALSE) # remove not annotated cells that cells that have sign of doublets
100 | 
101 | ```
102 | 
103 | #### Generate broader level annoation (optional, but **highly recommended** for more robust RCTD annotation)
104 | Providing higher-level cell type classes improves RCTD accuracy and **significantly** reduces the number of rejected cells, preserving more cells (rejects are excluded from the downstream analyses and do not undergo SPLIT purification).
105 | ```{r class-df}
106 | cell_type_to_class <- c(
107 |   "B_Cells" = "B cell",
108 |   "CD4+_T_Cells" = "T cell",
109 |   "CD8+_T_Cells" = "T cell",
110 |   "IRF7+_DCs" = "Myeloid",
111 |   "LAMP3+_DCs" = "Myeloid",
112 |   "Macrophages_1" = "Myeloid",
113 |   "Macrophages_2" = "Myeloid",
114 |   "Mast_Cells" = "Myeloid",
115 |   "DCIS 1" = "Epithelial",
116 |   "DCIS 2" = "Epithelial",
117 |   "Invasive_Tumor" = "Epithelial",
118 |   "Prolif_Invasive_Tumor" = "Epithelial",
119 |   "Myoepi_ACTA2+" = "Myoepithelial",
120 |   "Myoepi_KRT15+" = "Myoepithelial",
121 |   "Stromal" = "Stromal",
122 |   "Perivascular-Like" = "Stromal",
123 |   "Endothelial" = "Endothelial"
124 | )
125 | 
126 | class_df <- data.frame(class = cell_type_to_class)
127 | 
128 | # and define colors for reproducibility 
129 | library(RColorBrewer)
130 | 
131 | cell_types <- unique(chrom$Annotation)
132 | colors <- brewer.pal(n = max(3, min(length(cell_types), 12)), name = "Set3")
133 | # Recycle colors if not enough
134 | colors <- rep(colors, length.out = length(cell_types))
135 | pal <- setNames(colors, cell_types)
136 | ```
137 | 
138 | 
139 | ### Load Xenium Dataset
140 | 
141 | We load the Xenium spatial transcriptomics data using the `STexampleData` package, which provides convenient access to example spatial datasets for analysis and demonstration.
142 | 
143 | ```{r load-xenium}
144 | if(!requireNamespace("STexampleData", quietly = TRUE))
145 |   remotes::install_github("lmweber/STexampleData")
146 | xe_full_seu <- STexampleData::Janesick_breastCancer_Xenium_rep1() 
147 | 
148 | ## Convert to Seurat to stay consistent with chromium object
149 | sp_coords <- spatialCoords(xe_full_seu)
150 | colnames(sp_coords) <- c("ST_1", "ST_2")
151 | 
152 | xe_full <- CreateSeuratObject(
153 |   counts = counts(xe_full_seu),
154 |   assay = "Xenium",
155 |   meta.data = as.data.frame(colData(xe_full_seu))
156 | )
157 | 
158 | xe_full[["spatial"]] <- CreateDimReducObject(sp_coords, assay = "Xenium", key = "ST_")
159 | 
160 | xe_full$x <- sp_coords[,1]
161 | xe_full$y <- sp_coords[,2]
162 | rm(xe_full_seu)
163 | ```
164 | 
165 | #### Downsample Xenium Dataset for Faster RCTD (Optional)
166 | 
167 | Running RCTD can be time-consuming, especially on large datasets like Xenium. To speed up computation during this tutorial, we optionally provide code to downsample the dataset. We recommend spatial cropping (rather than random sampling) to preserve neighborhood structure, which is important for downstream analysis.
168 | 
169 | That said, downsampling is **not required **, we provide precomputed RCTD results below so you can skip running RCTD altogether if desired.  
170 | 
171 | ```{r downsampling}
172 | DO_subset_xe <- FALSE 
173 | X_lim <- c(6000, Inf) # cropping area 
174 | Y_lim <- c(4000, Inf) # cropping area 
175 | 
176 | if(DO_subset_xe){
177 |   xe <- subset(xe_full, subset = x > min(X_lim) & x < max(X_lim) & y > min(Y_lim) & y < max(Y_lim))
178 | } else {
179 |   xe <- xe_full
180 | }
181 | ```
182 | 
183 | # RCDT annotation
184 | Run RCTD Annotation on Xenium
185 | Running RCTD on large datasets can be computationally intensive and may take several hours. To streamline the workflow, we provide the full code for reproducibility. However, we recommend loading a pre-computed RCTD object by setting `DO_run_RCTD <- FALSE`.
186 | ```{r rctd}
187 | DO_run_RCTC <- FALSE # FALSE to load pre-computed results
188 | 
189 | common_genes <- intersect(rownames(xe), rownames(chrom))
190 | ref_labels <- chrom$Annotation %>% as.factor()
191 | 
192 | ref.obj <- Reference(GetAssayData(chrom, "RNA", "counts")[common_genes, ],
193 |                      cell_types = ref_labels, min_UMI = 10, require_int = TRUE)
194 | 
195 | test.obj <- SpatialRNA(coords = xe@reductions$spatial@cell.embeddings %>% as.data.frame(),
196 |                        counts = GetAssayData(xe, assay = "Xenium", layer = "counts")[common_genes, ],
197 |                        require_int = TRUE)
198 | 
199 | if(!exists("class_df")) 
200 |   class_df <- NULL
201 | 
202 | RCTD <- create.RCTD(
203 |   test.obj, 
204 |   ref.obj, 
205 |   UMI_min = 10, 
206 |   counts_MIN = 10, 
207 |   UMI_min_sigma = 100,
208 |   max_cores = BiocParallel::multicoreWorkers() - 1, 
209 |   CELL_MIN_INSTANCE = 25,
210 |   class_df = class_df # highly recommended 
211 |   )
212 | 
213 | if(DO_run_RCTC){
214 |   RCTD <- run.RCTD(RCTD, doublet_mode = "doublet")
215 |   saveRDS(RCTD, "~/precomp_rctd_class_aware.rds")
216 | } else {
217 |   message("reading precomp RCTD results")
218 |   
219 |   # Install googledrive if you haven't already
220 |   if (!requireNamespace("googledrive", quietly = TRUE)) {
221 |     install.packages("googledrive")
222 |   }
223 |   library(googledrive)
224 |   drive_deauth()
225 |   # Define the file ID from the Google Drive link
226 |   file_id <- "1pTUKq49JbUFwVk7vttjZIFqkx-AKznRF" #"1DCalFIZJywOvrSGBSPqrHh9QINeQp8aq"
227 |   local_path <- tempfile(fileext = ".rds")
228 |   drive_download(as_id(file_id), path = local_path, overwrite = TRUE)
229 |   RCTD <- readRDS(local_path)
230 |   
231 | }
232 | 
233 | ```
234 | 
235 | Visualize RCTD Annotation
236 | Post-process RCDT output and add reselts into Xenium object
237 | ```{r post-rctd}
238 | RCTD <- SPLIT::run_post_process_RCTD(RCTD)
239 | xe <- AddMetaData(xe, RCTD@results$results_df)
240 | xe <- subset(xe, subset = nCount_Xenium >= 10)
241 | 
242 | cat("Proprtion of spot classes")
243 | (xe$spot_class %>% table())/ncol(xe)*100 
244 | ```
245 | 
246 | ```{r plot-raw-xenium, fig.width=16, message=FALSE}
247 | 
248 | xe <- xe %>% SCTransform(assay = "Xenium", verbose = FALSE) %>% RunPCA(verbose = FALSE) %>% RunUMAP(dims = 1:50, verbose = FALSE)
249 | p1 <- UMAPPlot(xe, group.by = "first_type", label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none")
250 | p2 <- UMAPPlot(xe, group.by = "second_type", cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
251 | p3 <- UMAPPlot(xe, group.by = "spot_class") + theme_void() + theme(aspect.ratio = 1, legend.position = "right")
252 |  
253 | p1 | p2 | p3
254 | 
255 | ```
256 | Spatial Visualization
257 | ```{r spatial-plot, fig.width=12, warning=FALSE}
258 | DimPlot(xe, reduction = "spatial", group.by = "first_type", raster = TRUE, cols = pal) + coord_fixed()
259 | ```
260 | 
261 | # Purification
262 | ## SPLIT (default)
263 | This section runs the default SPLIT purification and visualizes purified data.
264 | 
265 | ```{r SPLIT, message=FALSE}
266 | # Run SPLIT purification
267 | res_split <- SPLIT::purify(
268 |   counts = GetAssayData(xe, assay = 'Xenium', layer = 'counts'), # or any gene x cells counts matrix
269 |   rctd = RCTD,
270 |   DO_purify_singlets = TRUE # Optional. If TRUE, singlets with an available secondary type are purified the same way as doublets_certain; otherwise, left unchanged.
271 |   )
272 | 
273 | # Create a purified Seurat object
274 | xe_purified <- CreateSeuratObject(
275 |   counts = res_split$purified_counts,
276 |   meta.data = res_split$cell_meta,
277 |   assay = "Xenium"
278 | )
279 | 
280 | # Optional: Filter, normalize and visualize
281 | xe_purified <- subset(xe_purified, subset = nCount_Xenium > 5)
282 | xe_purified <- xe_purified %>%
283 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
284 |   RunPCA(verbose = FALSE) %>%
285 |   RunUMAP(dims = 1:20, verbose = FALSE)
286 | 
287 | #UMAPPlot(xe_purified, group.by = c("first_type"), label = T, repel = T) + theme_void() + theme(aspect.ratio = 1, legend.position = "none")
288 | ```
289 | ### Visually compare results of Raw and SPLIT-Purified data
290 | ```{r plot-raw-split-purified, fig.width=12, message=FALSE}
291 | p1 <- UMAPPlot(xe, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("Raw Xenium data")
292 | 
293 | p2 <- UMAPPlot(xe_purified, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "right") + ggtitle("SPLIT-purified Xenium data")
294 | 
295 | p3 <- UMAPPlot(xe_purified, group.by = c("spot_class")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-purified Xenium data colored by spot class")
296 | p4 <- UMAPPlot(xe_purified, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-purified Xenium data colored by purification status")
297 | 
298 | 
299 | (p1|p2) 
300 | (p3|p4)
301 | 
302 | ```
303 | 
304 | ## Spatially-aware SPLIT
305 | SPLIT can leverage spatial information to assess the abundance of secondary signals in the local neighborhood (i.e., local diffusion potential), enabling selective decomposition only when contamination is likely. This spatially informed strategy helps prevent overcorrection of phenotypes that may be underrepresented or absent in the reference. 
306 | Specifically, we first compute the spatial neighborhood for each cell, then identify and purify cells that have sign of local diffusion of the secondary cell type.
307 | ```{r spatial-nw}
308 | sp_nw <- SPLIT::build_spatial_network(
309 |   xe, 
310 |   reduction = "spatial",
311 |   dims = 1:2, 
312 |   DO_prune = TRUE, 
313 |   rad_pruning = 15, # remove connections further than 15um
314 |   k_knn = 20
315 |   )
316 | 
317 | sp_nw <- SPLIT::add_spatial_metric(spatial_neighborhood = sp_nw, rctd = RCTD)
318 | sp_neigh_df <- SPLIT::neighborhood_analysis_to_metadata(sp_nw)
319 | 
320 | xe <- AddMetaData(xe, sp_neigh_df)
321 | 
322 | rm(sp_nw, sp_neigh_df)
323 | ```
324 | 
325 | ### Visualize local diffusion of secondary cell type 
326 | The score `neighborhood_weights_second_type` corresponds to the average weight of the secondary cell type in cell's spatial neighborhood.
327 | ```{r plot-neigh-weight-second-type, message=FALSE}
328 | # Plot magnitude of local diffusion on UMAP
329 | FeaturePlot(xe, features = c("neighborhood_weights_second_type")) + theme_void() + theme(aspect.ratio = 1)
330 | 
331 | # Plot distribution of local diffusion value
332 | hist(xe$neighborhood_weights_second_type)
333 | 
334 | # Plot distribution of local diffusion value per `spot_class`
335 | xe@meta.data %>% filter(!is.na(spot_class)) %>% 
336 |   ggplot(aes(x = spot_class, y = neighborhood_weights_second_type, color = spot_class)) + geom_boxplot() + labs(title = "Local neighbohood diffusion by spot class") + theme_minimal() 
337 | ``` 
338 | 
339 | 
340 | We now purify cells that have secondary signal in their spatial neighborhood (e.g., `neighborhood_weights_second_type`) and keep other cells unchanged
341 | ```{r spatially-aware-split, fig.width=12, message=FALSE}
342 | xe_purified_balanced_score <- SPLIT::balance_raw_and_purified_data_by_score(
343 |   xe_raw = xe,
344 |   xe_purified = xe_purified,
345 |   default_assay = "Xenium", # should be param, but can wait 
346 |   spot_class_key = "spot_class",
347 |   threshold = 0.05, # lower -> more cells will be purified
348 |   score_name = "neighborhood_weights_second_type"
349 | )
350 | 
351 | # Optional: Filter, normalize and visualize
352 | xe_purified_balanced_score <- subset(xe_purified_balanced_score, subset = nCount_Xenium > 5)
353 | xe_purified_balanced_score <- xe_purified_balanced_score %>%
354 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
355 |   RunPCA(verbose = FALSE) %>%
356 |   RunUMAP(dims = 1:20, verbose = FALSE)
357 | 
358 | p5 <- UMAPPlot(xe_purified_balanced_score, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("Spatially-aware SPLIT-purified Xenium data")
359 | p6 <- UMAPPlot(xe_purified_balanced_score, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
360 | 
361 | p5|p6
362 | ```
363 | 
364 | ## SPLIT-shift
365 | In some cases, the contamination signal is so strong that RCTD assigns the cell to its secondary cell type. To address this, we introduce SPLIT-shift—an approach that refines phenotype assignments by swapping the primary and secondary cell type labels based on transcriptional neighborhood homogeneity.
366 | 
367 | For this, we need to compute transcriptomics neighborhood 
368 | ```{r transcriptomics-nw}
369 | tr_nw <- build_transcriptomics_network(
370 |   xe,
371 |   DO_prune = FALSE,
372 |   k_knn = 100
373 | )
374 | tr_nw <- add_transcriptomics_metric(transcriptomics_neighborhood = tr_nw, rctd = RCTD) 
375 | tr_neigh_df <- neighborhood_analysis_to_metadata(tr_nw)
376 | xe <- AddMetaData(xe, tr_neigh_df)
377 | 
378 | rm(tr_nw, tr_neigh_df)
379 | ```
380 | 
381 | And then, we set `DO_swap_lables = TRUE` to allow SPLIT-shift
382 | ```{r run-split-shift}
383 | xe_split_shift <- SPLIT::balance_raw_and_purified_data_by_score(
384 |   xe_raw = xe,
385 |   xe_purified = xe_purified,
386 |   default_assay = "Xenium",
387 |   spot_class_key = "spot_class",
388 |   threshold = 0.05, # to be consistent with spatially-aware SPLIT results
389 |   score_name = "neighborhood_weights_second_type",
390 |   DO_swap_lables = TRUE
391 | )
392 | 
393 | # Optional: Filter, normalize and visualize
394 | xe_split_shift <- subset(xe_split_shift, subset = nCount_Xenium > 5)
395 | xe_split_shift <- xe_split_shift %>%
396 |   SCTransform(assay = "Xenium", verbose = FALSE) %>%
397 |   RunPCA(verbose = FALSE) %>%
398 |   RunUMAP(dims = 1:20, verbose = FALSE)
399 | ```
400 | 
401 | ### Visualize SPLIT-shift
402 | ```{r vis-split-shift, fig.width=16, message=FALSE}
403 | p7 <- UMAPPlot(xe_split_shift, group.by = c("first_type"), label = T, repel = T, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "none") + ggtitle("SPLIT-shift-purified Xenium data")
404 | p8 <- UMAPPlot(xe_split_shift, group.by = c("purification_status")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
405 | p9 <- UMAPPlot(xe_split_shift, group.by = c("swap")) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom")
406 | 
407 | p7|p8|p9
408 | ```
409 | 
410 | ```{r vis-split-shift-swap-facet, fig.width=12, message=FALSE}
411 | # Visualize results faceted by swapping status
412 | p10 <- UMAPPlot(xe_split_shift, group.by = c("first_type"), split.by = "swap", raster = F, cols = pal) + theme_void() + theme(aspect.ratio = 1, legend.position = "bottom") + ggtitle("SPLIT-shift-purified Xenium data faceted by lable swapping status")
413 | p10
414 | ```
415 | 
416 | # Summary
417 | 
418 | To run SPLIT, you need a single-cell reference with reliable cell type labels, which is used to annotate Xenium data using RCTD in doublet mode.  
419 | We strongly recommend providing a broader-level mapping of the reference cell types to higher-level classes. This helps RCTD produce more robust results and reduces the number of rejected cells, which are excluded from downstream analysis.
420 | 
421 | After annotation, SPLIT can be run in several **combinable** modes to purify the data:
422 | 
423 | 1. **Default SPLIT** purifies all `doublets_certain`, all `doublets_uncertain`, and—if `DO_purify_singlets = TRUE` — singlets that show signs of a secondary cell type.  
424 |    Rejected cells are always removed as unreliable.
425 | 
426 | 2. **Spatially-aware SPLIT** purifies any cells that show signs of contamination based on local spatial diffusion — i.e., having secondary signal in their spatial neighborhood.
427 | 
428 | 3. **SPLIT-shift** allows swapping the primary and secondary cell type labels based on transcriptional neighborhood homogeneity.
429 | 
430 | ```{r summary-plot, fig.width=18, fig.height=7, message=FALSE}
431 | p1 | p2+theme(legend.position = "bottom") | p5 | p7
432 | ```
433 | 
434 | # Bonus: Pie visualization
435 | 
436 | We can visualize individual cells as pie charts, where colors represent the primary and secondary cell types, and sector areas are proportional to their respective weights. 
437 | 
438 | In this example, we display a randomly selected T cell that exhibits strong local diffusion of its secondary cell type (ie., there is high signal of secondary cell type in its spatial neighborhood). The plot includes this cell and all neighboring cells within a 50μm radius.
439 | 
440 | ```{r, fig.width=12}
441 | pie_df <- get_pieplot_df(rctd = RCTD)
442 | cois <- xe@meta.data %>% filter(neighborhood_weights_second_type > .2, first_type %in% c("CD4+_T_Cells", "CD8+_T_Cells"), spot_class != "reject") %>% rownames()
443 | 
444 | p <- plot_pie_around_cell(pie_df, cell_id = cois[1], radius = 50, cols = pal)
445 | plot(p)
446 | ```
447 | 
448 | The **border color** of each pie indicates the **primary cell type**. Additional markers within the pies reflect RCTD spot class (ie., annotation confidence):
449 | 
450 | - A **black dot** indicates a `doublet_uncertain` cell.  
451 | - A **colored dot** (matching the secondary cell type) indicates a `doublet_certain`.  
452 | - A **cross (×)** marks a `reject` cell.  
453 | - Pies with no internal symbol represent confidently assigned `singlet` cells.
454 | 


--------------------------------------------------------------------------------
/vignettes/plots/SPLIT_schema.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bdsc-tds/SPLIT/cd90fa7572b7a2ad6ac864e07659b8777a5d8ac0/vignettes/plots/SPLIT_schema.png


--------------------------------------------------------------------------------