├── .Rbuildignore
├── .gitignore
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── NEWS.md
├── R
    ├── find_connected_components.R
    ├── find_linear_dependent_columns.R
    ├── make_full_rank_matrix.R
    └── validate_column_names.R
├── README.Rmd
├── README.md
├── cran-comments.md
├── inst
    ├── CITATION
    └── WORDLIST
├── man
    ├── figures
    │   ├── example_vectors.png
    │   └── fullRankMatrix.png
    ├── find_connected_components.Rd
    ├── find_linear_dependent_columns.Rd
    ├── make_full_rank_matrix.Rd
    └── validate_column_names.Rd
├── tests
    ├── spelling.R
    ├── testthat.R
    └── testthat
    │   ├── test-find_connected_components.R
    │   ├── test-find_linear_dependent_columns.R
    │   └── test-make_full_rank_matrix.R
└── vignettes
    ├── .gitignore
    ├── fullrankmat-comparison.Rmd
    ├── fullrankmat-example.Rmd
    └── man
        └── figures
            ├── example_vectors.png
            └── fullRankMatrix.png


/.Rbuildignore:
--------------------------------------------------------------------------------
 1 | ^renv$
 2 | ^renv\.lock$
 3 | ^fullRankMatrix\.Rproj$
 4 | ^\.Rproj\.user$
 5 | ^LICENSE\.md$
 6 | ^README\.Rmd$
 7 | ^SPACES$
 8 | ^cran-comments\.md$
 9 | ^CRAN-SUBMISSION$
10 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | .Rproj.user
 2 | *.Rproj
 3 | .Rprofile
 4 | .Rhistory
 5 | renv
 6 | renv.lock
 7 | .Rdata
 8 | .httr-oauth
 9 | .DS_Store
10 | inst/doc
11 | 


--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
 1 | Package: fullRankMatrix
 2 | Title: Generation of Full Rank Design Matrix
 3 | Version: 0.1.0.9000
 4 | Authors@R: 
 5 |     c(
 6 |     person("Paula", "Weidemueller", , "paulahw3214@gmail.com", role = c("aut", "cre", "cph"), comment = c(ORCID = "0000-0003-3867-3131", Twitter = "@PaulaH_W")),
 7 |     person("Constantin", "Ahlmann-Eltze", , "artjom31415@googlemail.com", role = c("aut"), comment = c(ORCID = "0000-0002-3762-068X", Twitter = "@const_ae"))
 8 |     )
 9 | Description: Creates a full rank matrix out of a given matrix.
10 |   The intended use is for one-hot encoded design matrices that should be used in linear models to ensure that significant associations can be correctly interpreted. However, 'fullRankMatrix' can be applied to any matrix to make it full rank.
11 |   It removes columns with only 0's, merges duplicated columns and discovers linearly dependent columns and replaces them with linearly independent columns that span the space of the original columns. Columns are renamed to reflect those modifications.
12 |   This results in a full rank matrix that can be used as a design matrix in linear models. The algorithm and some functions are inspired by Kuhn, M. (2008) <doi:10.18637/jss.v028.i05>.
13 | License: MIT + file LICENSE
14 | Encoding: UTF-8
15 | Roxygen: list(markdown = TRUE)
16 | RoxygenNote: 7.3.1
17 | Suggests: 
18 |     knitr,
19 |     rmarkdown,
20 |     igraph,
21 |     testthat (>= 3.0.0),
22 |     WeightIt,
23 |     caret,
24 |     plm,
25 |     spelling
26 | Config/testthat/edition: 3
27 | VignetteBuilder: knitr
28 | URL: https://github.com/Pweidemueller/fullRankMatrix
29 | BugReports: https://github.com/Pweidemueller/fullRankMatrix/issues
30 | Language: en-US
31 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2024
2 | COPYRIGHT HOLDER: fullRankMatrix authors
3 | 


--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
 1 | # MIT License
 2 | 
 3 | Copyright (c) 2024 fullRankMatrix authors
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 | 
3 | export(find_connected_components)
4 | export(find_linear_dependent_columns)
5 | export(make_full_rank_matrix)
6 | export(validate_column_names)
7 | importFrom(stats,lm.fit)
8 | 


--------------------------------------------------------------------------------
/NEWS.md:
--------------------------------------------------------------------------------
1 | # fullRankMatrix (development version)
2 | 
3 | * fix typos
4 | 
5 | # fullRankMatrix 0.1.0
6 | 
7 | * Initial CRAN submission.
8 | 


--------------------------------------------------------------------------------
/R/find_connected_components.R:
--------------------------------------------------------------------------------
 1 | #' Find connected components in a graph
 2 | #'
 3 | #' The function performs a depth-first search to find all connected components.
 4 | #'
 5 | #' @param connections a list where each element is a vector with connected nodes.
 6 | #'   Each node must be either a character or an integer.
 7 | #'
 8 | #' @return a list where each element is a set of connected items.
 9 | #' @export
10 | #' @examples
11 | #'   find_connected_components(list(c(1,2), c(1,3), c(4,5)))
12 | #'
13 | #'
14 | find_connected_components <- function(connections){
15 |   stopifnot(is.list(connections))
16 |   is_char <- vapply(connections, \(con) is.character(con), FUN.VALUE = logical(1L))
17 |   is_int <- vapply(connections, \(con) is.integer(con) || (is.numeric(con) && all(con < 2^31) && all(con %% 1 == 0)), FUN.VALUE = logical(1L))
18 |   if(all(is_char)){
19 |     # Do nothing
20 |   }else if(all(is_int)){
21 |     connections <- lapply(connections, \(con) as.character(as.integer(con)))
22 |   }else{
23 |     stop("Elements in 'connections' must be either characters or integers.")
24 |   }
25 |   nodes <- unique(unlist(connections, use.names = FALSE))
26 | 
27 |   # Keep track of which nodes have been visited
28 |   visited <- new.env(parent = emptyenv(), size = length(nodes))
29 |   for(n in nodes){
30 |     visited[[n]] <- FALSE
31 |   }
32 | 
33 |   # Efficient access to neighbors of each node
34 |   connection_graph <- new.env(parent = emptyenv(), size = length(nodes))
35 |   for(con in connections){
36 |     for(n in con){
37 |       connection_graph[[n]] <- union(connection_graph[[n]], con)
38 |     }
39 |   }
40 | 
41 |   # Depth-first search
42 |   dfs <- function(graph, head){
43 |     if(is.null(head)){
44 |       return(character(0L))
45 |     }
46 |     queue <- NULL
47 |     visited[[head]] <- TRUE
48 |     for(n in connection_graph[[head]]){
49 |       if(! visited[[n]]){
50 |         queue <- list(head = n, tail = queue)
51 |       }
52 |     }
53 | 
54 |     res <- character(length(nodes))
55 |     res[1] <- head
56 |     counter <- 2
57 |     while(! is.null(queue)){
58 |       head <- queue$head
59 |       queue <- queue$tail
60 |       if(visited[[head]]){
61 |       }else{
62 |         res[counter] <- head
63 |         counter <- counter + 1
64 |         visited[[head]] <- TRUE
65 |         for(n in connection_graph[[head]]){
66 |           if(! visited[[n]]){
67 |             queue <- list(head = n, tail = queue)
68 |           }
69 |         }
70 |       }
71 |     }
72 |     sort(res[seq_len(counter-1)])
73 |   }
74 | 
75 |   result <- replicate(length(nodes), NULL)
76 |   counter <- 1
77 |   for(n in nodes){
78 |     if(! visited[[n]]){
79 |       result[[counter]] <- dfs(connection_graph, n)
80 |       counter <- counter + 1
81 |     }
82 |   }
83 |   if(all(is_char)){
84 |     result[seq_len(counter-1)]
85 |   }else{
86 |     lapply(result[seq_len(counter-1)], as.integer)
87 |   }
88 | }
89 | 
90 | 


--------------------------------------------------------------------------------
/R/find_linear_dependent_columns.R:
--------------------------------------------------------------------------------
 1 | 
 2 | #' Find linear dependent columns in a design matrix
 3 | #'
 4 | #' @importFrom stats lm.fit
 5 | #'
 6 | #' @param mat a matrix
 7 | #' @param tol a double that specifies the numeric tolerance
 8 | #'
 9 | #' @return a list with vectors containing the indices of linearly dependent columns
10 | #'
11 | #' @seealso
12 | #' The algorithm and function are inspired by the `internalEnumLC`
13 | #' function in the 'caret' package ([GitHub](https://github.com/topepo/caret/blob/679eabaac7e54f4e87efa6c3bff75659cb457d8b/pkg/caret/R/findLinearCombos.R#L33))
14 | #'
15 | #' @examples
16 | #'   mat <- matrix(rnorm(3 * 10), nrow = 10, ncol = 3)
17 | #'   mat <- cbind(mat, mat[,1] + 0.5 * mat[,3])
18 | #'   find_linear_dependent_columns(mat)  # returns list(c(1,3,4))
19 | #'
20 | #' @export
21 | find_linear_dependent_columns <- function(mat, tol = 1e-12){
22 |   stopifnot(is.matrix(mat))
23 |   stopifnot(is.numeric(tol), length(tol) == 1)
24 |   qr_mat <- qr(mat)
25 |   mat_rank <- qr_mat$rank
26 | 
27 |   if(mat_rank == ncol(mat)){
28 |     # The matrix is full rank, so return immediately
29 |     list()
30 |   }else{
31 |     # The QR decomposition arranges the values such that the first `#rank` columns are independent
32 |     independent_columns <- seq_len(mat_rank)
33 |     dependent_columns <- mat_rank + seq_len(ncol(mat) - mat_rank)
34 |     # Solving `Bad_Matrix = Good_Matrix %*% coef`
35 |     # Any place where `coef` is non-zero, we have some linear dependency
36 |     coef <- lm.fit(qr.R(qr_mat)[independent_columns,independent_columns,drop=FALSE], qr.R(qr_mat)[independent_columns,dependent_columns,drop=FALSE])$coef
37 |     coef <- matrix(coef, ncol = length(dependent_columns))
38 |     # The pivot relates columns in the QR decomposition object to the columns in the original matrix
39 |     # The `sort` makes sure that the result looks good
40 |     lindep_sets <- lapply(seq_along(dependent_columns), function(i) sort(c(qr_mat$pivot[mat_rank + i], qr_mat$pivot[which(abs(coef[,i]) > tol)])))
41 | 
42 |     connected_lindep_components <- find_connected_components(lindep_sets)
43 |     connected_lindep_components
44 |   }
45 | }
46 | 
47 | 
48 | 


--------------------------------------------------------------------------------
/R/make_full_rank_matrix.R:
--------------------------------------------------------------------------------
  1 | #' Create a full rank matrix
  2 | #'
  3 | #' First remove empty columns and merge duplicated columns. Then discover linearly dependent columns. For each set of linearly dependent columns, create orthogonal vectors that span the space. Add these vectors as columns to the final matrix to replace the linearly dependent columns.
  4 | #'
  5 | #' @param mat A matrix.
  6 | #' @param verbose Print how column numbers change with each operation.
  7 | #'
  8 | #' @return a list containing:
  9 | #'    * `matrix`: A matrix of full rank. Column headers will be renamed to reflect how columns depend on each other.
 10 | #'        * `(c1_AND_c2)` If multiple columns are exactly identical, only a single instance is retained.
 11 | #'        * `SPACE_<i>_AXIS<j>` For each set of linearly dependent columns, a space `i` with `max(j)` dimensions was created using orthogonal axes to replace the original columns.
 12 | #'    * `space_list`: A named list where each element corresponds to a space and contains the names of the original linearly dependent columns that are contained within that space.
 13 | #'
 14 | #' @export
 15 | #'
 16 | #' @examples
 17 | #' # Create a 1-hot encoded (zero/one) matrix
 18 | #' c1 <- rbinom(10, 1, .4)
 19 | #' c2 <- 1-c1
 20 | #' c3 <- integer(10)
 21 | #' c4 <- c1
 22 | #' c5 <- 2*c2
 23 | #' c6 <- rbinom(10, 1, .8)
 24 | #' c7 <- c5+c6
 25 | #' # Turn into matrix
 26 | #' mat <- cbind(c1, c2, c3, c4, c5, c6, c7)
 27 | #' # Turn the matrix into full rank, this will:
 28 | #' # 1. remove empty columns (all zero)
 29 | #' # 2. merge columns with the same entries (duplicates)
 30 | #' # 3. identify linearly dependent columns
 31 | #' # 4. replace them with orthogonal vectors that span the same space
 32 | #' result <- make_full_rank_matrix(mat)
 33 | #' # verbose=TRUE will give details on how many columns are removed in every step
 34 | #' result <- make_full_rank_matrix(mat, verbose=TRUE)
 35 | #' # look at the created full rank matrix
 36 | #' mat_full <- result$matrix
 37 | #' # check which linearly dependent columns spanned the identified spaces
 38 | #' spaces <- result$space_list
 39 | 
 40 | make_full_rank_matrix <- function(mat, verbose=FALSE){
 41 | 
 42 |   if (!is.matrix(mat)) {
 43 |     stop("The input is not a matrix.")
 44 |   }
 45 |   if (any(is.na(mat))) {
 46 |     stop("Error: The matrix contains NA values.")
 47 |   }
 48 | 
 49 |   validate_column_names(colnames(mat))
 50 |   if (verbose){
 51 |     message(sprintf("The original matrix contains %i rows and %i columns. The matrix has rank %i.",
 52 |                   dim(mat)[1], dim(mat)[2], qr(mat)$rank))
 53 |   }
 54 |   mat_mod <- remove_empty_columns(mat, verbose=verbose)
 55 |   mat_mod <- merge_duplicated(mat_mod, verbose=verbose)
 56 |   result <- collapse_linearly_dependent_columns(mat_mod, verbose=verbose)
 57 |   mat_mod <- result$matrix
 58 | 
 59 |   if (ncol(mat_mod) > qr(mat)$rank){
 60 |     stop("The modified matrix still has more columns than implied by rank. Check manually why modified matrix is not full rank after applying make_full_rank_matrix().")
 61 |   }
 62 |   if (verbose){
 63 |     message("The matrix is now full rank.")
 64 |   }
 65 |   return(result)
 66 | }
 67 | 
 68 | find_empty_columns <- function(mat, tol = 1e-12, return_names=FALSE){
 69 |   empty_col <- apply(mat, MARGIN = 2, FUN = function(x) {all(abs(x) < tol)})
 70 |   if (return_names){
 71 |     names(which(empty_col))
 72 |   }else{
 73 |     empty_col
 74 |   }
 75 | }
 76 | 
 77 | remove_empty_columns <- function(mat, tol = 1e-12, verbose=FALSE) {
 78 |   empty_col <- find_empty_columns(mat, tol)
 79 |   if (any(empty_col)){
 80 |     mat <- mat[, !empty_col, drop=FALSE]
 81 |   }
 82 |   if (verbose){
 83 |     message(sprintf("%i empty columns were removed. After removing empty columns the matrix contains %i columns.",
 84 |                   sum(empty_col, na.rm = TRUE), ncol(mat)))
 85 | 
 86 |   }
 87 |   return(mat)
 88 | }
 89 | 
 90 | find_duplicated_columns <- function(mat, verbose=FALSE) {
 91 |   stopifnot(is.matrix(mat))
 92 |   mat_duplicated <- mat[, duplicated(mat, MARGIN = 2), drop=FALSE]
 93 |   colnames(mat_duplicated)
 94 | }
 95 | 
 96 | merge_duplicated <- function(mat, tol = 1e-12, verbose=FALSE) {
 97 |   stopifnot(is.matrix(mat))
 98 |   if (any(is.na(mat))) {
 99 |     stop("Error: The matrix contains NA values.")
100 |   }
101 | 
102 |   mat_unique <- unique(mat, MARGIN = 2)
103 |   colnames_unique <- colnames(mat_unique)
104 |   colnames_duplicated <- find_duplicated_columns(mat)
105 |   if (length(colnames_duplicated) > 0){
106 |     for (c in seq_along(colnames_duplicated)){
107 |       keep <- which(duplicated(cbind(mat[, colnames_duplicated[c], drop=FALSE], mat_unique), MARGIN = 2))-1
108 |       if (length(keep) > 1){
109 |         stop("More than one matching column detected. Something is wrong with this algorithm.")
110 |       }
111 |       if (grepl("_AND_", colnames_unique[keep], fixed=TRUE)){
112 |         colnames_unique[keep] <- gsub('[()]', '', colnames_unique[keep])
113 |       }
114 |       colnames_unique[keep] <- paste0("(", colnames_unique[keep], "_AND_", colnames_duplicated[c], ")")
115 |     }
116 |     colnames(mat_unique) <- colnames_unique
117 |     mat <- mat_unique
118 |   }
119 |   if (verbose){
120 |     message(sprintf("%i duplicated columns were detected. After merging duplicated columns the matrix contains %i columns.",
121 |                   length(colnames_duplicated), ncol(mat)))
122 |   }
123 |   return(mat)
124 | }
125 | 
126 | collapse_linearly_dependent_columns <- function(mat, tol = 1e-12, verbose = FALSE){
127 |   stopifnot(is.matrix(mat))
128 |   if (any(is.na(mat))) {
129 |     stop("Error: The matrix contains NA values.")
130 |   }
131 |   validate_column_names(colnames(mat))
132 | 
133 |   linear_dependencies <- find_linear_dependent_columns(mat, tol = tol)
134 | 
135 |   space_counter <- 1
136 |   space_list <- list()
137 | 
138 |   while(length(linear_dependencies) > 0){
139 |     dependent_set <- linear_dependencies[[1]]
140 |     dependent_columns <- mat[,dependent_set,drop=FALSE]
141 |     mat <- mat[,-dependent_set,drop=FALSE]
142 |     qr_space <- qr(dependent_columns)
143 |     rank_of_set <- qr_space$rank
144 |     new_space <- qr.Q(qr_space)[,seq_len(rank_of_set),drop=FALSE]
145 | 
146 |     # Handle names
147 |     # if many linear dependencies exist, adding the original column names to the new column names might make them prohibitively long
148 |     # instead label each new space by a number and record which original columns it was composed of in `space_list`
149 |     space_name <- paste0("SPACE_", space_counter)
150 |     space_list[[space_name]] <- colnames(dependent_columns)
151 | 
152 |     new_names <- paste0("SPACE_", space_counter, "_AXIS", seq_len(rank_of_set))
153 | 
154 |     colnames(new_space) <- new_names
155 |     mat <- cbind(mat, new_space)
156 | 
157 |     # Changing the matrix could introduce new dependencies
158 |     linear_dependencies <- find_linear_dependent_columns(mat, tol = tol)
159 |     space_counter <- space_counter + 1
160 |   }
161 |   if (verbose){
162 |     message(sprintf("The matrix after collapsing linearly dependent columns contains %i rows and %i columns.",
163 |                     nrow(mat), ncol(mat)))
164 |   }
165 |   list(matrix = mat, space_list = space_list)
166 | }
167 | 
168 | 


--------------------------------------------------------------------------------
/R/validate_column_names.R:
--------------------------------------------------------------------------------
 1 | #' Validate Column Names
 2 | #'
 3 | #' This function checks a vector of column names to ensure they are valid. It performs the following checks:
 4 | #' - The column names must not be `NULL`.
 5 | #' - The column names must not contain empty strings.
 6 | #' - The column names must not contain `NA` values.
 7 | #' - The column names must be unique.
 8 | #'
 9 | #' @param names A character vector of column names to validate.
10 | #'
11 | #' @return Called for its side effects. Returns `NULL` invisibly if all checks pass. If any check fails, the function stops with an error message.
12 | #' @export
13 | #'
14 | #' @examples
15 | #' validate_column_names(c("name", "age", "gender"))
16 | #'
17 | validate_column_names <- function(names){
18 |   if(is.null(names)){
19 |     stop("The column names must not be `NULL`.")
20 |   }
21 |   if(any(grepl("^$", names))){
22 |     stop("The empty string `\"\"` is not a valid column name. For example, name at index: ", which(grepl("^$", names))[1], " is empty.")
23 |   }
24 |   if(any(is.na(names))){
25 |     stop("None of the names must be `NA`.")
26 |   }
27 |   if(any(duplicated(names))){
28 |     stop("The column names must be unique.")
29 |   }
30 | }
31 | 
32 | 


--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
 1 | ---
 2 | output: github_document
 3 | ---
 4 | 
 5 | <!-- README.md is generated from README.Rmd. Please edit that file -->
 6 | 
 7 | ```{r, include = FALSE}
 8 | knitr::opts_chunk$set(
 9 |   collapse = TRUE,
10 |   comment = "#>",
11 |   fig.path = "man/figures/README-",
12 |   out.width = "100%"
13 | )
14 | ```
15 | 
16 | # fullRankMatrix - Generation of Full Rank Design Matrix
17 | 
18 | ```{r child = "vignettes/fullrankmat-example.Rmd"}
19 | 
20 | ```
21 | 
22 | ```{r child = "vignettes/fullrankmat-comparison.Rmd"}
23 | 
24 | ```
25 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | 
  2 | <!-- README.md is generated from README.Rmd. Please edit that file -->
  3 | 
  4 | # fullRankMatrix - Generation of Full Rank Design Matrix
  5 | 
  6 | <!-- badges: start -->
  7 | <!-- badges: end -->
  8 | 
  9 | <img src="man/figures/fullRankMatrix.png" width="20%" style="display: block; margin: auto auto auto 0;" />
 10 | 
 11 | We developed `fullRankMatrix` primarily for one-hot encoded design
 12 | matrices used in linear models. In our case, we were faced with a one-hot
 13 | encoded design matrix that had many linearly dependent columns.
 14 | This happened when modeling many interaction terms. Since fitting a
 15 | linear model on a design matrix with linearly dependent columns
 16 | produces results that are easy to misinterpret (see the example
 17 | below), we decided to develop a package that helps identify
 18 | linearly dependent columns and replaces them with columns constructed
 19 | of orthogonal vectors that span the space of the previously linearly
 20 | dependent columns.
 21 | 
 22 | The goal of `fullRankMatrix` is to remove empty columns (containing only
 23 | 0s), merge duplicated columns (containing the same entries) and merge
 24 | linearly dependent columns. These operations will create a matrix of
 25 | full rank. The changes made to the columns are reflected in the column
 26 | headers such that the columns can still be interpreted if the matrix is
 27 | used in e.g. a linear model fit.
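
The three operations described above can be sketched in base R. This is a minimal illustration under assumptions, not the package's implementation: the toy columns `c1` to `c5` are hypothetical, and exact duplicates are simply dropped here, whereas `make_full_rank_matrix()` merges their names.

```r
# Toy matrix (hypothetical data): c2 is empty, c3 duplicates c1, c5 = c1 + c4
c1 <- c(1, 0, 1, 0)
c2 <- c(0, 0, 0, 0)
c3 <- c1
c4 <- c(0, 1, 0, 1)
c5 <- c1 + c4
mat <- cbind(c1, c2, c3, c4, c5)

# Step 1: drop columns whose entries are all (numerically) zero
empty <- apply(mat, 2, function(x) all(abs(x) < 1e-12))
mat <- mat[, !empty, drop = FALSE]

# Step 2: drop exact duplicates of an earlier column
mat <- mat[, !duplicated(mat, MARGIN = 2), drop = FALSE]

# Step 3: a rank below the column count signals remaining linear dependence
rank_deficient <- qr(mat)$rank < ncol(mat)
```

After the first two steps the columns `c1`, `c4` and `c5` remain, and the rank check still flags the matrix as deficient because `c5` is the sum of the other two.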
 28 | 
 29 | ## Installation
 30 | 
 31 | You can install `fullRankMatrix` directly from CRAN. Just paste the
 32 | following snippet into your R console:
 33 | 
 34 | ``` r
 35 | install.packages("fullRankMatrix")
 36 | ```
 37 | 
 38 | You can install the development version of `fullRankMatrix` from
 39 | [GitHub](https://github.com/Pweidemueller/fullRankMatrix) with:
 40 | 
 41 | ``` r
 42 | # install.packages("devtools")
 43 | devtools::install_github("Pweidemueller/fullRankMatrix")
 44 | ```
 45 | 
 46 | ## Citation
 47 | 
 48 | If you want to cite this package in a publication, you can run the
 49 | following command in your R console:
 50 | 
 51 | ``` r
 52 | citation("fullRankMatrix")
 53 | #> To cite package 'fullRankMatrix' in publications use:
 54 | #> 
 55 | #>   Weidemüller P, Ahlmann-Eltze C (2024). _fullRankMatrix: Produce a
 56 | #>   Full Rank Matrix_. R package version 0.1.0,
 57 | #>   <https://github.com/Pweidemueller/fullRankMatrix>.
 58 | #> 
 59 | #> A BibTeX entry for LaTeX users is
 60 | #> 
 61 | #>   @Manual{,
 62 | #>     title = {fullRankMatrix: Produce a Full Rank Matrix},
 63 | #>     author = {Paula Weidemüller and Constantin Ahlmann-Eltze},
 64 | #>     year = {2024},
 65 | #>     note = {R package version 0.1.0},
 66 | #>     url = {https://github.com/Pweidemueller/fullRankMatrix},
 67 | #>   }
 68 | ```
 69 | 
 70 | ## Linearly dependent columns span a space of a certain dimension
 71 | 
 72 | To visualize this, let’s look at a very simple example. Say we
 73 | have a matrix with three columns, each with three entries. These columns
 74 | can be visualized as vectors in a coordinate system with 3 axes (see the
 75 | image). The first vector lies in the plane spanned by the first and
 76 | third axis. The second and third vectors lie in the plane spanned by the
 77 | first and second axis. Since this is a very simple example, we
 78 | immediately spot that the third column is a multiple of the second
 79 | column. Their corresponding vectors lie perfectly on top of each other.
 80 | This means instead of the two columns spanning a 2D space they just
 81 | occupy a line, i.e. a 1D space. This is identified by `fullRankMatrix`,
 82 | which replaces these two linearly dependent columns with one vector that
 83 | describes the 1D space in which column 2 and column 3 used to lie. The
 84 | resulting matrix is now full rank with no linearly dependent columns.
 85 | 
 86 | ``` r
 87 | library(fullRankMatrix)
 88 | ```
 89 | 
 90 | ``` r
 91 | c1 <- c(1, 0, 1)
 92 | c2 <- c(1, 2, 0)
 93 | c3 <- c(2, 4, 0)
 94 | 
 95 | mat <- cbind(c1, c2, c3)
 96 | 
 97 | make_full_rank_matrix(mat)
 98 | #> $matrix
 99 | #>      c1 SPACE_1_AXIS1
100 | #> [1,]  1    -0.4472136
101 | #> [2,]  0    -0.8944272
102 | #> [3,]  1     0.0000000
103 | #> 
104 | #> $space_list
105 | #> $space_list$SPACE_1
106 | #> [1] "c2" "c3"
107 | ```
108 | 
109 | ``` r
110 | knitr::include_graphics("man/figures/example_vectors.png")
111 | ```
112 | 
113 | <div class="figure">
114 | 
115 | <img src="man/figures/example_vectors.png" alt="Visualisation of identifying and replacing linearly dependent columns." width="100%" />
116 | <p class="caption">
117 | Visualisation of identifying and replacing linearly dependent columns.
118 | </p>
119 | 
120 | </div>
121 | 
122 | ## Worked through example
123 | 
124 | The example above was rather abstract but easy to visualize. Let’s
125 | now walk through the utilities of `fullRankMatrix` when applied to a
126 | more realistic design matrix.
127 | 
128 | When using linear models you should check if any of the columns in your
129 | design matrix are linearly dependent. If any are, this will alter the
130 | interpretation of the fit. Here is a rather constructed example where we
131 | are interested in identifying which ingredients contribute most to the
132 | sweetness of fruit salads.
133 | 
134 | ``` r
135 | # let's say we have 10 fruit salads and indicate which ingredients are present in each salad
136 | strawberry <- c(1,1,1,1,0,0,0,0,0,0)
137 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0)
138 | orange <- c(1,1,1,1,1,1,1,0,0,0)
139 | pear <- c(0,0,0,1,0,0,0,1,1,1)
140 | mint <- c(1,1,0,0,0,0,0,0,0,0)
141 | apple <- c(0,0,0,0,0,0,1,1,1,1)
142 | 
143 | # let's pretend we know how each fruit influences the sweetness of a fruit salad
144 | # in this case we say that strawberries and oranges have the biggest influence on sweetness
145 | set.seed(30)
146 | strawberry_sweet <- strawberry * rnorm(10, 4)
147 | poppyseed_sweet <- poppyseed * rnorm(10, 0.1)
148 | orange_sweet <- orange * rnorm(10, 5)
149 | pear_sweet <- pear * rnorm(10, 0.5)
150 | mint_sweet <- mint * rnorm(10, 1)
151 | apple_sweet <- apple * rnorm(10, 2)
152 | 
153 | sweetness <- strawberry_sweet + poppyseed_sweet+ orange_sweet + pear_sweet +
154 |   mint_sweet + apple_sweet 
155 | 
156 | mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple)
157 | 
158 | fit <- lm(sweetness ~ mat + 0)
159 | print(summary(fit))
160 | #> 
161 | #> Call:
162 | #> lm(formula = sweetness ~ mat + 0)
163 | #> 
164 | #> Residuals:
165 | #>        1        2        3        4        5        6        7        8 
166 | #> -2.00934  2.00934 -1.34248  1.34248  0.92807 -2.27054  1.34248 -0.01963 
167 | #>        9       10 
168 | #>  1.26385 -2.58670 
169 | #> 
170 | #> Coefficients: (1 not defined because of singularities)
171 | #>               Estimate Std. Error t value Pr(>|t|)   
172 | #> matstrawberry   8.9087     2.0267   4.396  0.00705 **
173 | #> matpoppyseed    6.5427     1.5544   4.209  0.00842 **
174 | #> matorange           NA         NA      NA       NA   
175 | #> matpear         1.2800     2.3056   0.555  0.60269   
176 | #> matmint         0.6582     2.6242   0.251  0.81193   
177 | #> matapple        1.2595     2.2526   0.559  0.60019   
178 | #> ---
179 | #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
180 | #> 
181 | #> Residual standard error: 2.357 on 5 degrees of freedom
182 | #> Multiple R-squared:  0.9504, Adjusted R-squared:  0.9007 
183 | #> F-statistic: 19.15 on 5 and 5 DF,  p-value: 0.002824
184 | ```
185 | 
186 | As you can see, `lm` reports “1 \[column\] not defined because of
187 | singularities” (`matorange` is not defined) but it doesn’t indicate which
188 | columns `matorange` is linearly dependent on.
189 | 
190 | So if you just looked at the coefficients and ignored the `NA`,
191 | you would conclude that `strawberry` and `poppyseed` are the
192 | biggest contributors to the sweetness of fruit salads.
193 | 
194 | However, when you look at the model matrix you can see that the `orange`
195 | column is a linear combination of the `strawberry` and `poppyseed`
196 | columns (or vice versa). So any of the three factors could
197 | contribute to the sweetness of a fruit salad; the linear model has no
198 | way of recovering which one given these 10 examples. And since we
199 | constructed this example, we know that `orange` and `strawberry` are the
200 | sweetest and `poppyseed` contributes least to the sweetness.
201 | 
202 | ``` r
203 | mat
204 | #>       strawberry poppyseed orange pear mint apple
205 | #>  [1,]          1         0      1    0    1     0
206 | #>  [2,]          1         0      1    0    1     0
207 | #>  [3,]          1         0      1    0    0     0
208 | #>  [4,]          1         0      1    1    0     0
209 | #>  [5,]          0         1      1    0    0     0
210 | #>  [6,]          0         1      1    0    0     0
211 | #>  [7,]          0         1      1    0    0     1
212 | #>  [8,]          0         0      0    1    0     1
213 | #>  [9,]          0         0      0    1    0     1
214 | #> [10,]          0         0      0    1    0     1
215 | ```
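
The dependency can also be checked numerically rather than by eye. The following base R sketch rebuilds the design matrix from above and compares its rank with its number of columns; the names `dependency_holds` and `mat_rank` are introduced here only for illustration.

```r
# Ingredient indicators, copied from the fruit salad example above
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed  <- c(0,0,0,0,1,1,1,0,0,0)
orange     <- c(1,1,1,1,1,1,1,0,0,0)
pear       <- c(0,0,0,1,0,0,0,1,1,1)
mint       <- c(1,1,0,0,0,0,0,0,0,0)
apple      <- c(0,0,0,0,0,0,1,1,1,1)
mat <- cbind(strawberry, poppyseed, orange, pear, mint, apple)

# orange is exactly the sum of strawberry and poppyseed
dependency_holds <- all(orange == strawberry + poppyseed)

# Six columns but a lower rank: one column carries no extra information
mat_rank <- qr(mat)$rank
```

A rank of 5 for a matrix with 6 columns matches the single coefficient that `lm` reported as not defined.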
216 | 
217 | To make such cases more obvious and to be able to still correctly
218 | interpret the linear model fit, we wrote `fullRankMatrix`. Its
219 | `make_full_rank_matrix()` function replaces linearly dependent columns
220 | and renames the resulting columns to make the dependencies clear.
221 | 
222 | ``` r
223 | library(fullRankMatrix)
224 | result <- make_full_rank_matrix(mat)
225 | mat_fr <- result$matrix
226 | space_list <- result$space_list
227 | mat_fr
228 | #>       pear mint apple SPACE_1_AXIS1 SPACE_1_AXIS2
229 | #>  [1,]    0    1     0          -0.5     0.0000000
230 | #>  [2,]    0    1     0          -0.5     0.0000000
231 | #>  [3,]    0    0     0          -0.5     0.0000000
232 | #>  [4,]    1    0     0          -0.5     0.0000000
233 | #>  [5,]    0    0     0           0.0    -0.5773503
234 | #>  [6,]    0    0     0           0.0    -0.5773503
235 | #>  [7,]    0    0     1           0.0    -0.5773503
236 | #>  [8,]    1    0     1           0.0     0.0000000
237 | #>  [9,]    1    0     1           0.0     0.0000000
238 | #> [10,]    1    0     1           0.0     0.0000000
239 | ```
240 | 
241 | ``` r
242 | fit <- lm(sweetness ~ mat_fr + 0)
243 | print(summary(fit))
244 | #> 
245 | #> Call:
246 | #> lm(formula = sweetness ~ mat_fr + 0)
247 | #> 
248 | #> Residuals:
249 | #>        1        2        3        4        5        6        7        8 
250 | #> -2.00934  2.00934 -1.34248  1.34248  0.92807 -2.27054  1.34248 -0.01963 
251 | #>        9       10 
252 | #>  1.26385 -2.58670 
253 | #> 
254 | #> Coefficients:
255 | #>                     Estimate Std. Error t value Pr(>|t|)   
256 | #> mat_frpear            1.2800     2.3056   0.555  0.60269   
257 | #> mat_frmint            0.6582     2.6242   0.251  0.81193   
258 | #> mat_frapple           1.2595     2.2526   0.559  0.60019   
259 | #> mat_frSPACE_1_AXIS1 -17.8174     4.0535  -4.396  0.00705 **
260 | #> mat_frSPACE_1_AXIS2 -11.3322     2.6924  -4.209  0.00842 **
261 | #> ---
262 | #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
263 | #> 
264 | #> Residual standard error: 2.357 on 5 degrees of freedom
265 | #> Multiple R-squared:  0.9504, Adjusted R-squared:  0.9007 
266 | #> F-statistic: 19.15 on 5 and 5 DF,  p-value: 0.002824
267 | ```
268 | 
269 | You can see that there are no more undefined columns. The columns
270 | `strawberry`, `orange` and `poppyseed` were removed and replaced with
271 | two columns (`SPACE_1_AXIS1`, `SPACE_1_AXIS2`) that are linearly
272 | independent (orthogonal) vectors spanning the space in which the
273 | linearly dependent columns `strawberry`, `orange` and `poppyseed` lay.
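
As a rough cross-check of the statement above, the following base R sketch confirms that the three columns only span a two-dimensional space. It is an illustration under our own assumptions, not a description of how `make_full_rank_matrix()` builds its axes; the names `dep`, `space_rank`, `basis` and `reconstructed` are introduced here for illustration only.

```r
# Base R only; an illustrative check, not the package's internal algorithm.
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed  <- c(0,0,0,0,1,1,1,0,0,0)
orange     <- c(1,1,1,1,1,1,1,0,0,0)
dep <- cbind(strawberry, poppyseed, orange)

# orange is exactly the sum of the other two columns ...
is_sum <- all(orange == strawberry + poppyseed)

# ... so the three columns span a space of lower dimension than three
space_rank <- qr(dep)$rank

# One possible orthonormal basis of that space (hypothetical choice of axes)
basis <- qr.Q(qr(dep))[, seq_len(space_rank), drop = FALSE]

# Projecting the original columns onto the basis reproduces them,
# i.e. no information about the space is lost by keeping only the basis
reconstructed <- basis %*% crossprod(basis, dep)
all.equal(reconstructed, dep, check.attributes = FALSE)
```

Any other pair of independent vectors in the same space would work equally well as axes, which is why individual axis coefficients should not be read as effects of individual ingredients.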
274 | 
275 | The original columns that are contained within a space can be viewed in
276 | the returned `space_list`:
277 | 
278 | ``` r
279 | space_list
280 | #> $SPACE_1
281 | #> [1] "strawberry" "poppyseed"  "orange"
282 | ```
283 | 
284 | The individual axes of the constructed spaces are difficult to
285 | interpret, but we see that the axes of the space of
286 | `strawberry`, `orange` and `poppyseed` show a significant association
287 | with the sweetness of fruit salads. Resolving which of the
288 | three terms is most strongly associated with `sweetness` is not possible
289 | with the given number of observations, but there is clearly an
290 | association of `sweetness` with the space spanned by the three terms.
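
One way to test the association with the space as a whole, rather than axis by axis, is to compare nested models with and without the space axes. The sketch below uses base R only and assumes a stand-in basis built with `qr.Q()` instead of the package's `SPACE_1_AXIS*` columns; `other`, `space_axes`, `fit_reduced`, `fit_full` and `cmp` are hypothetical names chosen for this example.

```r
# Recreate the example data from above (same seed and order of rnorm() calls)
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed  <- c(0,0,0,0,1,1,1,0,0,0)
orange     <- c(1,1,1,1,1,1,1,0,0,0)
pear       <- c(0,0,0,1,0,0,0,1,1,1)
mint       <- c(1,1,0,0,0,0,0,0,0,0)
apple      <- c(0,0,0,0,0,0,1,1,1,1)

set.seed(30)
sweetness <- strawberry * rnorm(10, 4) + poppyseed * rnorm(10, 0.1) +
  orange * rnorm(10, 5) + pear * rnorm(10, 0.5) +
  mint * rnorm(10, 1) + apple * rnorm(10, 2)

# Columns that are not part of the linearly dependent set
other <- cbind(pear, mint, apple)

# Stand-in orthonormal axes for the space of strawberry, poppyseed and orange
# (assumption: any basis of the space gives the same joint test)
space_axes <- qr.Q(qr(cbind(strawberry, poppyseed)))

fit_reduced <- lm(sweetness ~ other + 0)
fit_full    <- lm(sweetness ~ other + space_axes + 0)

# Joint F-test: do the space axes together explain additional variance?
cmp <- anova(fit_reduced, fit_full)
print(cmp)
```

Because the test compares the whole space against its absence, its result does not depend on which particular orthogonal axes were chosen.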
291 | 
292 | If only a subset of all axes of a space show a significant association
293 | in the linear model fit, this could indicate that only a subset of
294 | linearly dependent columns that lie within the space spanned by the
295 | significantly associated axes drive this association. This would require
296 | some more detailed investigation by the user that would be specific to
297 | the use case.
298 | 
299 | ## Other available packages that detect linearly dependent columns
300 | 
301 | There are already a few other packages that offer functions to
302 | detect linearly dependent columns. Here are the ones we are aware of:
303 | 
304 | ``` r
305 | library(fullRankMatrix)
306 | 
307 | # let's say we have 10 fruit salads and indicate which ingredients are present in each salad
308 | strawberry <- c(1,1,1,1,0,0,0,0,0,0)
309 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0)
310 | orange <- c(1,1,1,1,1,1,1,0,0,0)
311 | pear <- c(0,0,0,1,0,0,0,1,1,1)
312 | mint <- c(1,1,0,0,0,0,0,0,0,0)
313 | apple <- c(0,0,0,0,0,0,1,1,1,1)
314 | 
315 | # let's pretend we know how each fruit influences the sweetness of a fruit salad
316 | # in this case we say that strawberries and oranges have the biggest influence on sweetness
317 | set.seed(30)
318 | strawberry_sweet <- strawberry * rnorm(10, 4)
319 | poppyseed_sweet <- poppyseed * rnorm(10, 0.1)
320 | orange_sweet <- orange * rnorm(10, 5)
321 | pear_sweet <- pear * rnorm(10, 0.5)
322 | mint_sweet <- mint * rnorm(10, 1)
323 | apple_sweet <- apple * rnorm(10, 2)
324 | 
325 | sweetness <- strawberry_sweet + poppyseed_sweet+ orange_sweet + pear_sweet +
326 |   mint_sweet + apple_sweet 
327 | 
328 | mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple)
329 | ```
330 | 
331 | **`caret::findLinearCombos()`**:
332 | <https://rdrr.io/cran/caret/man/findLinearCombos.html>
333 | 
334 | This function identifies which columns are linearly dependent and
335 | suggests which columns to remove. But it doesn’t provide appropriate
336 | naming for the remaining columns to indicate that any significant
337 | associations with the remaining columns are actually associations with
338 | the space spanned by the originally linearly dependent columns. Just
339 | removing the indicated columns and then fitting the linear model would
340 | lead to erroneous interpretation.
341 | 
342 | ``` r
343 | caret_result <- caret::findLinearCombos(mat)
344 | ```
345 | 
346 | Fitting a linear model with the `orange` column removed would lead to
347 | the erroneous interpretation that `strawberry` and `poppyseed` have the
348 | biggest influence on the fruit salad `sweetness`, but we know it is
349 | actually `strawberry` and `orange`.
350 | 
351 | ``` r
352 | mat_caret <- mat[, -caret_result$remove]
353 | fit <- lm(sweetness ~ mat_caret + 0)
354 | print(summary(fit))
355 | #> 
356 | #> Call:
357 | #> lm(formula = sweetness ~ mat_caret + 0)
358 | #> 
359 | #> Residuals:
360 | #>        1        2        3        4        5        6        7        8 
361 | #> -2.00934  2.00934 -1.34248  1.34248  0.92807 -2.27054  1.34248 -0.01963 
362 | #>        9       10 
363 | #>  1.26385 -2.58670 
364 | #> 
365 | #> Coefficients:
366 | #>                     Estimate Std. Error t value Pr(>|t|)   
367 | #> mat_caretstrawberry   8.9087     2.0267   4.396  0.00705 **
368 | #> mat_caretpoppyseed    6.5427     1.5544   4.209  0.00842 **
369 | #> mat_caretpear         1.2800     2.3056   0.555  0.60269   
370 | #> mat_caretmint         0.6582     2.6242   0.251  0.81193   
371 | #> mat_caretapple        1.2595     2.2526   0.559  0.60019   
372 | #> ---
373 | #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
374 | #> 
375 | #> Residual standard error: 2.357 on 5 degrees of freedom
376 | #> Multiple R-squared:  0.9504, Adjusted R-squared:  0.9007 
377 | #> F-statistic: 19.15 on 5 and 5 DF,  p-value: 0.002824
378 | ```
379 | 
380 | **`WeightIt::make_full_rank()`**:
381 | <https://rdrr.io/cran/WeightIt/man/make_full_rank.html>
382 | 
383 | This function removes some of the linearly dependent columns to create a
384 | full rank matrix, but doesn’t rename the remaining columns accordingly.
385 | For the user it isn’t clear which columns were linearly dependent and
386 | they can’t choose which column will be removed.
387 | 
388 | ``` r
389 | mat_weightit <- WeightIt::make_full_rank(mat, with.intercept = FALSE)
390 | mat_weightit
391 | #>       strawberry poppyseed pear mint apple
392 | #>  [1,]          1         0    0    1     0
393 | #>  [2,]          1         0    0    1     0
394 | #>  [3,]          1         0    0    0     0
395 | #>  [4,]          1         0    1    0     0
396 | #>  [5,]          0         1    0    0     0
397 | #>  [6,]          0         1    0    0     0
398 | #>  [7,]          0         1    0    0     1
399 | #>  [8,]          0         0    1    0     1
400 | #>  [9,]          0         0    1    0     1
401 | #> [10,]          0         0    1    0     1
402 | ```
403 | 
404 | As above, fitting a linear model with this full rank matrix would lead to
405 | the erroneous interpretation that `strawberry` and `poppyseed` influence the
406 | `sweetness`, but we know it is actually `strawberry` and `orange`.
407 | 
408 | ``` r
409 | fit <- lm(sweetness ~ mat_weightit + 0)
410 | print(summary(fit))
411 | #> 
412 | #> Call:
413 | #> lm(formula = sweetness ~ mat_weightit + 0)
414 | #> 
415 | #> Residuals:
416 | #>        1        2        3        4        5        6        7        8 
417 | #> -2.00934  2.00934 -1.34248  1.34248  0.92807 -2.27054  1.34248 -0.01963 
418 | #>        9       10 
419 | #>  1.26385 -2.58670 
420 | #> 
421 | #> Coefficients:
422 | #>                        Estimate Std. Error t value Pr(>|t|)   
423 | #> mat_weightitstrawberry   8.9087     2.0267   4.396  0.00705 **
424 | #> mat_weightitpoppyseed    6.5427     1.5544   4.209  0.00842 **
425 | #> mat_weightitpear         1.2800     2.3056   0.555  0.60269   
426 | #> mat_weightitmint         0.6582     2.6242   0.251  0.81193   
427 | #> mat_weightitapple        1.2595     2.2526   0.559  0.60019   
428 | #> ---
429 | #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
430 | #> 
431 | #> Residual standard error: 2.357 on 5 degrees of freedom
432 | #> Multiple R-squared:  0.9504, Adjusted R-squared:  0.9007 
433 | #> F-statistic: 19.15 on 5 and 5 DF,  p-value: 0.002824
434 | ```
435 | 
436 | **`plm::detect.lindep()`:**
437 | <https://rdrr.io/cran/plm/man/detect.lindep.html>
438 | 
439 | The function returns which columns are potentially linearly dependent.
440 | 
441 | ``` r
442 | plm::detect.lindep(mat)
443 | #> [1] "Suspicious column number(s): 1, 2, 3"
444 | #> [1] "Suspicious column name(s):   strawberry, poppyseed, orange"
445 | ```
446 | 
447 | However, it doesn’t capture all cases. For example, here
448 | `plm::detect.lindep()` says there are no dependent columns, although there
449 | are several:
450 | 
451 | ``` r
452 | c1 <- rbinom(10, 1, .4)
453 | c2 <- 1-c1
454 | c3 <- integer(10)
455 | c4 <- c1
456 | c5 <- 2*c2
457 | c6 <- rbinom(10, 1, .8)
458 | c7 <- c5+c6
459 | mat_test <- as.matrix(data.frame(c1,c2,c3,c4,c5,c6,c7))
460 | 
461 | plm::detect.lindep(mat_test)
462 | #> [1] "No linear dependent column(s) detected."
463 | ```
464 | 
465 | `fullRankMatrix` captures these cases:
466 | 
467 | ``` r
468 | result <- make_full_rank_matrix(mat_test)
469 | result$matrix
470 | #>       (c1_AND_c4) SPACE_1_AXIS1 SPACE_1_AXIS2
471 | #>  [1,]           1     0.0000000  4.111431e-16
472 | #>  [2,]           0    -0.4082483 -5.419613e-17
473 | #>  [3,]           1     0.0000000  7.071068e-01
474 | #>  [4,]           0    -0.4082483  1.083923e-17
475 | #>  [5,]           1     0.0000000  7.071068e-01
476 | #>  [6,]           0    -0.4082483  1.083923e-17
477 | #>  [7,]           0    -0.4082483  1.083923e-17
478 | #>  [8,]           0    -0.4082483  1.083923e-17
479 | #>  [9,]           1     0.0000000  0.000000e+00
480 | #> [10,]           0    -0.4082483  1.083923e-17
481 | ```
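
The size of the discrepancy can be checked independently with base R by comparing the matrix rank with the number of columns. This is only a sketch: the `set.seed(1)` call is an assumption added here for reproducibility and is not part of the example above, so the random columns will differ from the output shown.

```r
# Base R cross-check of the rank deficiency (seed chosen arbitrarily here)
set.seed(1)
c1 <- rbinom(10, 1, .4)
c2 <- 1 - c1
c3 <- integer(10)   # empty (all-zero) column
c4 <- c1            # exact duplicate of c1
c5 <- 2 * c2        # scaled copy of c2
c6 <- rbinom(10, 1, .8)
c7 <- c5 + c6       # linear combination of c5 and c6
mat_test <- as.matrix(data.frame(c1, c2, c3, c4, c5, c6, c7))

# All seven columns are combinations of c1, c2 and c6,
# so the rank can be at most 3
mat_rank <- qr(mat_test)$rank
c(columns = ncol(mat_test), rank = mat_rank)
```

A rank below the column count means that at least one column is a linear combination of the others, regardless of what a detection function reports.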
482 | 
483 | **`Smisc::findDepMat()`**:
484 | <https://rdrr.io/cran/Smisc/man/findDepMat.html>
485 | 
486 | **NOTE**: this package was removed from CRAN as of 2020-01-26
487 | (<https://CRAN.R-project.org/package=Smisc>) due to failing checks.
488 | 
489 | This function indicates linearly dependent rows/columns, but it doesn’t
490 | state which rows/columns are linearly dependent with each other.
491 | 
492 | However, this function does not seem to work well for one-hot encoded
493 | matrices and the package doesn’t seem to be updated anymore (see this
494 | issue: <https://github.com/pnnl/Smisc/issues/24>).
495 | 
496 |     # example provided by Smisc documentation
497 |     Y <- matrix(c(1, 3, 4,
498 |                   2, 6, 8,
499 |                   7, 2, 9,
500 |                   4, 1, 7,
501 |                   3.5, 1, 4.5), byrow = TRUE, ncol = 3)
502 |     Smisc::findDepMat(t(Y), rows = FALSE)
503 | 
504 | Trying with the model matrix from our example above:
505 | 
506 |     Smisc::findDepMat(mat, rows=FALSE)
507 |     #> Error in if (!depends[j]) { : missing value where TRUE/FALSE needed
508 | 


--------------------------------------------------------------------------------
/cran-comments.md:
--------------------------------------------------------------------------------
1 | ## R CMD check results
2 | 
3 | 0 errors | 0 warnings | 1 note
4 | 
5 | * This is a new release.
6 | 


--------------------------------------------------------------------------------
/inst/CITATION:
--------------------------------------------------------------------------------
 1 | bibentry(
 2 |   bibtype = "Manual",
 3 |   title = "fullRankMatrix: Produce a Full Rank Matrix",
 4 |   author = c(
 5 |     person("Paula", "Weidemüller", role = c("aut", "cre"),
 6 |            email = "p.weidemueller@posteo.de",
 7 |            comment = c(ORCID = "0000-0003-3867-3131")),
 8 |     person("Constantin", "Ahlmann-Eltze", role = c("aut"),
 9 |            email = "abc@gmail.com")
10 |   ),
11 |   year = 2024,
12 |   note = "R package version 0.1.0",
13 |   url = "https://github.com/Pweidemueller/fullRankMatrix"
14 | )
15 | 


--------------------------------------------------------------------------------
/inst/WORDLIST:
--------------------------------------------------------------------------------
1 | doi
2 | jss
3 | 


--------------------------------------------------------------------------------
/man/figures/example_vectors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pweidemueller/fullRankMatrix/ad4bc2dc58642eb7e639ee3e3d79e5dca28f4c4c/man/figures/example_vectors.png


--------------------------------------------------------------------------------
/man/figures/fullRankMatrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pweidemueller/fullRankMatrix/ad4bc2dc58642eb7e639ee3e3d79e5dca28f4c4c/man/figures/fullRankMatrix.png


--------------------------------------------------------------------------------
/man/find_connected_components.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/find_connected_components.R
 3 | \name{find_connected_components}
 4 | \alias{find_connected_components}
 5 | \title{Find connected components in a graph}
 6 | \usage{
 7 | find_connected_components(connections)
 8 | }
 9 | \arguments{
10 | \item{connections}{a list where each element is a vector with connected nodes.
11 | Each node must be either a character or an integer.}
12 | }
13 | \value{
14 | a list where each element is a set of connected items.
15 | }
16 | \description{
 17 | The function performs a depth-first search to find all connected components.
18 | }
19 | \examples{
20 |   find_connected_components(list(c(1,2), c(1,3), c(4,5)))
21 | 
22 | 
23 | }
24 | 


--------------------------------------------------------------------------------
/man/find_linear_dependent_columns.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/find_linear_dependent_columns.R
 3 | \name{find_linear_dependent_columns}
 4 | \alias{find_linear_dependent_columns}
 5 | \title{Find linear dependent columns in a design matrix}
 6 | \usage{
 7 | find_linear_dependent_columns(mat, tol = 1e-12)
 8 | }
 9 | \arguments{
10 | \item{mat}{a matrix}
11 | 
12 | \item{tol}{a double that specifies the numeric tolerance}
13 | }
14 | \value{
15 | a list with vectors containing the indices of linearly dependent columns
16 | }
17 | \description{
18 | Find linear dependent columns in a design matrix
19 | }
20 | \examples{
21 |   mat <- matrix(rnorm(3 * 10), nrow = 10, ncol = 3)
22 |   mat <- cbind(mat, mat[,1] + 0.5 * mat[,3])
23 |   find_linear_dependent_columns(mat)  # returns list(c(1,3,4))
24 | 
25 | }
26 | \seealso{
 27 | The algorithm and function are inspired by the \code{internalEnumLC}
28 | function in the 'caret' package (\href{https://github.com/topepo/caret/blob/679eabaac7e54f4e87efa6c3bff75659cb457d8b/pkg/caret/R/findLinearCombos.R#L33}{GitHub})
29 | }
30 | 


--------------------------------------------------------------------------------
/man/make_full_rank_matrix.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/make_full_rank_matrix.R
 3 | \name{make_full_rank_matrix}
 4 | \alias{make_full_rank_matrix}
 5 | \title{Create a full rank matrix}
 6 | \usage{
 7 | make_full_rank_matrix(mat, verbose = FALSE)
 8 | }
 9 | \arguments{
10 | \item{mat}{A matrix.}
11 | 
12 | \item{verbose}{Print how column numbers change with each operation.}
13 | }
14 | \value{
15 | a list containing:
16 | \itemize{
17 | \item \code{matrix}: A matrix of full rank. Column headers will be renamed to reflect how columns depend on each other.
18 | \itemize{
19 | \item \code{(c1_AND_c2)} If multiple columns are exactly identical, only a single instance is retained.
20 | \item \verb{SPACE_<i>_AXIS<j>} For each set of linearly dependent columns, a space \code{i} with \code{max(j)} dimensions was created using orthogonal axes to replace the original columns.
21 | }
22 | \item \code{space_list}: A named list where each element corresponds to a space and contains the names of the original linearly dependent columns that are contained within that space.
23 | }
24 | }
25 | \description{
26 | First remove empty columns. Then discover linear dependent columns. For each set of linearly dependent columns, create orthogonal vectors that span the space. Add these vectors as columns to the final matrix to replace the linearly dependent columns.
27 | }
28 | \examples{
29 | # Create a 1-hot encoded (zero/one) matrix
30 | c1 <- rbinom(10, 1, .4)
31 | c2 <- 1-c1
32 | c3 <- integer(10)
33 | c4 <- c1
34 | c5 <- 2*c2
35 | c6 <- rbinom(10, 1, .8)
36 | c7 <- c5+c6
37 | # Turn into matrix
38 | mat <- cbind(c1, c2, c3, c4, c5, c6, c7)
39 | # Turn the matrix into full rank, this will:
40 | # 1. remove empty columns (all zero)
41 | # 2. merge columns with the same entries (duplicates)
42 | # 3. identify linearly dependent columns
43 | # 4. replace them with orthogonal vectors that span the same space
44 | result <- make_full_rank_matrix(mat)
45 | # verbose=TRUE will give details on how many columns are removed in every step
46 | result <- make_full_rank_matrix(mat, verbose=TRUE)
 47 | # look at the created full rank matrix
48 | mat_full <- result$matrix
49 | # check which linearly dependent columns spanned the identified spaces
50 | spaces <- result$space_list
51 | }
52 | 


--------------------------------------------------------------------------------
/man/validate_column_names.Rd:
--------------------------------------------------------------------------------
 1 | % Generated by roxygen2: do not edit by hand
 2 | % Please edit documentation in R/validate_column_names.R
 3 | \name{validate_column_names}
 4 | \alias{validate_column_names}
 5 | \title{Validate Column Names}
 6 | \usage{
 7 | validate_column_names(names)
 8 | }
 9 | \arguments{
10 | \item{names}{A character vector of column names to validate.}
11 | }
12 | \value{
13 | Returns \code{TRUE} if all checks pass. If any check fails, the function stops and returns an error message.
14 | }
15 | \description{
16 | This function checks a vector of column names to ensure they are valid. It performs the following checks:
17 | \itemize{
18 | \item The column names must not be \code{NULL}.
19 | \item The column names must not contain empty strings.
20 | \item The column names must not contain \code{NA} values.
21 | \item The column names must be unique.
22 | }
23 | }
24 | \examples{
25 | validate_column_names(c("name", "age", "gender"))
26 | 
27 | }
28 | 


--------------------------------------------------------------------------------
/tests/spelling.R:
--------------------------------------------------------------------------------
1 | if(requireNamespace('spelling', quietly = TRUE))
2 |   spelling::spell_check_test(vignettes = TRUE, error = FALSE,
3 |                              skip_on_cran = TRUE)
4 | 


--------------------------------------------------------------------------------
/tests/testthat.R:
--------------------------------------------------------------------------------
1 | library(testthat)
2 | library(fullRankMatrix)
3 | 
4 | test_check("fullRankMatrix")
5 | 


--------------------------------------------------------------------------------
/tests/testthat/test-find_connected_components.R:
--------------------------------------------------------------------------------
 1 | test_that("find_connected_components", {
 2 | 
 3 |   connections <- list(
 4 |     c(1,2), c(2, 5), c(4, 3, 5),
 5 |     c(6, 7),
 6 |     c(8)
 7 |   )
 8 | 
 9 |   res <- find_connected_components(connections)
10 |   expect_setequal(res[[1]], c(1,2,3,4,5))
11 |   expect_setequal(res[[2]], c(6, 7))
12 |   expect_setequal(res[[3]], c(8))
13 |   expect_equal(length(res), 3)
14 | 
15 |   system.time(
16 |     res <- find_connected_components(list(1:1000))
17 |   )
18 |   expect_setequal(res[[1]], 1:1000)
19 | })
20 | 
21 | test_that("find_connected_components finds same components as igraph", {
22 |   testthat::skip_if_not_installed("igraph")
23 |   # Check results against igraph
24 |   edges <- as.character(as.integer(sample(1:500, size = 600, replace = TRUE)))
25 |   gr <- igraph::make_undirected_graph(edges)
26 |   connections <- asplit(igraph::as_edgelist(gr), 1)
27 |   res <- find_connected_components(connections)
28 |   ires <- igraph::components(gr)
29 |   expect_equal(sort(lengths(res)), sort(ires$csize))
30 | 
31 |   mem <- ires$membership
32 |   ires_list <- lapply(seq_len(max(mem)), \(idx){
33 |     names(mem)[mem == idx]
34 |   })
35 |   for(comp in res){
36 |     matched <- FALSE
37 |     for(comp2 in ires_list){
38 |       if(length(comp) == length(comp2) && all(sort(comp) == sort(comp2))){
39 |         matched <- TRUE
40 |         break
41 |       }
42 |     }
43 |     expect_true(matched)
44 |   }
45 | 
46 | })
47 | 


--------------------------------------------------------------------------------
/tests/testthat/test-find_linear_dependent_columns.R:
--------------------------------------------------------------------------------
 1 | test_that("find_linear_dependent_columns works", {
 2 | 
 3 |   mat <- matrix(rnorm(n = 10 * 4), nrow = 10, ncol = 4)
 4 |   mat <- cbind(mat, mat[,1] + 2 * mat[,2], rnorm(10))
 5 |   expect_equal(find_linear_dependent_columns(mat), list(c(1,2,5)))
 6 | 
 7 |   mat <- cbind(mat, 0.3 * mat[,3] + 0.4 * mat[,4], rnorm(10))
 8 |   expect_equal(find_linear_dependent_columns(mat), list(c(1,2,5), c(3,4,7)))
 9 | 
10 |   mat <- cbind(mat, 2 * mat[,1] + 4 * mat[,4])
11 |   expect_equal(find_linear_dependent_columns(mat), list(c(1,2,3,4,5,7,9)))
12 | 
13 | })
14 | 
15 | test_that("find_linear_dependent_columns errors if input is not of type matrix", {
16 |   set.seed(1000)
17 |   c1 <- rbinom(10, 1, .4)
18 |   c2 <- 1-c1
19 |   c3 <- integer(10)
20 |   c4 <- c1
21 |   mat <- data.frame(c1, c2, c3, c4)
22 |   expect_error(find_linear_dependent_columns(mat))
23 | })
24 | 


--------------------------------------------------------------------------------
/tests/testthat/test-make_full_rank_matrix.R:
--------------------------------------------------------------------------------
  1 | test_that("check that a full rank matrix is returned unchanged", {
  2 | 
  3 |   df <- data.frame(col = sample(letters[1:3], 10, replace = TRUE))
  4 |   mat <- model.matrix(~ col - 1, data = df)
  5 |   result <- make_full_rank_matrix(mat)
  6 |   mat2 <- result$matrix
  7 |   expect_equal(mat, mat2, ignore_attr = TRUE)
  8 |   expect_equal(length(result$space_list), 0)
  9 | 
 10 | })
 11 | 
 12 | 
 13 | test_that("removing empty columns as expected", {
 14 |   set.seed(1000)
 15 |   c1 <- rbinom(10, 1, .4)
 16 |   c2 <- 1-c1
 17 |   c3 <- integer(10)
 18 |   c4 <- integer(10)
 19 |   mat <- as.matrix(data.frame(c1, c2, c3, c4))
 20 |   mat_empty <- as.matrix(data.frame(c1, c2))
 21 |   mat2 <- remove_empty_columns(mat)
 22 |   expect_equal(mat2, mat_empty)
 23 | 
 24 |   mat <- as.matrix(data.frame(c1, c2))
 25 |   mat2 <- remove_empty_columns(mat)
 26 |   expect_equal(mat2, mat)
 27 | 
 28 | 
 29 | })
 30 | 
 31 | test_that("returning names of empty columns", {
 32 |   set.seed(1000)
 33 |   c1 <- rbinom(10, 1, .4)
 34 |   c2 <- 1-c1
 35 |   c3 <- integer(10)
 36 |   c4 <- integer(10)
 37 |   mat <- as.matrix(data.frame(c1, c2, c3, c4))
 38 |   empty_cols <- find_empty_columns(mat, return_names=TRUE)
 39 |   expect_equal(empty_cols, c("c3", "c4"))
 40 | 
 41 |   mat <- as.matrix(data.frame(c1, c2))
 42 |   mat2 <- remove_empty_columns(mat)
 43 |   expect_equal(mat2, mat)
 44 | })
 45 | 
 46 | test_that("merging duplicated columns as expected", {
 47 |   set.seed(1000)
 48 |   c1 <- rbinom(10, 1, .4)
 49 |   c2 <- 1-c1
 50 |   c3 <- integer(10)
 51 |   c4 <- c1
 52 |   mat <- as.matrix(data.frame(c1, c2, c3, c4))
 53 |   mat_dedupl <- as.matrix(data.frame(c1, c2, c3))
 54 |   colnames(mat_dedupl) <- c('(c1_AND_c4)', 'c2', 'c3')
 55 |   mat2 <- merge_duplicated(mat)
 56 |   expect_equal(mat2, mat_dedupl)
 57 | 
 58 |   mat <- as.matrix(data.frame(c1, c2))
 59 |   mat2 <- merge_duplicated(mat)
 60 |   expect_equal(mat2, mat)
 61 | })
 62 | 
 63 | test_that("make full rank column as expected", {
 64 |   intercept <- rep(1,10)
 65 |   c1 <- rbinom(10, 1, .4)
 66 |   c2 <- 1-c1
 67 |   c3 <- rnorm(10)
 68 |   c4 <- 10*c3
 69 |   c5 <- c1
 70 |   c6 <- integer(10)
 71 | 
 72 |   mat <- as.matrix(data.frame(intercept, c1, c2, c3, c4, c5, c6))
 73 |   result <- make_full_rank_matrix(mat)
 74 |   red_mat <- result$matrix
 75 |   space_list <- result$space_list
 76 |   expect_equal(ncol(red_mat), 3)
 77 |   expect_equal(qr(red_mat)$rank, 3)
 78 |   expect_equal(colnames(red_mat), c("SPACE_1_AXIS1", "SPACE_1_AXIS2", "SPACE_2_AXIS1"))
 79 |   expect_equal(length(space_list), 2)
 80 |   expect_equal(space_list$SPACE_1, c("intercept", "(c1_AND_c5)", "c2"))
 81 |   expect_equal(space_list$SPACE_2, c("c3", "c4"))
 82 | 
 83 |   mat <- as.matrix(data.frame(c1, c2, c3, c4, c5, c6))
 84 |   result <- make_full_rank_matrix(mat)
 85 |   red_mat <- result$matrix
 86 |   space_list <- result$space_list
 87 |   expect_equal(ncol(red_mat), 3)
 88 |   expect_equal(qr(red_mat)$rank, 3)
 89 |   expect_equal(colnames(red_mat), c("(c1_AND_c5)", "c2", "SPACE_1_AXIS1"))
 90 | })
 91 | 
 92 | test_that("merge_duplicated errors if input is not of type matrix", {
 93 |   set.seed(1000)
 94 |   c1 <- rbinom(10, 1, .4)
 95 |   c2 <- 1-c1
 96 |   c3 <- integer(10)
 97 |   c4 <- c1
 98 |   mat <- data.frame(c1, c2, c3, c4)
 99 |   expect_error(merge_duplicated(mat))
100 | })
101 | 
102 | test_that("collapse_linearly_dependent_columns works", {
103 | 
104 |   mat <- matrix(rnorm(n = 10 * 4), nrow = 10, ncol = 4, dimnames = list(character(0L), paste0("col_", 1:4)))
105 |   mat <- cbind(mat, comb_12 = mat[,1] + 2 * mat[,2], col_6 = rnorm(10))
106 |   result <- collapse_linearly_dependent_columns(mat)
107 |   red_mat <- result$matrix
108 |   space_list <- result$space_list
109 |   expect_equal(ncol(red_mat), 5)
110 |   expect_equal(qr(red_mat)$rank, 5)
111 |   expect_equal(colnames(red_mat), c("col_3", "col_4", "col_6", "SPACE_1_AXIS1", "SPACE_1_AXIS2"))
112 |   expect_equal(length(space_list), 1)
113 |   expect_equal(space_list[[1]], c("col_1", "col_2", "comb_12"))
114 | 
115 |   mat <- cbind(mat, comb_34 = 0.3 * mat[,3] + 0.4 * mat[,4], col_8 = rnorm(10))
116 |   result <- collapse_linearly_dependent_columns(mat)
117 |   red_mat <- result$matrix
118 |   space_list <- result$space_list
119 |   expect_equal(ncol(red_mat), 6)
120 |   expect_equal(qr(red_mat)$rank, 6)
121 |   expect_equal(colnames(red_mat), c("col_6", "col_8", "SPACE_1_AXIS1", "SPACE_1_AXIS2", "SPACE_2_AXIS1", "SPACE_2_AXIS2"))
122 | 
123 |   c1 <- rbinom(10, 1, .4)
124 |   c2 <- c1*2
125 |   c3 <- c1*3
126 |   c4 <- c1*0.5
127 |   mat <- cbind(c1,c2,c3,c4)
128 |   result <- make_full_rank_matrix(mat)
129 |   red_mat <- result$matrix
130 |   space_list <- result$space_list
131 |   expect_equal(colnames(red_mat), c("SPACE_1_AXIS1"))
132 |   expect_equal(space_list$SPACE_1, c("c1","c2","c3","c4"))
133 | 
134 | })
135 | 


--------------------------------------------------------------------------------
/vignettes/.gitignore:
--------------------------------------------------------------------------------
1 | *.html
2 | *.R
3 | 


--------------------------------------------------------------------------------
/vignettes/fullrankmat-comparison.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "fullRankMatrix - Comparison to other packages"
  3 | output: rmarkdown::html_vignette
  4 | vignette: >
  5 |   %\VignetteIndexEntry{fullrankmat-comparison}
  6 |   %\VignetteEngine{knitr::rmarkdown}
  7 |   %\VignetteEncoding{UTF-8}
  8 | ---
  9 | 
 10 | ```{r, include = FALSE}
 11 | knitr::opts_chunk$set(
 12 |   collapse = TRUE,
 13 |   comment = "#>"
 14 | )
 15 | ```
 16 | 
 17 | 
  18 | ## Other available packages that detect linearly dependent columns
  19 | There are already a few other packages that offer functions to detect linearly dependent columns. Here are the ones we are aware of:
 20 | 
 21 | ```{r}
 22 | library(fullRankMatrix)
 23 | 
 24 | # let's say we have 10 fruit salads and indicate which ingredients are present in each salad
 25 | strawberry <- c(1,1,1,1,0,0,0,0,0,0)
 26 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0)
 27 | orange <- c(1,1,1,1,1,1,1,0,0,0)
 28 | pear <- c(0,0,0,1,0,0,0,1,1,1)
 29 | mint <- c(1,1,0,0,0,0,0,0,0,0)
 30 | apple <- c(0,0,0,0,0,0,1,1,1,1)
 31 | 
 32 | # let's pretend we know how each fruit influences the sweetness of a fruit salad
 33 | # in this case we say that strawberries and oranges have the biggest influence on sweetness
 34 | set.seed(30)
 35 | strawberry_sweet <- strawberry * rnorm(10, 4)
 36 | poppyseed_sweet <- poppyseed * rnorm(10, 0.1)
 37 | orange_sweet <- orange * rnorm(10, 5)
 38 | pear_sweet <- pear * rnorm(10, 0.5)
 39 | mint_sweet <- mint * rnorm(10, 1)
 40 | apple_sweet <- apple * rnorm(10, 2)
 41 | 
 42 | sweetness <- strawberry_sweet + poppyseed_sweet+ orange_sweet + pear_sweet +
 43 |   mint_sweet + apple_sweet 
 44 | 
 45 | mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple)
 46 | 
 47 | ```
 48 | 
 49 | 
 50 | 
 51 | **`caret::findLinearCombos()`**: https://rdrr.io/cran/caret/man/findLinearCombos.html
 52 | 
 53 | This function identifies which columns are linearly dependent and suggests which columns to remove. But it doesn't provide appropriate naming for the remaining columns to indicate that any significant associations with the remaining columns are actually associations with the space spanned by the originally linearly dependent columns. Just removing the indicated columns and then fitting the linear model would lead to erroneous interpretation.
 54 | ```{r}
 55 | caret_result <- caret::findLinearCombos(mat)
 56 | ```
 57 | Fitting a linear model with the `orange` column removed would lead to the erroneous interpretation that `strawberry` and `poppyseed` have the biggest influence on the fruit salad `sweetness`, but we know it is actually `strawberry` and `orange`.
 58 | ```{r}
 59 | mat_caret <- mat[, -caret_result$remove]
 60 | fit <- lm(sweetness ~ mat_caret + 0)
 61 | print(summary(fit))
 62 | ```
 63 | 
 64 | 
 65 | **`WeightIt::make_full_rank()`**: https://rdrr.io/cran/WeightIt/man/make_full_rank.html
 66 | 
 67 | This function removes some of the linearly dependent columns to create a full rank matrix, but doesn't rename the remaining columns accordingly. For the user it isn't clear which columns were linearly dependent and they can't choose which column will be removed.
 68 | ```{r}
 69 | mat_weightit <- WeightIt::make_full_rank(mat, with.intercept = FALSE)
 70 | mat_weightit
 71 | ```
  72 | As above, fitting a linear model with this full rank matrix would lead to the erroneous interpretation that `strawberry` and `poppyseed` influence the `sweetness`, but we know it is actually `strawberry` and `orange`.
 73 | 
 74 | ```{r}
 75 | fit <- lm(sweetness ~ mat_weightit + 0)
 76 | print(summary(fit))
 77 | ```
 78 | 
 79 | 
 80 | **`plm::detect.lindep()`:** https://rdrr.io/cran/plm/man/detect.lindep.html
 81 | 
 82 | The function returns which columns are potentially linearly dependent.
 83 | ```{r}
 84 | plm::detect.lindep(mat)
 85 | ```
 86 | 
  87 | However, it doesn't capture all cases. For example, here `plm::detect.lindep()` says there are no dependent columns, although there are several:
 88 | ```{r}
 89 | c1 <- rbinom(10, 1, .4)
 90 | c2 <- 1-c1
 91 | c3 <- integer(10)
 92 | c4 <- c1
 93 | c5 <- 2*c2
 94 | c6 <- rbinom(10, 1, .8)
 95 | c7 <- c5+c6
 96 | mat_test <- as.matrix(data.frame(c1,c2,c3,c4,c5,c6,c7))
 97 | 
 98 | plm::detect.lindep(mat_test)
 99 | ```
100 | 
101 | `fullRankMatrix` captures these cases:
102 | ```{r}
103 | result <- make_full_rank_matrix(mat_test)
104 | result$matrix
105 | ```
106 | **`Smisc::findDepMat()`**: https://rdrr.io/cran/Smisc/man/findDepMat.html
107 | 
108 | **NOTE**: this package was removed from CRAN as of 2020-01-26 (https://CRAN.R-project.org/package=Smisc) due to failing checks.
109 | 
110 | This function indicates linearly dependent rows/columns, but it doesn't state which rows/columns are linearly dependent with each other.
111 | 
112 | However, this function doesn't seem to work well for one-hot encoded matrices, and the package no longer seems to be maintained (see this issue: https://github.com/pnnl/Smisc/issues/24).
113 | ```
114 | # example provided by Smisc documentation
115 | Y <- matrix(c(1, 3, 4,
116 |               2, 6, 8,
117 |               7, 2, 9,
118 |               4, 1, 7,
119 |               3.5, 1, 4.5), byrow = TRUE, ncol = 3)
120 | Smisc::findDepMat(t(Y), rows = FALSE)
121 | ```
122 | 
123 | Trying with the model matrix from our example above:
124 | ```
125 | Smisc::findDepMat(mat, rows=FALSE)
126 | #> Error in if (!depends[j]) { : missing value where TRUE/FALSE needed
127 | ```
128 | 
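129 | Since `Smisc::findDepMat()` errors on the one-hot encoded matrix, a package-independent sanity check can still confirm that a dependency exists. The chunk below is only a minimal sketch using base R's `qr()`. It assumes the fruit salad design matrix from the example and rebuilds it under the hypothetical name `mat_check` so that the chunk is self-contained. Unlike `fullRankMatrix`, it doesn't report which columns depend on each other.
130 | ```{r}
131 | # rebuild the fruit salad design matrix (`mat_check` is a hypothetical name)
132 | strawberry <- c(1,1,1,1,0,0,0,0,0,0)
133 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0)
134 | orange <- c(1,1,1,1,1,1,1,0,0,0)
135 | pear <- c(0,0,0,1,0,0,0,1,1,1)
136 | mint <- c(1,1,0,0,0,0,0,0,0,0)
137 | apple <- c(0,0,0,0,0,0,1,1,1,1)
138 | mat_check <- cbind(strawberry, poppyseed, orange, pear, mint, apple)
139 | 
140 | # numerical rank from a pivoted QR decomposition (base R)
141 | rank_check <- qr(mat_check)$rank
142 | 
143 | # fewer independent columns than columns means the matrix is rank deficient
144 | rank_check < ncol(mat_check)
145 | ```
146 | A result of `TRUE` only says that at least one dependency exists. Finding and naming the dependent columns still requires a tool such as `make_full_rank_matrix()`.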


--------------------------------------------------------------------------------
/vignettes/fullrankmat-example.Rmd:
--------------------------------------------------------------------------------
  1 | ---
  2 | title: "fullRankMatrix - Generation of Full Rank Design Matrix"
  3 | output: rmarkdown::html_vignette
  4 | vignette: >
  5 |   %\VignetteIndexEntry{fullrankmat-example}
  6 |   %\VignetteEngine{knitr::rmarkdown}
  7 |   %\VignetteEncoding{UTF-8}
  8 | ---
  9 | 
 10 | ```{r, include = FALSE}
 11 | knitr::opts_chunk$set(
 12 |   collapse = TRUE,
 13 |   comment = "#>"
 14 | )
 15 | ```
 16 | 
 17 | <!-- badges: start -->
 18 | <!-- badges: end -->
 19 | 
 20 | ```{r, echo=FALSE, out.width="20%", fig.align = 'left'}
 21 | knitr::include_graphics("man/figures/fullRankMatrix.png")
 22 | ```
 23 | 
 24 | We developed `fullRankMatrix` primarily for one-hot encoded design matrices used in linear models. In our case, we were faced with a one-hot encoded design matrix that had many linearly dependent columns, which arose from modeling a large number of interaction terms. Since fitting a linear model on a design matrix with linearly dependent columns produces results that can be misinterpreted (see example below), we decided to develop a package that helps identify linearly dependent columns and replaces them with columns constructed of orthogonal vectors that span the space of the previously linearly dependent columns.
 25 | 
 26 | The goal of `fullRankMatrix` is to remove empty columns (containing only 0s), merge duplicated columns (containing the same entries) and merge linearly dependent columns. These operations create a matrix of full rank. The changes made to the columns are reflected in the column headers, so that the columns can still be interpreted if the matrix is used in, e.g., a linear model fit.
 27 | 
 28 | ## Installation
 29 | 
 30 | You can install `fullRankMatrix` directly from CRAN. Just paste the following snippet into your R console:
 31 | 
 32 | ```r
 33 | install.packages("fullRankMatrix")
 34 | ```
 35 | 
 36 | 
 37 | You can install the development version of `fullRankMatrix` from [GitHub](https://github.com/Pweidemueller/fullRankMatrix) with:
 38 | 
 39 | ```r
 40 | # install.packages("devtools")
 41 | devtools::install_github("Pweidemueller/fullRankMatrix")
 42 | ```
 43 | 
 44 | ## Citation
 45 | 
 46 | If you want to cite this package in a publication, you can run the following
 47 | command in your R console:
 48 | 
 49 | ```{r citation}
 50 | citation("fullRankMatrix")
 51 | ```
 52 | 
 53 | ## Linearly dependent columns span a space of a certain dimension
 54 | To visualize this, let's look at a very simple example. Say we have a matrix with three columns, each with three entries. These columns can be visualized as vectors in a coordinate system with 3 axes (see image). The first vector lies in the plane spanned by the first and third axes. The second and third vectors lie in the plane spanned by the first and second axes. Since this is a very simple example, we immediately spot that the third column is a multiple of the second column. Their corresponding vectors lie perfectly on top of each other. This means that instead of spanning a 2D space, the two columns occupy just a line, i.e. a 1D space. `fullRankMatrix` identifies this and replaces these two linearly dependent columns with one vector that describes the 1D space in which column 2 and column 3 used to lie. The resulting matrix is now full rank with no linearly dependent columns.
 55 | 
 56 | 
 57 | ```{r setup}
 58 | library(fullRankMatrix)
 59 | ```
 60 | 
 61 | ```{r}
 62 | c1 <- c(1, 0, 1)
 63 | c2 <- c(1, 2, 0)
 64 | c3 <- c(2, 4, 0)
 65 | 
 66 | mat <- cbind(c1, c2, c3)
 67 | 
 68 | make_full_rank_matrix(mat)
 69 | ```
 70 | ```{r, out.width="100%", fig.cap="Visualisation of identifying and replacing linearly dependent columns."}
 71 | knitr::include_graphics("man/figures/example_vectors.png")
 72 | ```
 73 | 
 74 | ## Worked through example
 75 | 
 76 | The example above was rather abstract but easy to visualize. Let's now walk through what `fullRankMatrix` does when applied to a more realistic design matrix.
 77 | 
 78 | When using linear models, you should check whether any of the columns in your design matrix are linearly dependent. If there are, this will alter the interpretation of the fit. Here is a rather constructed example where we are interested in identifying which ingredients contribute most to the sweetness of fruit salads.
 79 | ```{r}
 80 | # let's say we have 10 fruit salads and indicate which ingredients are present in each salad
 81 | strawberry <- c(1,1,1,1,0,0,0,0,0,0)
 82 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0)
 83 | orange <- c(1,1,1,1,1,1,1,0,0,0)
 84 | pear <- c(0,0,0,1,0,0,0,1,1,1)
 85 | mint <- c(1,1,0,0,0,0,0,0,0,0)
 86 | apple <- c(0,0,0,0,0,0,1,1,1,1)
 87 | 
 88 | # let's pretend we know how each fruit influences the sweetness of a fruit salad
 89 | # in this case we say that strawberries and oranges have the biggest influence on sweetness
 90 | set.seed(30)
 91 | strawberry_sweet <- strawberry * rnorm(10, 4)
 92 | poppyseed_sweet <- poppyseed * rnorm(10, 0.1)
 93 | orange_sweet <- orange * rnorm(10, 5)
 94 | pear_sweet <- pear * rnorm(10, 0.5)
 95 | mint_sweet <- mint * rnorm(10, 1)
 96 | apple_sweet <- apple * rnorm(10, 2)
 97 | 
 98 | sweetness <- strawberry_sweet + poppyseed_sweet + orange_sweet + pear_sweet +
 99 |   mint_sweet + apple_sweet
100 | 
101 | mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple)
102 | 
103 | fit <- lm(sweetness ~ mat + 0)
104 | print(summary(fit))
105 | ```
106 | As you can see, `lm` reports that "1 [column] not defined because of singularities" (`matorange` is not defined), but it doesn't indicate which columns it is linearly dependent on.
107 | 
108 | So if you just looked at the estimates and did not consider the `NA` further, you would conclude that `strawberry` and `poppyseed` are the biggest contributors to the sweetness of fruit salads.
109 | 
110 | However, when you look at the model matrix you can see that the `orange` column is a linear combination of the `strawberry` and `poppyseed` columns (or vice versa). So any of the three factors could truly contribute to the sweetness of a fruit salad; the linear model has no way of recovering which one, given these 10 examples. Since we constructed this example, we know that `orange` and `strawberry` are the sweetest and `poppyseed` contributes least to the sweetness.
111 | ```{r}
112 | mat
113 | ```
114 | To make such cases more obvious and to still be able to interpret the linear model fit correctly, we wrote `fullRankMatrix`. Its `make_full_rank_matrix()` function removes linearly dependent columns and renames the remaining columns to make the dependencies clear.
115 | 
116 | ```{r}
117 | library(fullRankMatrix)
118 | result <- make_full_rank_matrix(mat)
119 | mat_fr <- result$matrix
120 | space_list <- result$space_list
121 | mat_fr
122 | ```
123 | 
124 | ```{r}
125 | fit <- lm(sweetness ~ mat_fr + 0)
126 | print(summary(fit))
127 | ```
128 | You can see that there are no more undefined columns. The columns `strawberry`, `orange` and `poppyseed` were removed and replaced with two columns (`SPACE_1_AXIS1`, `SPACE_1_AXIS2`), which are linearly independent (orthogonal) vectors that span the space in which the linearly dependent columns `strawberry`, `orange` and `poppyseed` lay.
129 | 
130 | The original columns that are contained within a space can be viewed in the returned `space_list`:
131 | ```{r}
132 | space_list
133 | ```
134 | 
135 | The individual axes of the constructed spaces are difficult to interpret, but we see that the axes of the space of `strawberry`, `orange` and `poppyseed` show a significant association with the sweetness of fruit salads. Resolving which of the three terms is most strongly associated with `sweetness` is not possible with the given observations, but there is definitely an association of `sweetness` with the space spanned by the three terms.
136 | 
137 | If only a subset of all axes of a space show a significant association in the linear model fit, this could indicate that only a subset of linearly dependent columns that lie within the space spanned by the significantly associated axes drive this association. This would require some more detailed investigation by the user that would be specific to the use case.
138 | 
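139 | As a sanity check on the worked example, the dependency and its consequence for the fit can be reproduced with base R alone. This is only a sketch: it rebuilds the ingredient columns so the chunk is self-contained and uses a simulated response under the hypothetical name `sweetness_check`, since the argument holds for any response.
140 | ```{r}
141 | # same ingredient indicators as in the worked example
142 | strawberry <- c(1,1,1,1,0,0,0,0,0,0)
143 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0)
144 | orange <- c(1,1,1,1,1,1,1,0,0,0)
145 | pear <- c(0,0,0,1,0,0,0,1,1,1)
146 | mint <- c(1,1,0,0,0,0,0,0,0,0)
147 | apple <- c(0,0,0,0,0,0,1,1,1,1)
148 | 
149 | # simulated response (assumption: any response works for this check)
150 | set.seed(30)
151 | sweetness_check <- 4 * strawberry + 5 * orange + 2 * apple + rnorm(10)
152 | 
153 | # orange is exactly the sum of strawberry and poppyseed
154 | all(orange == strawberry + poppyseed)
155 | 
156 | # dropping orange or dropping poppyseed leaves the same column space,
157 | # so both fits produce the same fitted values
158 | fit_no_orange <- lm(sweetness_check ~ strawberry + poppyseed + pear + mint + apple + 0)
159 | fit_no_poppyseed <- lm(sweetness_check ~ strawberry + orange + pear + mint + apple + 0)
160 | same_fit <- isTRUE(all.equal(fitted(fit_no_orange), fitted(fit_no_poppyseed)))
161 | same_fit
162 | ```
163 | Both checks should return `TRUE`: the data cannot tell a model containing `poppyseed` apart from one containing `orange`, which is why the association can only be attributed to the shared space.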


--------------------------------------------------------------------------------
/vignettes/man/figures/example_vectors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pweidemueller/fullRankMatrix/ad4bc2dc58642eb7e639ee3e3d79e5dca28f4c4c/vignettes/man/figures/example_vectors.png


--------------------------------------------------------------------------------
/vignettes/man/figures/fullRankMatrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pweidemueller/fullRankMatrix/ad4bc2dc58642eb7e639ee3e3d79e5dca28f4c4c/vignettes/man/figures/fullRankMatrix.png


--------------------------------------------------------------------------------