├── .Rbuildignore ├── .gitignore ├── DESCRIPTION ├── LICENSE ├── LICENSE.md ├── NAMESPACE ├── NEWS.md ├── R ├── find_connected_components.R ├── find_linear_dependent_columns.R ├── make_full_rank_matrix.R └── validate_column_names.R ├── README.Rmd ├── README.md ├── cran-comments.md ├── inst ├── CITATION └── WORDLIST ├── man ├── figures │ ├── example_vectors.png │ └── fullRankMatrix.png ├── find_connected_components.Rd ├── find_linear_dependent_columns.Rd ├── make_full_rank_matrix.Rd └── validate_column_names.Rd ├── tests ├── spelling.R ├── testthat.R └── testthat │ ├── test-find_connected_components.R │ ├── test-find_linear_dependent_columns.R │ └── test-make_full_rank_matrix.R └── vignettes ├── .gitignore ├── fullrankmat-comparison.Rmd ├── fullrankmat-example.Rmd └── man └── figures ├── example_vectors.png └── fullRankMatrix.png /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^renv$ 2 | ^renv\.lock$ 3 | ^fullRankMatrix\.Rproj$ 4 | ^\.Rproj\.user$ 5 | ^LICENSE\.md$ 6 | ^README\.Rmd$ 7 | ^SPACES$ 8 | ^cran-comments\.md$ 9 | ^CRAN-SUBMISSION$ 10 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | *.Rproj 3 | .Rprofile 4 | .Rhistory 5 | renv 6 | renv.lock 7 | .Rdata 8 | .httr-oauth 9 | .DS_Store 10 | inst/doc 11 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: fullRankMatrix 2 | Title: Generation of Full Rank Design Matrix 3 | Version: 0.1.0.9000 4 | Authors@R: 5 | c( 6 | person("Paula", "Weidemueller", , "paulahw3214@gmail.com", role = c("aut", "cre", "cph"), comment = c(ORCID = "0000-0003-3867-3131", Twitter = "@PaulaH_W")), 7 | person("Constantin", "Ahlmann-Eltze", , "artjom31415@googlemail.com", role = c("aut"), comment = c(ORCID = 
"0000-0002-3762-068X", Twitter = "@const_ae")) 8 | ) 9 | Description: Creates a full rank matrix out of a given matrix. 10 | The intended use is for one-hot encoded design matrices that should be used in linear models to ensure that significant associations can be correctly interpreted. However, 'fullRankMatrix' can be applied to any matrix to make it full rank. 11 | It removes columns with only 0's, merges duplicated columns and discovers linearly dependent columns and replaces them with linearly independent columns that span the space of the original columns. Columns are renamed to reflect those modifications. 12 | This results in a full rank matrix that can be used as a design matrix in linear models. The algorithm and some functions are inspired by Kuhn, M. (2008) . 13 | License: MIT + file LICENSE 14 | Encoding: UTF-8 15 | Roxygen: list(markdown = TRUE) 16 | RoxygenNote: 7.3.1 17 | Suggests: 18 | knitr, 19 | rmarkdown, 20 | igraph, 21 | testthat (>= 3.0.0), 22 | WeightIt, 23 | caret, 24 | plm, 25 | spelling 26 | Config/testthat/edition: 3 27 | VignetteBuilder: knitr 28 | URL: https://github.com/Pweidemueller/fullRankMatrix 29 | BugReports: https://github.com/Pweidemueller/fullRankMatrix/issues 30 | Language: en-US 31 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | YEAR: 2024 2 | COPYRIGHT HOLDER: fullRankMatrix authors 3 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # MIT License 2 | 3 | Copyright (c) 2024 fullRankMatrix authors 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, 
merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | export(find_connected_components) 4 | export(find_linear_dependent_columns) 5 | export(make_full_rank_matrix) 6 | export(validate_column_names) 7 | importFrom(stats,lm.fit) 8 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | # fullRankMatrix (development version) 2 | 3 | * fix typos 4 | 5 | # fullRankMatrix 0.1.0 6 | 7 | * Initial CRAN submission. 8 | -------------------------------------------------------------------------------- /R/find_connected_components.R: -------------------------------------------------------------------------------- 1 | #' Find connected components in a graph 2 | #' 3 | #' The function performs a depth-first search to find all connected components. 4 | #' 5 | #' @param connections a list where each element is a vector with connected nodes. 
6 | #' Each node must be either a character or an integer. 7 | #' 8 | #' @return a list where each element is a set of connected items. 9 | #' @export 10 | #' @examples 11 | #' find_connected_components(list(c(1,2), c(1,3), c(4,5))) 12 | #' 13 | #' 14 | find_connected_components <- function(connections){ 15 | stopifnot(is.list(connections)) 16 | is_char <- vapply(connections, \(con) is.character(con), FUN.VALUE = logical(1L)) 17 | is_int <- vapply(connections, \(con) is.integer(con) || (is.numeric(con) && all(con < 2^31) && all(con %% 1 == 0)), FUN.VALUE = logical(1L)) 18 | if(all(is_char)){ 19 | # Do nothing 20 | }else if(all(is_int)){ 21 | connections <- lapply(connections, \(con) as.character(as.integer(con))) 22 | }else{ 23 | stop("Elements in 'connections' must be either characters or integers.") 24 | } 25 | nodes <- unique(unlist(connections, use.names = FALSE)) 26 | 27 | # Keep track which nodes I have visited 28 | visited <- new.env(parent = emptyenv(), size = length(nodes)) 29 | for(n in nodes){ 30 | visited[[n]] <- FALSE 31 | } 32 | 33 | # Efficient access to neighbors of each node 34 | connection_graph <- new.env(parent = emptyenv(), size = length(nodes)) 35 | for(con in connections){ 36 | for(n in con){ 37 | connection_graph[[n]] <- union(connection_graph[[n]], con) 38 | } 39 | } 40 | 41 | # Depth-first search 42 | dfs <- function(graph, head){ 43 | if(is.null(head)){ 44 | return(character(0L)) 45 | } 46 | queue <- NULL 47 | visited[[head]] <- TRUE 48 | for(n in connection_graph[[head]]){ 49 | if(! visited[[n]]){ 50 | queue <- list(head = n, tail = queue) 51 | } 52 | } 53 | 54 | res <- character(length(nodes)) 55 | res[1] <- head 56 | counter <- 2 57 | while(! is.null(queue)){ 58 | head <- queue$head 59 | queue <- queue$tail 60 | if(visited[[head]]){ 61 | }else{ 62 | res[counter] <- head 63 | counter <- counter + 1 64 | visited[[head]] <- TRUE 65 | for(n in connection_graph[[head]]){ 66 | if(! 
visited[[n]]){ 67 | queue <- list(head = n, tail = queue) 68 | } 69 | } 70 | } 71 | } 72 | sort(res[seq_len(counter-1)]) 73 | } 74 | 75 | result <- replicate(length(nodes), NULL) 76 | counter <- 1 77 | for(n in nodes){ 78 | if(! visited[[n]]){ 79 | result[[counter]] <- dfs(connection_graph, n) 80 | counter <- counter + 1 81 | } 82 | } 83 | if(all(is_char)){ 84 | result[seq_len(counter-1)] 85 | }else{ 86 | lapply(result[seq_len(counter-1)], as.integer) 87 | } 88 | } 89 | 90 | -------------------------------------------------------------------------------- /R/find_linear_dependent_columns.R: -------------------------------------------------------------------------------- 1 | 2 | #' Find linear dependent columns in a design matrix 3 | #' 4 | #' @importFrom stats lm.fit 5 | #' 6 | #' @param mat a matrix 7 | #' @param tol a double that specifies the numeric tolerance 8 | #' 9 | #' @return a list with vectors containing the indices of linearly dependent columns 10 | #' 11 | #' @seealso 12 | #' The algorithm and function is inspired by the `internalEnumLC` 13 | #' function in the 'caret' package ([GitHub](https://github.com/topepo/caret/blob/679eabaac7e54f4e87efa6c3bff75659cb457d8b/pkg/caret/R/findLinearCombos.R#L33)) 14 | #' 15 | #' @examples 16 | #' mat <- matrix(rnorm(3 * 10), nrow = 10, ncol = 3) 17 | #' mat <- cbind(mat, mat[,1] + 0.5 * mat[,3]) 18 | #' find_linear_dependent_columns(mat) # returns list(c(1,3,4)) 19 | #' 20 | #' @export 21 | find_linear_dependent_columns <- function(mat, tol = 1e-12){ 22 | stopifnot(is.matrix(mat)) 23 | stopifnot(is.numeric(tol), length(tol) == 1) 24 | qr_mat <- qr(mat) 25 | mat_rank <- qr_mat$rank 26 | 27 | if(mat_rank == ncol(mat)){ 28 | # The matrix is full rank, so return immediately 29 | list() 30 | }else{ 31 | # The QR decomposition arranges the values such that the first `#rank` columns are independent 32 | independent_columns <- seq_len(mat_rank) 33 | dependent_columns <- mat_rank + seq_len(ncol(mat) - mat_rank) 34 | # Solving 
`Bad_Matrix = Good_Matrix %*% coef` 35 | # Any place where `coef` is non-zero, we have some linear dependency 36 | coef <- lm.fit(qr.R(qr_mat)[independent_columns,independent_columns,drop=FALSE], qr.R(qr_mat)[independent_columns,dependent_columns,drop=FALSE])$coefficients 37 | coef <- matrix(coef, ncol = length(dependent_columns)) 38 | # The pivot relates columns in the QR decomposition object to the columns in the original matrix 39 | # The `sort` makes sure that the result looks good 40 | lindep_sets <- lapply(seq_along(dependent_columns), function(i) sort(c(qr_mat$pivot[mat_rank + i], qr_mat$pivot[which(abs(coef[,i]) > tol)]))) 41 | 42 | connected_lindep_components <- find_connected_components(lindep_sets) 43 | connected_lindep_components 44 | } 45 | } 46 | 47 | 48 | -------------------------------------------------------------------------------- /R/make_full_rank_matrix.R: -------------------------------------------------------------------------------- 1 | #' Create a full rank matrix 2 | #' 3 | #' First remove empty columns. Then discover linearly dependent columns. For each set of linearly dependent columns, create orthogonal vectors that span the space. Add these vectors as columns to the final matrix to replace the linearly dependent columns. 4 | #' 5 | #' @param mat A matrix. 6 | #' @param verbose Print how column numbers change with each operation. 7 | #' 8 | #' @return a list containing: 9 | #' * `matrix`: A matrix of full rank. Column headers will be renamed to reflect how columns depend on each other. 10 | #' * `(c1_AND_c2)` If multiple columns are exactly identical, only a single instance is retained. 11 | #' * `SPACE_<i>_AXIS<j>` For each set of linearly dependent columns, a space `i` with `max(j)` dimensions was created using orthogonal axes to replace the original columns. 12 | #' * `space_list`: A named list where each element corresponds to a space and contains the names of the original linearly dependent columns that are contained within that space. 
13 | #' 14 | #' @export 15 | #' 16 | #' @examples 17 | #' # Create a 1-hot encoded (zero/one) matrix 18 | #' c1 <- rbinom(10, 1, .4) 19 | #' c2 <- 1-c1 20 | #' c3 <- integer(10) 21 | #' c4 <- c1 22 | #' c5 <- 2*c2 23 | #' c6 <- rbinom(10, 1, .8) 24 | #' c7 <- c5+c6 25 | #' # Turn into matrix 26 | #' mat <- cbind(c1, c2, c3, c4, c5, c6, c7) 27 | #' # Turn the matrix into full rank, this will: 28 | #' # 1. remove empty columns (all zero) 29 | #' # 2. merge columns with the same entries (duplicates) 30 | #' # 3. identify linearly dependent columns 31 | #' # 4. replace them with orthogonal vectors that span the same space 32 | #' result <- make_full_rank_matrix(mat) 33 | #' # verbose=TRUE will give details on how many columns are removed in every step 34 | #' result <- make_full_rank_matrix(mat, verbose=TRUE) 35 | #' # look at the created full rank matrix 36 | #' mat_full <- result$matrix 37 | #' # check which linearly dependent columns spanned the identified spaces 38 | #' spaces <- result$space_list 39 | 40 | make_full_rank_matrix <- function(mat, verbose=FALSE){ 41 | 42 | if (!is.matrix(mat)) { 43 | stop("The input is not a matrix.") 44 | } 45 | if (any(is.na(mat))) { 46 | stop("Error: The matrix contains NA values.") 47 | } 48 | 49 | validate_column_names(colnames(mat)) 50 | if (verbose){ 51 | message(sprintf("The original matrix contains %i rows and %i columns. The matrix has rank %i.", 52 | dim(mat)[1], dim(mat)[2], qr(mat)$rank)) 53 | } 54 | mat_mod <- remove_empty_columns(mat, verbose=verbose) 55 | mat_mod <- merge_duplicated(mat_mod, verbose=verbose) 56 | result <- collapse_linearly_dependent_columns(mat_mod, verbose=verbose) 57 | mat_mod <- result$matrix 58 | 59 | if (ncol(mat_mod) > qr(mat)$rank){ 60 | stop("The modified matrix still has more columns than implied by rank. 
Check manually why modified matrix is not full rank after applying make_full_rank_matrix().") 61 | } 62 | if (verbose){ 63 | message("The matrix is now full rank.") 64 | } 65 | return(result) 66 | } 67 | 68 | find_empty_columns <- function(mat, tol = 1e-12, return_names=FALSE){ 69 | empty_col <- apply(mat, MARGIN = 2, FUN = function(x) {all(abs(x) < tol)}) 70 | if (return_names){ 71 | names(which(empty_col)) 72 | }else{ 73 | empty_col 74 | } 75 | } 76 | 77 | remove_empty_columns <- function(mat, tol = 1e-12, verbose=FALSE) { 78 | empty_col <- find_empty_columns(mat, tol) 79 | if (any(empty_col)){ 80 | mat <- mat[, !empty_col, drop=FALSE] 81 | } 82 | if (verbose){ 83 | message(sprintf("%i empty columns were removed. After removing empty columns the matrix contains %i columns.", 84 | sum(empty_col, na.rm = TRUE), ncol(mat))) 85 | 86 | } 87 | return(mat) 88 | } 89 | 90 | find_duplicated_columns <- function(mat, verbose=FALSE) { 91 | stopifnot(is.matrix(mat)) 92 | mat_duplicated <- mat[, duplicated(mat, MARGIN = 2), drop=FALSE] 93 | colnames(mat_duplicated) 94 | } 95 | 96 | merge_duplicated <- function(mat, tol = 1e-12, verbose=FALSE) { 97 | stopifnot(is.matrix(mat)) 98 | if (any(is.na(mat))) { 99 | stop("Error: The matrix contains NA values.") 100 | } 101 | 102 | mat_unique <- unique(mat, MARGIN = 2) 103 | colnames_unique <- colnames(mat_unique) 104 | colnames_duplicated <- find_duplicated_columns(mat) 105 | if (length(colnames_duplicated) > 0){ 106 | for (c in seq_along(colnames_duplicated)){ 107 | keep <- which(duplicated(cbind(mat[, colnames_duplicated[c], drop=FALSE], mat_unique), MARGIN = 2))-1 108 | if (length(keep) > 1){ 109 | stop("More than one matching column detected. 
Something is wrong with this algorithm.") 110 | } 111 | if (grepl("_AND_", colnames_unique[keep], fixed=TRUE)){ 112 | colnames_unique[keep] <- gsub('[()]', '', colnames_unique[keep]) 113 | } 114 | colnames_unique[keep] <- paste0("(", colnames_unique[keep], "_AND_", colnames_duplicated[c], ")") 115 | } 116 | colnames(mat_unique) <- colnames_unique 117 | mat <- mat_unique 118 | } 119 | if (verbose){ 120 | message(sprintf("%i duplicated columns were detected. After merging duplicated columns the matrix contains %i columns.", 121 | length(colnames_duplicated), ncol(mat))) 122 | } 123 | return(mat) 124 | } 125 | 126 | collapse_linearly_dependent_columns <- function(mat, tol = 1e-12, verbose = FALSE){ 127 | stopifnot(is.matrix(mat)) 128 | if (any(is.na(mat))) { 129 | stop("Error: The matrix contains NA values.") 130 | } 131 | validate_column_names(colnames(mat)) 132 | 133 | linear_dependencies <- find_linear_dependent_columns(mat, tol = tol) 134 | 135 | space_counter <- 1 136 | space_list <- list() 137 | 138 | while(length(linear_dependencies) > 0){ 139 | dependent_set <- linear_dependencies[[1]] 140 | dependent_columns <- mat[,dependent_set,drop=FALSE] 141 | mat <- mat[,-dependent_set,drop=FALSE] 142 | qr_space <- qr(dependent_columns) 143 | rank_of_set <- qr_space$rank 144 | new_space <- qr.Q(qr_space)[,seq_len(rank_of_set),drop=FALSE] 145 | 146 | # Handle names 147 | # if many linear dependencies exist, adding the original column names to the new column names might get prohibitively large 148 | # instead label each new space by a number and save which original columns it was composed of in `space_list` 149 | space_name <- paste0("SPACE_", space_counter) 150 | space_list[[space_name]] <- colnames(dependent_columns) 151 | 152 | new_names <- paste0("SPACE_", space_counter, "_AXIS", seq_len(rank_of_set)) 153 | 154 | colnames(new_space) <- new_names 155 | mat <- cbind(mat, new_space) 156 | 157 | # Changing the matrix could introduce new dependencies 158 | 
linear_dependencies <- find_linear_dependent_columns(mat, tol = tol) 159 | space_counter <- space_counter + 1 160 | } 161 | if (verbose){ 162 | message(sprintf("The matrix after collapsing linearly dependent columns contains %i rows and %i columns.", 163 | nrow(mat), ncol(mat))) 164 | } 165 | list(matrix = mat, space_list = space_list) 166 | } 167 | 168 | -------------------------------------------------------------------------------- /R/validate_column_names.R: -------------------------------------------------------------------------------- 1 | #' Validate Column Names 2 | #' 3 | #' This function checks a vector of column names to ensure they are valid. It performs the following checks: 4 | #' - The column names must not be `NULL`. 5 | #' - The column names must not contain empty strings. 6 | #' - The column names must not contain `NA` values. 7 | #' - The column names must be unique. 8 | #' 9 | #' @param names A character vector of column names to validate. 10 | #' 11 | #' @return Returns `TRUE` if all checks pass. If any check fails, the function stops and returns an error message. 12 | #' @export 13 | #' 14 | #' @examples 15 | #' validate_column_names(c("name", "age", "gender")) 16 | #' 17 | validate_column_names <- function(names){ 18 | if(is.null(names)){ 19 | stop("The column names must not be `NULL`.") 20 | } 21 | if(any(grepl("^$", names))){ 22 | stop("The empty string `\"\"` is not a valid column name. 
For example, name at index: ", which(grepl("^$", names))[1], " is empty.") 23 | } 24 | if(any(is.na(names))){ 25 | stop("None of the names must be `NA`.") 26 | } 27 | if(any(duplicated(names))){ 28 | stop("The column names must be unique.") 29 | } 30 | } 31 | 32 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: github_document 3 | --- 4 | 5 | 6 | 7 | ```{r, include = FALSE} 8 | knitr::opts_chunk$set( 9 | collapse = TRUE, 10 | comment = "#>", 11 | fig.path = "man/figures/README-", 12 | out.width = "100%" 13 | ) 14 | ``` 15 | 16 | # fullRankMatrix - Generation of Full Rank Design Matrix 17 | 18 | ```{r child = "vignettes/fullrankmat-example.Rmd"} 19 | 20 | ``` 21 | 22 | ```{r child = "vignettes/fullrankmat-comparison.Rmd"} 23 | 24 | ``` 25 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # fullRankMatrix - Generation of Full Rank Design Matrix 5 | 6 | 7 | 8 | 9 | 10 | 11 | We developed `fullRankMatrix` primarily for one-hot encoded design 12 | matrices used in linear models. In our case, we were faced with a 1-hot 13 | encoded design matrix, that had a lot of linearly dependent columns. 14 | This happened when modeling a lot of interaction terms. Since fitting a 15 | linear model on a design matrix with linearly dependent columns will 16 | produce results that can lead to misleading interpretation (s. example 17 | below), we decided to develop a package that will help with identifying 18 | linearly dependent columns and replacing them with columns constructed 19 | of orthogonal vectors that span the space of the previously linearly 20 | dependent columns. 
21 | 22 | The goal of `fullRankMatrix` is to remove empty columns (contain only 23 | 0s), merge duplicated columns (containing the same entries) and merge 24 | linearly dependent columns. These operations will create a matrix of 25 | full rank. The changes made to the columns are reflected in the column 26 | headers such that the columns can still be interpreted if the matrix is 27 | used in e.g. a linear model fit. 28 | 29 | ## Installation 30 | 31 | You can install `fullRankMatrix` directly from CRAN. Just paste the 32 | following snippet into your R console: 33 | 34 | ``` r 35 | install.packages("fullRankMatrix") 36 | ``` 37 | 38 | You can install the development version of `fullRankMatrix` from 39 | [GitHub](https://github.com/Pweidemueller/fullRankMatrix) with: 40 | 41 | ``` r 42 | # install.packages("devtools") 43 | devtools::install_github("Pweidemueller/fullRankMatrix") 44 | ``` 45 | 46 | ## Citation 47 | 48 | If you want to cite this package in a publication, you can run the 49 | following command in your R console: 50 | 51 | ``` r 52 | citation("fullRankMatrix") 53 | #> To cite package 'fullRankMatrix' in publications use: 54 | #> 55 | #> Weidemüller P, Ahlmann-Eltze C (2024). _fullRankMatrix: Produce a 56 | #> Full Rank Matrix_. R package version 0.1.0, 57 | #> . 58 | #> 59 | #> A BibTeX entry for LaTeX users is 60 | #> 61 | #> @Manual{, 62 | #> title = {fullRankMatrix: Produce a Full Rank Matrix}, 63 | #> author = {Paula Weidemüller and Constantin Ahlmann-Eltze}, 64 | #> year = {2024}, 65 | #> note = {R package version 0.1.0}, 66 | #> url = {https://github.com/Pweidemueller/fullRankMatrix}, 67 | #> } 68 | ``` 69 | 70 | ## Linearly dependent columns span a space of a certain dimension 71 | 72 | In order to visualize it, let’s look at a very simple example. Say we 73 | have a matrix with three columns, each with three entries. These columns 74 | can be visualized as vectors in a coordinate system with 3 axes (s. 75 | image). 
The first vector points into the plane spanned by the first and 76 | third axis. The second and third vectors lie in the plane spanned by the 77 | first and second axis. Since this is a very simple example, we 78 | immediately spot that the third column is a multiple of the second 79 | column. Their corresponding vectors lie perfectly on top of each other. 80 | This means instead of the two columns spanning a 2D space they just 81 | occupy a line, i.e. a 1D space. This is identified by `fullRankMatrix`, 82 | which replaces these two linearly dependent columns with one vector that 83 | describes the 1D space in which column 2 and column 3 used to lie. The 84 | resulting matrix is now full rank with no linearly dependent columns. 85 | 86 | ``` r 87 | library(fullRankMatrix) 88 | ``` 89 | 90 | ``` r 91 | c1 <- c(1, 0, 1) 92 | c2 <- c(1, 2, 0) 93 | c3 <- c(2, 4, 0) 94 | 95 | mat <- cbind(c1, c2, c3) 96 | 97 | make_full_rank_matrix(mat) 98 | #> $matrix 99 | #> c1 SPACE_1_AXIS1 100 | #> [1,] 1 -0.4472136 101 | #> [2,] 0 -0.8944272 102 | #> [3,] 1 0.0000000 103 | #> 104 | #> $space_list 105 | #> $space_list$SPACE_1 106 | #> [1] "c2" "c3" 107 | ``` 108 | 109 | ``` r 110 | knitr::include_graphics("man/figures/example_vectors.png") 111 | ``` 112 | 113 |
114 | 115 | Figure (`man/figures/example_vectors.png`): Visualisation of identifying and replacing linearly dependent columns. 116 |
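The three-column example above can also be checked by hand. The following is a minimal sketch that uses only base R (no `fullRankMatrix` functions); the names `axis1` and `mat_fr` are introduced here purely for illustration and are not part of the package.

```r
# Columns from the example above
c1 <- c(1, 0, 1)
c2 <- c(1, 2, 0)
c3 <- c(2, 4, 0)
mat <- cbind(c1, c2, c3)

# Three columns, but the rank is lower: one column is redundant
ncol(mat)
qr(mat)$rank

# c3 is an exact multiple of c2, so both lie on the same line (a 1D space)
all.equal(c3, 2 * c2)

# A unit vector along c2 spans that line; up to sign it should agree
# with the SPACE_1_AXIS1 column printed above
axis1 <- c2 / sqrt(sum(c2^2))
axis1

# Replacing c2 and c3 by this single axis leaves a full rank matrix
mat_fr <- cbind(c1, SPACE_1_AXIS1 = axis1)
qr(mat_fr)$rank == ncol(mat_fr)
```

The sign of the axis is arbitrary: a QR decomposition may return the vector pointing in the opposite direction, which spans the same line.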
121 | 122 | ## Worked through example 123 | 124 | Above was a rather abstract example that was easy to visualize, let’s 125 | now walk through the utilities of `fullRankMatrix` when applied to a 126 | more realistic design matrix. 127 | 128 | When using linear models you should check if any of the columns in your 129 | design matrix are linearly dependent. If there are, this will alter the 130 | interpretation of the fit. Here is a rather constructed example where we 131 | are interested in identifying which ingredients contribute mostly to the 132 | sweetness of fruit salads. 133 | 134 | ``` r 135 | # let's say we have 10 fruit salads and indicate which ingredients are present in each salad 136 | strawberry <- c(1,1,1,1,0,0,0,0,0,0) 137 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0) 138 | orange <- c(1,1,1,1,1,1,1,0,0,0) 139 | pear <- c(0,0,0,1,0,0,0,1,1,1) 140 | mint <- c(1,1,0,0,0,0,0,0,0,0) 141 | apple <- c(0,0,0,0,0,0,1,1,1,1) 142 | 143 | # let's pretend we know how each fruit influences the sweetness of a fruit salad 144 | # in this case we say that strawberries and oranges have the biggest influence on sweetness 145 | set.seed(30) 146 | strawberry_sweet <- strawberry * rnorm(10, 4) 147 | poppyseed_sweet <- poppyseed * rnorm(10, 0.1) 148 | orange_sweet <- orange * rnorm(10, 5) 149 | pear_sweet <- pear * rnorm(10, 0.5) 150 | mint_sweet <- mint * rnorm(10, 1) 151 | apple_sweet <- apple * rnorm(10, 2) 152 | 153 | sweetness <- strawberry_sweet + poppyseed_sweet+ orange_sweet + pear_sweet + 154 | mint_sweet + apple_sweet 155 | 156 | mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple) 157 | 158 | fit <- lm(sweetness ~ mat + 0) 159 | print(summary(fit)) 160 | #> 161 | #> Call: 162 | #> lm(formula = sweetness ~ mat + 0) 163 | #> 164 | #> Residuals: 165 | #> 1 2 3 4 5 6 7 8 166 | #> -2.00934 2.00934 -1.34248 1.34248 0.92807 -2.27054 1.34248 -0.01963 167 | #> 9 10 168 | #> 1.26385 -2.58670 169 | #> 170 | #> Coefficients: (1 not defined because of singularities) 171 | #> 
Estimate Std. Error t value Pr(>|t|) 172 | #> matstrawberry 8.9087 2.0267 4.396 0.00705 ** 173 | #> matpoppyseed 6.5427 1.5544 4.209 0.00842 ** 174 | #> matorange NA NA NA NA 175 | #> matpear 1.2800 2.3056 0.555 0.60269 176 | #> matmint 0.6582 2.6242 0.251 0.81193 177 | #> matapple 1.2595 2.2526 0.559 0.60019 178 | #> --- 179 | #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 180 | #> 181 | #> Residual standard error: 2.357 on 5 degrees of freedom 182 | #> Multiple R-squared: 0.9504, Adjusted R-squared: 0.9007 183 | #> F-statistic: 19.15 on 5 and 5 DF, p-value: 0.002824 184 | ``` 185 | 186 | As you can see `lm` realizes that “1 \[column\] not defined because of 187 | singularities” (`matorange` is not defined) but it doesn’t indicate what 188 | columns it is linearly dependent with. 189 | 190 | So if you would just look at the columns and not consider the `NA` 191 | further, you would interpret that `strawberry` and `poppyseed` are the 192 | biggest contributors to the sweetness of fruit salads. 193 | 194 | However, when you look at the model matrix you can see that the `orange` 195 | column is a linear combination of the `strawberry` and `poppyseed` 196 | columns (or vice versa). So truly any of the three factors could 197 | contribute to the sweetness of a fruit salad, the linear model has no 198 | way of recovering which one given these 10 examples. And since we 199 | constructed this example we know that `orange` and `strawberry` are the 200 | sweetest and `poppyseed` contributes least to the sweetness. 
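The dependency described above can be confirmed with a few lines of base R. This is a minimal sketch, not part of the package; the helper name `full_rank_after_drop` is made up for this illustration.

```r
# Ingredient indicators from the fruit salad example
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed  <- c(0,0,0,0,1,1,1,0,0,0)
orange     <- c(1,1,1,1,1,1,1,0,0,0)
pear       <- c(0,0,0,1,0,0,0,1,1,1)
mint       <- c(1,1,0,0,0,0,0,0,0,0)
apple      <- c(0,0,0,0,0,0,1,1,1,1)
mat <- cbind(strawberry, poppyseed, orange, pear, mint, apple)

# orange is exactly the sum of strawberry and poppyseed
all.equal(orange, strawberry + poppyseed)

# Six columns, but the rank is one lower
ncol(mat)
qr(mat)$rank

# Dropping any ONE of the three dependent columns restores full rank,
# which is why the fit alone cannot tell which of them matters
full_rank_after_drop <- vapply(
  c("strawberry", "poppyseed", "orange"),
  function(drop) {
    reduced <- mat[, colnames(mat) != drop, drop = FALSE]
    qr(reduced)$rank == ncol(reduced)
  },
  FUN.VALUE = logical(1L)
)
full_rank_after_drop
```

Which column `lm` reports as `NA` depends only on column order, so the choice carries no information about the ingredients themselves.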
201 | 202 | ``` r 203 | mat 204 | #> strawberry poppyseed orange pear mint apple 205 | #> [1,] 1 0 1 0 1 0 206 | #> [2,] 1 0 1 0 1 0 207 | #> [3,] 1 0 1 0 0 0 208 | #> [4,] 1 0 1 1 0 0 209 | #> [5,] 0 1 1 0 0 0 210 | #> [6,] 0 1 1 0 0 0 211 | #> [7,] 0 1 1 0 0 1 212 | #> [8,] 0 0 0 1 0 1 213 | #> [9,] 0 0 0 1 0 1 214 | #> [10,] 0 0 0 1 0 1 215 | ``` 216 | 217 | To make such cases more obvious and to be able to still correctly 218 | interpret the linear model fit, we wrote `fullRankMatrix`. It removes 219 | linearly dependent columns and renames the remaining columns to make the 220 | dependencies clear using the `make_full_rank_matrix()` function. 221 | 222 | ``` r 223 | library(fullRankMatrix) 224 | result <- make_full_rank_matrix(mat) 225 | mat_fr <- result$matrix 226 | space_list <- result$space_list 227 | mat_fr 228 | #> pear mint apple SPACE_1_AXIS1 SPACE_1_AXIS2 229 | #> [1,] 0 1 0 -0.5 0.0000000 230 | #> [2,] 0 1 0 -0.5 0.0000000 231 | #> [3,] 0 0 0 -0.5 0.0000000 232 | #> [4,] 1 0 0 -0.5 0.0000000 233 | #> [5,] 0 0 0 0.0 -0.5773503 234 | #> [6,] 0 0 0 0.0 -0.5773503 235 | #> [7,] 0 0 1 0.0 -0.5773503 236 | #> [8,] 1 0 1 0.0 0.0000000 237 | #> [9,] 1 0 1 0.0 0.0000000 238 | #> [10,] 1 0 1 0.0 0.0000000 239 | ``` 240 | 241 | ``` r 242 | fit <- lm(sweetness ~ mat_fr + 0) 243 | print(summary(fit)) 244 | #> 245 | #> Call: 246 | #> lm(formula = sweetness ~ mat_fr + 0) 247 | #> 248 | #> Residuals: 249 | #> 1 2 3 4 5 6 7 8 250 | #> -2.00934 2.00934 -1.34248 1.34248 0.92807 -2.27054 1.34248 -0.01963 251 | #> 9 10 252 | #> 1.26385 -2.58670 253 | #> 254 | #> Coefficients: 255 | #> Estimate Std. Error t value Pr(>|t|) 256 | #> mat_frpear 1.2800 2.3056 0.555 0.60269 257 | #> mat_frmint 0.6582 2.6242 0.251 0.81193 258 | #> mat_frapple 1.2595 2.2526 0.559 0.60019 259 | #> mat_frSPACE_1_AXIS1 -17.8174 4.0535 -4.396 0.00705 ** 260 | #> mat_frSPACE_1_AXIS2 -11.3322 2.6924 -4.209 0.00842 ** 261 | #> --- 262 | #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 263 | #> 264 | #> Residual standard error: 2.357 on 5 degrees of freedom 265 | #> Multiple R-squared: 0.9504, Adjusted R-squared: 0.9007 266 | #> F-statistic: 19.15 on 5 and 5 DF, p-value: 0.002824 267 | ``` 268 | 269 | You can see that there are no more undefined columns. The columns 270 | `strawberry`, `orange` and `poppyseed` were removed and replaced with 271 | two columns (`SPACE_1_AXIS1`, `SPACE_1_AXIS2`) that are linearly 272 | independent (orthogonal) vectors that span the space in which the 273 | linearly dependent columns `strawberry`, `orange` and `poppyseed` lay. 274 | 275 | The original columns that are contained within a space can be viewed in 276 | the returned `space_list`: 277 | 278 | ``` r 279 | space_list 280 | #> $SPACE_1 281 | #> [1] "strawberry" "poppyseed" "orange" 282 | ``` 283 | 284 | The individual axes of the constructed spaces 285 | are difficult to interpret, but we see that the axes of the space of 286 | `strawberry`, `orange` and `poppyseed` show a significant association 287 | with the sweetness of fruit salads. Resolving which of the 288 | three terms is most strongly associated with `sweetness` is not possible 289 | with the given observations, but there is definitely an 290 | association of `sweetness` with the space spanned by the three terms. 291 | 292 | If only a subset of all axes of a space shows a significant association 293 | in the linear model fit, this could indicate that only a subset of the 294 | linearly dependent columns that lie within the space spanned by the 295 | significantly associated axes drives this association. This would require 296 | a more detailed investigation by the user that would be specific to 297 | the use case. 298 | 299 | ## Other available packages that detect linearly dependent columns 300 | 301 | There are already a few other packages out there that offer functions to 302 | detect linearly dependent columns. 
Here are the ones we are aware of: 303 | 304 | ``` r 305 | library(fullRankMatrix) 306 | 307 | # let's say we have 10 fruit salads and indicate which ingredients are present in each salad 308 | strawberry <- c(1,1,1,1,0,0,0,0,0,0) 309 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0) 310 | orange <- c(1,1,1,1,1,1,1,0,0,0) 311 | pear <- c(0,0,0,1,0,0,0,1,1,1) 312 | mint <- c(1,1,0,0,0,0,0,0,0,0) 313 | apple <- c(0,0,0,0,0,0,1,1,1,1) 314 | 315 | # let's pretend we know how each fruit influences the sweetness of a fruit salad 316 | # in this case we say that strawberries and oranges have the biggest influence on sweetness 317 | set.seed(30) 318 | strawberry_sweet <- strawberry * rnorm(10, 4) 319 | poppyseed_sweet <- poppyseed * rnorm(10, 0.1) 320 | orange_sweet <- orange * rnorm(10, 5) 321 | pear_sweet <- pear * rnorm(10, 0.5) 322 | mint_sweet <- mint * rnorm(10, 1) 323 | apple_sweet <- apple * rnorm(10, 2) 324 | 325 | sweetness <- strawberry_sweet + poppyseed_sweet+ orange_sweet + pear_sweet + 326 | mint_sweet + apple_sweet 327 | 328 | mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple) 329 | ``` 330 | 331 | **`caret::findLinearCombos()`**: 332 | 333 | 334 | This function identifies which columns are linearly dependent and 335 | suggests which columns to remove. But it doesn’t provide appropriate 336 | naming for the remaining columns to indicate that any significant 337 | associations with the remaining columns are actually associations with 338 | the space spanned by the originally linearly dependent columns. Just 339 | removing the indicated columns and then fitting the linear model would 340 | lead to erroneous interpretation. 
341 | 342 | ``` r 343 | caret_result <- caret::findLinearCombos(mat) 344 | ``` 345 | 346 | Fitting a linear model with the `orange` column removed would lead to 347 | the erroneous interpretation that `strawberry` and `poppyseed` have the 348 | biggest influence on the fruit salad `sweetness`, but we know it is 349 | actually `strawberry` and `orange`. 350 | 351 | ``` r 352 | mat_caret <- mat[, -caret_result$remove] 353 | fit <- lm(sweetness ~ mat_caret + 0) 354 | print(summary(fit)) 355 | #> 356 | #> Call: 357 | #> lm(formula = sweetness ~ mat_caret + 0) 358 | #> 359 | #> Residuals: 360 | #> 1 2 3 4 5 6 7 8 361 | #> -2.00934 2.00934 -1.34248 1.34248 0.92807 -2.27054 1.34248 -0.01963 362 | #> 9 10 363 | #> 1.26385 -2.58670 364 | #> 365 | #> Coefficients: 366 | #> Estimate Std. Error t value Pr(>|t|) 367 | #> mat_caretstrawberry 8.9087 2.0267 4.396 0.00705 ** 368 | #> mat_caretpoppyseed 6.5427 1.5544 4.209 0.00842 ** 369 | #> mat_caretpear 1.2800 2.3056 0.555 0.60269 370 | #> mat_caretmint 0.6582 2.6242 0.251 0.81193 371 | #> mat_caretapple 1.2595 2.2526 0.559 0.60019 372 | #> --- 373 | #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 374 | #> 375 | #> Residual standard error: 2.357 on 5 degrees of freedom 376 | #> Multiple R-squared: 0.9504, Adjusted R-squared: 0.9007 377 | #> F-statistic: 19.15 on 5 and 5 DF, p-value: 0.002824 378 | ``` 379 | 380 | **`WeightIt::make_full_rank()`**: 381 | 382 | 383 | This function removes some of the linearly dependent columns to create a 384 | full rank matrix, but doesn’t rename the remaining columns accordingly. 385 | For the user it isn’t clear which columns were linearly dependent and 386 | they can’t choose which column will be removed. 
387 | 388 | ``` r 389 | mat_weightit <- WeightIt::make_full_rank(mat, with.intercept = FALSE) 390 | mat_weightit 391 | #> strawberry poppyseed pear mint apple 392 | #> [1,] 1 0 0 1 0 393 | #> [2,] 1 0 0 1 0 394 | #> [3,] 1 0 0 0 0 395 | #> [4,] 1 0 1 0 0 396 | #> [5,] 0 1 0 0 0 397 | #> [6,] 0 1 0 0 0 398 | #> [7,] 0 1 0 0 1 399 | #> [8,] 0 0 1 0 1 400 | #> [9,] 0 0 1 0 1 401 | #> [10,] 0 0 1 0 1 402 | ``` 403 | 404 | As above fitting a linear model with this full rank matrix would lead to 405 | erroneous interpretation that `strawberry` and `poppyseed` influence the 406 | `sweetness`, but we know it is actually `strawberry` and `orange`. 407 | 408 | ``` r 409 | fit <- lm(sweetness ~ mat_weightit + 0) 410 | print(summary(fit)) 411 | #> 412 | #> Call: 413 | #> lm(formula = sweetness ~ mat_weightit + 0) 414 | #> 415 | #> Residuals: 416 | #> 1 2 3 4 5 6 7 8 417 | #> -2.00934 2.00934 -1.34248 1.34248 0.92807 -2.27054 1.34248 -0.01963 418 | #> 9 10 419 | #> 1.26385 -2.58670 420 | #> 421 | #> Coefficients: 422 | #> Estimate Std. Error t value Pr(>|t|) 423 | #> mat_weightitstrawberry 8.9087 2.0267 4.396 0.00705 ** 424 | #> mat_weightitpoppyseed 6.5427 1.5544 4.209 0.00842 ** 425 | #> mat_weightitpear 1.2800 2.3056 0.555 0.60269 426 | #> mat_weightitmint 0.6582 2.6242 0.251 0.81193 427 | #> mat_weightitapple 1.2595 2.2526 0.559 0.60019 428 | #> --- 429 | #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 430 | #> 431 | #> Residual standard error: 2.357 on 5 degrees of freedom 432 | #> Multiple R-squared: 0.9504, Adjusted R-squared: 0.9007 433 | #> F-statistic: 19.15 on 5 and 5 DF, p-value: 0.002824 434 | ``` 435 | 436 | **`plm::detect.lindep()`:** 437 | 438 | 439 | The function returns which columns are potentially linearly dependent. 
440 | 441 | ``` r 442 | plm::detect.lindep(mat) 443 | #> [1] "Suspicious column number(s): 1, 2, 3" 444 | #> [1] "Suspicious column name(s): strawberry, poppyseed, orange" 445 | ``` 446 | 447 | However, it doesn’t capture all cases. For example, here 448 | `plm::detect.lindep()` says there are no dependent columns, while there 449 | are several: 450 | 451 | ``` r 452 | c1 <- rbinom(10, 1, .4) 453 | c2 <- 1-c1 454 | c3 <- integer(10) 455 | c4 <- c1 456 | c5 <- 2*c2 457 | c6 <- rbinom(10, 1, .8) 458 | c7 <- c5+c6 459 | mat_test <- as.matrix(data.frame(c1,c2,c3,c4,c5,c6,c7)) 460 | 461 | plm::detect.lindep(mat_test) 462 | #> [1] "No linear dependent column(s) detected." 463 | ``` 464 | 465 | `fullRankMatrix` captures these cases: 466 | 467 | ``` r 468 | result <- make_full_rank_matrix(mat_test) 469 | result$matrix 470 | #> (c1_AND_c4) SPACE_1_AXIS1 SPACE_1_AXIS2 471 | #> [1,] 1 0.0000000 4.111431e-16 472 | #> [2,] 0 -0.4082483 -5.419613e-17 473 | #> [3,] 1 0.0000000 7.071068e-01 474 | #> [4,] 0 -0.4082483 1.083923e-17 475 | #> [5,] 1 0.0000000 7.071068e-01 476 | #> [6,] 0 -0.4082483 1.083923e-17 477 | #> [7,] 0 -0.4082483 1.083923e-17 478 | #> [8,] 0 -0.4082483 1.083923e-17 479 | #> [9,] 1 0.0000000 0.000000e+00 480 | #> [10,] 0 -0.4082483 1.083923e-17 481 | ``` 482 | 483 | **`Smisc::findDepMat()`**: 484 | 485 | 486 | **NOTE**: this package was removed from CRAN as of 2020-01-26 487 | (https://CRAN.R-project.org/package=Smisc) due to failing checks. 488 | 489 | This function indicates linearly dependent rows/columns, but it doesn’t 490 | state which rows/columns are linearly dependent with each other. 491 | 492 | However, this function seems to not work well for one-hot encoded 493 | matrices and the package doesn’t seem to be updated anymore (see this 494 | issue: https://github.com/pnnl/Smisc/issues/24). 
495 | 496 | # example provided by Smisc documentation 497 | Y <- matrix(c(1, 3, 4, 498 | 2, 6, 8, 499 | 7, 2, 9, 500 | 4, 1, 7, 501 | 3.5, 1, 4.5), byrow = TRUE, ncol = 3) 502 | Smisc::findDepMat(t(Y), rows = FALSE) 503 | 504 | Trying with the model matrix from our example above: 505 | 506 | Smisc::findDepMat(mat, rows=FALSE) 507 | #> Error in if (!depends[j]) { : missing value where TRUE/FALSE needed 508 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## R CMD check results 2 | 3 | 0 errors | 0 warnings | 1 note 4 | 5 | * This is a new release. 6 | -------------------------------------------------------------------------------- /inst/CITATION: -------------------------------------------------------------------------------- 1 | bibentry( 2 | bibtype = "Manual", 3 | title = "fullRankMatrix: Produce a Full Rank Matrix", 4 | author = c( 5 | person("Paula", "Weidemüller", role = c("aut", "cre"), 6 | email = "p.weidemueller@posteo.de", 7 | comment = c(ORCID = "0000-0003-3867-3131")), 8 | person("Constantin", "Ahlmann-Eltze", role = c("aut"), 9 | email = "abc@gmail.com") 10 | ), 11 | year = 2024, 12 | note = "R package version 0.1.0", 13 | url = "https://github.com/Pweidemueller/fullRankMatrix" 14 | ) 15 | -------------------------------------------------------------------------------- /inst/WORDLIST: -------------------------------------------------------------------------------- 1 | doi 2 | jss 3 | -------------------------------------------------------------------------------- /man/figures/example_vectors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Pweidemueller/fullRankMatrix/ad4bc2dc58642eb7e639ee3e3d79e5dca28f4c4c/man/figures/example_vectors.png -------------------------------------------------------------------------------- 
/man/figures/fullRankMatrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Pweidemueller/fullRankMatrix/ad4bc2dc58642eb7e639ee3e3d79e5dca28f4c4c/man/figures/fullRankMatrix.png -------------------------------------------------------------------------------- /man/find_connected_components.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/find_connected_components.R 3 | \name{find_connected_components} 4 | \alias{find_connected_components} 5 | \title{Find connected components in a graph} 6 | \usage{ 7 | find_connected_components(connections) 8 | } 9 | \arguments{ 10 | \item{connections}{a list where each element is a vector with connected nodes. 11 | Each node must be either a character or an integer.} 12 | } 13 | \value{ 14 | a list where each element is a set of connected items. 15 | } 16 | \description{ 17 | The function performs a depth-first search to find all connected components. 
18 | } 19 | \examples{ 20 | find_connected_components(list(c(1,2), c(1,3), c(4,5))) 21 | 22 | 23 | } 24 | -------------------------------------------------------------------------------- /man/find_linear_dependent_columns.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/find_linear_dependent_columns.R 3 | \name{find_linear_dependent_columns} 4 | \alias{find_linear_dependent_columns} 5 | \title{Find linear dependent columns in a design matrix} 6 | \usage{ 7 | find_linear_dependent_columns(mat, tol = 1e-12) 8 | } 9 | \arguments{ 10 | \item{mat}{a matrix} 11 | 12 | \item{tol}{a double that specifies the numeric tolerance} 13 | } 14 | \value{ 15 | a list with vectors containing the indices of linearly dependent columns 16 | } 17 | \description{ 18 | Find linear dependent columns in a design matrix 19 | } 20 | \examples{ 21 | mat <- matrix(rnorm(3 * 10), nrow = 10, ncol = 3) 22 | mat <- cbind(mat, mat[,1] + 0.5 * mat[,3]) 23 | find_linear_dependent_columns(mat) # returns list(c(1,3,4)) 24 | 25 | } 26 | \seealso{ 27 | The algorithm and function is inspired by the \code{internalEnumLC} 28 | function in the 'caret' package (\href{https://github.com/topepo/caret/blob/679eabaac7e54f4e87efa6c3bff75659cb457d8b/pkg/caret/R/findLinearCombos.R#L33}{GitHub}) 29 | } 30 | -------------------------------------------------------------------------------- /man/make_full_rank_matrix.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/make_full_rank_matrix.R 3 | \name{make_full_rank_matrix} 4 | \alias{make_full_rank_matrix} 5 | \title{Create a full rank matrix} 6 | \usage{ 7 | make_full_rank_matrix(mat, verbose = FALSE) 8 | } 9 | \arguments{ 10 | \item{mat}{A matrix.} 11 | 12 | \item{verbose}{Print how column numbers change with each operation.} 13 | } 
14 | \value{ 15 | a list containing: 16 | \itemize{ 17 | \item \code{matrix}: A matrix of full rank. Column headers will be renamed to reflect how columns depend on each other. 18 | \itemize{ 19 | \item \code{(c1_AND_c2)} If multiple columns are exactly identical, only a single instance is retained. 20 | \item \verb{SPACE_<i>_AXIS<j>} For each set of linearly dependent columns, a space \code{i} with \code{max(j)} dimensions was created using orthogonal axes to replace the original columns. 21 | } 22 | \item \code{space_list}: A named list where each element corresponds to a space and contains the names of the original linearly dependent columns that are contained within that space. 23 | } 24 | } 25 | \description{ 26 | First remove empty columns. Then discover linear dependent columns. For each set of linearly dependent columns, create orthogonal vectors that span the space. Add these vectors as columns to the final matrix to replace the linearly dependent columns. 27 | } 28 | \examples{ 29 | # Create a 1-hot encoded (zero/one) matrix 30 | c1 <- rbinom(10, 1, .4) 31 | c2 <- 1-c1 32 | c3 <- integer(10) 33 | c4 <- c1 34 | c5 <- 2*c2 35 | c6 <- rbinom(10, 1, .8) 36 | c7 <- c5+c6 37 | # Turn into matrix 38 | mat <- cbind(c1, c2, c3, c4, c5, c6, c7) 39 | # Turn the matrix into full rank, this will: 40 | # 1. remove empty columns (all zero) 41 | # 2. merge columns with the same entries (duplicates) 42 | # 3. identify linearly dependent columns 43 | # 4. 
replace them with orthogonal vectors that span the same space 44 | result <- make_full_rank_matrix(mat) 45 | # verbose=TRUE will give details on how many columns are removed in every step 46 | result <- make_full_rank_matrix(mat, verbose=TRUE) 47 | # look at the created full rank matrix 48 | mat_full <- result$matrix 49 | # check which linearly dependent columns spanned the identified spaces 50 | spaces <- result$space_list 51 | } 52 | -------------------------------------------------------------------------------- /man/validate_column_names.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/validate_column_names.R 3 | \name{validate_column_names} 4 | \alias{validate_column_names} 5 | \title{Validate Column Names} 6 | \usage{ 7 | validate_column_names(names) 8 | } 9 | \arguments{ 10 | \item{names}{A character vector of column names to validate.} 11 | } 12 | \value{ 13 | Returns \code{TRUE} if all checks pass. If any check fails, the function stops and returns an error message. 14 | } 15 | \description{ 16 | This function checks a vector of column names to ensure they are valid. It performs the following checks: 17 | \itemize{ 18 | \item The column names must not be \code{NULL}. 19 | \item The column names must not contain empty strings. 20 | \item The column names must not contain \code{NA} values. 21 | \item The column names must be unique. 
22 | } 23 | } 24 | \examples{ 25 | validate_column_names(c("name", "age", "gender")) 26 | 27 | } 28 | -------------------------------------------------------------------------------- /tests/spelling.R: -------------------------------------------------------------------------------- 1 | if(requireNamespace('spelling', quietly = TRUE)) 2 | spelling::spell_check_test(vignettes = TRUE, error = FALSE, 3 | skip_on_cran = TRUE) 4 | -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(fullRankMatrix) 3 | 4 | test_check("fullRankMatrix") 5 | -------------------------------------------------------------------------------- /tests/testthat/test-find_connected_components.R: -------------------------------------------------------------------------------- 1 | test_that("find_connected_components", { 2 | 3 | connections <- list( 4 | c(1,2), c(2, 5), c(4, 3, 5), 5 | c(6, 7), 6 | c(8) 7 | ) 8 | 9 | res <- find_connected_components(connections) 10 | expect_setequal(res[[1]], c(1,2,3,4,5)) 11 | expect_setequal(res[[2]], c(6, 7)) 12 | expect_setequal(res[[3]], c(8)) 13 | expect_equal(length(res), 3) 14 | 15 | system.time( 16 | res <- find_connected_components(list(1:1000)) 17 | ) 18 | expect_setequal(res[[1]], 1:1000) 19 | }) 20 | 21 | test_that("find_connected_components finds same components as igraph", { 22 | testthat::skip_if_not_installed("igraph") 23 | # Check results against igraph 24 | edges <- as.character(as.integer(sample(1:500, size = 600, replace = TRUE))) 25 | gr <- igraph::make_undirected_graph(edges) 26 | connections <- asplit(igraph::as_edgelist(gr), 1) 27 | res <- find_connected_components(connections) 28 | ires <- igraph::components(gr) 29 | expect_equal(sort(lengths(res)), sort(ires$csize)) 30 | 31 | mem <- ires$membership 32 | ires_list <- lapply(seq_len(max(mem)), \(idx){ 33 | names(mem)[mem == idx] 34 | 
}) 35 | for(comp in res){ 36 | matched <- FALSE 37 | for(comp2 in ires_list){ 38 | if(length(comp) == length(comp2) && all(sort(comp) == sort(comp2))){ 39 | matched <- TRUE 40 | break 41 | } 42 | } 43 | expect_true(matched) 44 | } 45 | 46 | }) 47 | -------------------------------------------------------------------------------- /tests/testthat/test-find_linear_dependent_columns.R: -------------------------------------------------------------------------------- 1 | test_that("find_linear_dependent_columns works", { 2 | 3 | mat <- matrix(rnorm(n = 10 * 4), nrow = 10, ncol = 4) 4 | mat <- cbind(mat, mat[,1] + 2 * mat[,2], rnorm(10)) 5 | expect_equal(find_linear_dependent_columns(mat), list(c(1,2,5))) 6 | 7 | mat <- cbind(mat, 0.3 * mat[,3] + 0.4 * mat[,4], rnorm(10)) 8 | expect_equal(find_linear_dependent_columns(mat), list(c(1,2,5), c(3,4,7))) 9 | 10 | mat <- cbind(mat, 2 * mat[,1] + 4 * mat[,4]) 11 | expect_equal(find_linear_dependent_columns(mat), list(c(1,2,3,4,5,7,9))) 12 | 13 | }) 14 | 15 | test_that("find_linear_dependent_columns errors if input is not of type matrix", { 16 | set.seed(1000) 17 | c1 <- rbinom(10, 1, .4) 18 | c2 <- 1-c1 19 | c3 <- integer(10) 20 | c4 <- c1 21 | mat <- data.frame(c1, c2, c3, c4) 22 | expect_error(find_linear_dependent_columns(mat)) 23 | }) 24 | -------------------------------------------------------------------------------- /tests/testthat/test-make_full_rank_matrix.R: -------------------------------------------------------------------------------- 1 | test_that("check that a full rank matrix is returned unchanged", { 2 | 3 | df <- data.frame(col = sample(letters[1:3], 10, replace = TRUE)) 4 | mat <- model.matrix(~ col - 1, data = df) 5 | result <- make_full_rank_matrix(mat) 6 | mat2 <- result$matrix 7 | expect_equal(mat, mat2, ignore_attr = TRUE) 8 | expect_equal(length(result$space_list), 0) 9 | 10 | }) 11 | 12 | 13 | test_that("removing empty columns as expected", { 14 | set.seed(1000) 15 | c1 <- rbinom(10, 1, .4) 16 | c2 <- 1-c1 
17 | c3 <- integer(10) 18 | c4 <- integer(10) 19 | mat <- as.matrix(data.frame(c1, c2, c3, c4)) 20 | mat_empty <- as.matrix(data.frame(c1, c2)) 21 | mat2 <- remove_empty_columns(mat) 22 | expect_equal(mat2, mat_empty) 23 | 24 | mat <- as.matrix(data.frame(c1, c2)) 25 | mat2 <- remove_empty_columns(mat) 26 | expect_equal(mat2, mat) 27 | 28 | 29 | }) 30 | 31 | test_that("returning names of empty columns", { 32 | set.seed(1000) 33 | c1 <- rbinom(10, 1, .4) 34 | c2 <- 1-c1 35 | c3 <- integer(10) 36 | c4 <- integer(10) 37 | mat <- as.matrix(data.frame(c1, c2, c3, c4)) 38 | empty_cols <- find_empty_columns(mat, return_names=TRUE) 39 | expect_equal(empty_cols, c("c3", "c4")) 40 | 41 | mat <- as.matrix(data.frame(c1, c2)) 42 | mat2 <- remove_empty_columns(mat) 43 | expect_equal(mat2, mat) 44 | }) 45 | 46 | test_that("merging duplicated columns as expected", { 47 | set.seed(1000) 48 | c1 <- rbinom(10, 1, .4) 49 | c2 <- 1-c1 50 | c3 <- integer(10) 51 | c4 <- c1 52 | mat <- as.matrix(data.frame(c1, c2, c3, c4)) 53 | mat_dedupl <- as.matrix(data.frame(c1, c2, c3)) 54 | colnames(mat_dedupl) <- c('(c1_AND_c4)', 'c2', 'c3') 55 | mat2 <- merge_duplicated(mat) 56 | expect_equal(mat2, mat_dedupl) 57 | 58 | mat <- as.matrix(data.frame(c1, c2)) 59 | mat2 <- merge_duplicated(mat) 60 | expect_equal(mat2, mat) 61 | }) 62 | 63 | test_that("make full rank column as expected", { 64 | intercept <- rep(1,10) 65 | c1 <- rbinom(10, 1, .4) 66 | c2 <- 1-c1 67 | c3 <- rnorm(10) 68 | c4 <- 10*c3 69 | c5 <- c1 70 | c6 <- integer(10) 71 | 72 | mat <- as.matrix(data.frame(intercept, c1, c2, c3, c4, c5, c6)) 73 | result <- make_full_rank_matrix(mat) 74 | red_mat <- result$matrix 75 | space_list <- result$space_list 76 | expect_equal(ncol(red_mat), 3) 77 | expect_equal(qr(red_mat)$rank, 3) 78 | expect_equal(colnames(red_mat), c("SPACE_1_AXIS1", "SPACE_1_AXIS2", "SPACE_2_AXIS1")) 79 | expect_equal(length(space_list), 2) 80 | expect_equal(space_list$SPACE_1, c("intercept", "(c1_AND_c5)", "c2")) 81 | 
expect_equal(space_list$SPACE_2, c("c3", "c4")) 82 | 83 | mat <- as.matrix(data.frame(c1, c2, c3, c4, c5, c6)) 84 | result <- make_full_rank_matrix(mat) 85 | red_mat <- result$matrix 86 | space_list <- result$space_list 87 | expect_equal(ncol(red_mat), 3) 88 | expect_equal(qr(red_mat)$rank, 3) 89 | expect_equal(colnames(red_mat), c("(c1_AND_c5)", "c2", "SPACE_1_AXIS1")) 90 | }) 91 | 92 | test_that("merge_duplicated errors if input is not of type matrix", { 93 | set.seed(1000) 94 | c1 <- rbinom(10, 1, .4) 95 | c2 <- 1-c1 96 | c3 <- integer(10) 97 | c4 <- c1 98 | mat <- data.frame(c1, c2, c3, c4) 99 | expect_error(merge_duplicated(mat)) 100 | }) 101 | 102 | test_that("collapse_linearly_dependent_columns works", { 103 | 104 | mat <- matrix(rnorm(n = 10 * 4), nrow = 10, ncol = 4, dimnames = list(character(0L), paste0("col_", 1:4))) 105 | mat <- cbind(mat, comb_12 = mat[,1] + 2 * mat[,2], col_6 = rnorm(10)) 106 | result <- collapse_linearly_dependent_columns(mat) 107 | red_mat <- result$matrix 108 | space_list <- result$space_list 109 | expect_equal(ncol(red_mat), 5) 110 | expect_equal(qr(red_mat)$rank, 5) 111 | expect_equal(colnames(red_mat), c("col_3", "col_4", "col_6", "SPACE_1_AXIS1", "SPACE_1_AXIS2")) 112 | expect_equal(length(space_list), 1) 113 | expect_equal(space_list[[1]], c("col_1", "col_2", "comb_12")) 114 | 115 | mat <- cbind(mat, comb_34 = 0.3 * mat[,3] + 0.4 * mat[,4], col_8 = rnorm(10)) 116 | result <- collapse_linearly_dependent_columns(mat) 117 | red_mat <- result$matrix 118 | space_list <- result$space_list 119 | expect_equal(ncol(red_mat), 6) 120 | expect_equal(qr(red_mat)$rank, 6) 121 | expect_equal(colnames(red_mat), c("col_6", "col_8", "SPACE_1_AXIS1", "SPACE_1_AXIS2", "SPACE_2_AXIS1", "SPACE_2_AXIS2")) 122 | 123 | c1 <- rbinom(10, 1, .4) 124 | c2 <- c1*2 125 | c3 <- c1*3 126 | c4 <- c1*0.5 127 | mat <- cbind(c1,c2,c3,c4) 128 | result <- make_full_rank_matrix(mat) 129 | red_mat <- result$matrix 130 | space_list <- result$space_list 131 | 
expect_equal(colnames(red_mat), c("SPACE_1_AXIS1")) 132 | expect_equal(space_list$SPACE_1, c("c1","c2","c3","c4")) 133 | 134 | }) 135 | -------------------------------------------------------------------------------- /vignettes/.gitignore: -------------------------------------------------------------------------------- 1 | *.html 2 | *.R 3 | -------------------------------------------------------------------------------- /vignettes/fullrankmat-comparison.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "fullRankMatrix - Comparison to other packages" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{fullrankmat-comparison} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>" 14 | ) 15 | ``` 16 | 17 | 18 | ## Other available packages that detect linear dependent columns 19 | There are already a few other packages out there that offer functions to detect linear dependent columns. 
Here are the ones we are aware of: 20 | 21 | ```{r} 22 | library(fullRankMatrix) 23 | 24 | # let's say we have 10 fruit salads and indicate which ingredients are present in each salad 25 | strawberry <- c(1,1,1,1,0,0,0,0,0,0) 26 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0) 27 | orange <- c(1,1,1,1,1,1,1,0,0,0) 28 | pear <- c(0,0,0,1,0,0,0,1,1,1) 29 | mint <- c(1,1,0,0,0,0,0,0,0,0) 30 | apple <- c(0,0,0,0,0,0,1,1,1,1) 31 | 32 | # let's pretend we know how each fruit influences the sweetness of a fruit salad 33 | # in this case we say that strawberries and oranges have the biggest influence on sweetness 34 | set.seed(30) 35 | strawberry_sweet <- strawberry * rnorm(10, 4) 36 | poppyseed_sweet <- poppyseed * rnorm(10, 0.1) 37 | orange_sweet <- orange * rnorm(10, 5) 38 | pear_sweet <- pear * rnorm(10, 0.5) 39 | mint_sweet <- mint * rnorm(10, 1) 40 | apple_sweet <- apple * rnorm(10, 2) 41 | 42 | sweetness <- strawberry_sweet + poppyseed_sweet+ orange_sweet + pear_sweet + 43 | mint_sweet + apple_sweet 44 | 45 | mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple) 46 | 47 | ``` 48 | 49 | 50 | 51 | **`caret::findLinearCombos()`**: https://rdrr.io/cran/caret/man/findLinearCombos.html 52 | 53 | This function identifies which columns are linearly dependent and suggests which columns to remove. But it doesn't provide appropriate naming for the remaining columns to indicate that any significant associations with the remaining columns are actually associations with the space spanned by the originally linearly dependent columns. Just removing the indicated columns and then fitting the linear model would lead to erroneous interpretation. 54 | ```{r} 55 | caret_result <- caret::findLinearCombos(mat) 56 | ``` 57 | Fitting a linear model with the `orange` column removed would lead to the erroneous interpretation that `strawberry` and `poppyseed` have the biggest influence on the fruit salad `sweetness`, but we know it is actually `strawberry` and `orange`. 
58 | ```{r} 59 | mat_caret <- mat[, -caret_result$remove] 60 | fit <- lm(sweetness ~ mat_caret + 0) 61 | print(summary(fit)) 62 | ``` 63 | 64 | 65 | **`WeightIt::make_full_rank()`**: https://rdrr.io/cran/WeightIt/man/make_full_rank.html 66 | 67 | This function removes some of the linearly dependent columns to create a full rank matrix, but doesn't rename the remaining columns accordingly. For the user it isn't clear which columns were linearly dependent and they can't choose which column will be removed. 68 | ```{r} 69 | mat_weightit <- WeightIt::make_full_rank(mat, with.intercept = FALSE) 70 | mat_weightit 71 | ``` 72 | As above fitting a linear model with this full rank matrix would lead to erroneous interpretation that `strawberry` and `poppyseed` influence the `sweetness`, but we know it is actually `strawberry` and `orange`. 73 | 74 | ```{r} 75 | fit <- lm(sweetness ~ mat_weightit + 0) 76 | print(summary(fit)) 77 | ``` 78 | 79 | 80 | **`plm::detect.lindep()`:** https://rdrr.io/cran/plm/man/detect.lindep.html 81 | 82 | The function returns which columns are potentially linearly dependent. 83 | ```{r} 84 | plm::detect.lindep(mat) 85 | ``` 86 | 87 | However it doesn't capture all cases. For example here `plm::detect.lindep()` says there are no dependent columns, while there are several: 88 | ```{r} 89 | c1 <- rbinom(10, 1, .4) 90 | c2 <- 1-c1 91 | c3 <- integer(10) 92 | c4 <- c1 93 | c5 <- 2*c2 94 | c6 <- rbinom(10, 1, .8) 95 | c7 <- c5+c6 96 | mat_test <- as.matrix(data.frame(c1,c2,c3,c4,c5,c6,c7)) 97 | 98 | plm::detect.lindep(mat_test) 99 | ``` 100 | 101 | `fullRankMatrix` captures these cases: 102 | ```{r} 103 | result <- make_full_rank_matrix(mat_test) 104 | result$matrix 105 | ``` 106 | **`Smisc::findDepMat()`**: https://rdrr.io/cran/Smisc/man/findDepMat.html 107 | 108 | **NOTE**: this package was removed from CRAN as of 2020-01-26 (https://CRAN.R-project.org/package=Smisc) due to failing checks. 
109 | 110 | This function indicates linearly dependent rows/columns, but it doesn't state which rows/columns are linearly dependent with each other. 111 | 112 | However, this function seems to not work well for one-hot encoded matrices and the package doesn't seem to be updated anymore (s. this issue: https://github.com/pnnl/Smisc/issues/24). 113 | ``` 114 | # example provided by Smisc documentation 115 | Y <- matrix(c(1, 3, 4, 116 | 2, 6, 8, 117 | 7, 2, 9, 118 | 4, 1, 7, 119 | 3.5, 1, 4.5), byrow = TRUE, ncol = 3) 120 | Smisc::findDepMat(t(Y), rows = FALSE) 121 | ``` 122 | 123 | Trying with the model matrix from our example above: 124 | ``` 125 | Smisc::findDepMat(mat, rows=FALSE) 126 | #> Error in if (!depends[j]) { : missing value where TRUE/FALSE needed 127 | ``` 128 | -------------------------------------------------------------------------------- /vignettes/fullrankmat-example.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "fullRankMatrix - Generation of Full Rank Design Matrix" 3 | output: rmarkdown::html_vignette 4 | vignette: > 5 | %\VignetteIndexEntry{fullrankmat-example} 6 | %\VignetteEngine{knitr::rmarkdown} 7 | %\VignetteEncoding{UTF-8} 8 | --- 9 | 10 | ```{r, include = FALSE} 11 | knitr::opts_chunk$set( 12 | collapse = TRUE, 13 | comment = "#>" 14 | ) 15 | ``` 16 | 17 | 18 | 19 | 20 | ```{r, echo=FALSE, out.width="20%", fig.align = 'left'} 21 | knitr::include_graphics("man/figures/fullRankMatrix.png") 22 | ``` 23 | 24 | We developed `fullRankMatrix` primarily for one-hot encoded design matrices used in linear models. In our case, we were faced with a 1-hot encoded design matrix, that had a lot of linearly dependent columns. This happened when modeling a lot of interaction terms. Since fitting a linear model on a design matrix with linearly dependent columns will produce results that can lead to misleading interpretation (s. 
example below), we decided to develop a package that will help with identifying linearly dependent columns and replacing them with columns constructed of orthogonal vectors that span the space of the previously linearly dependent columns. 25 | 26 | The goal of `fullRankMatrix` is to remove empty columns (contain only 0s), merge duplicated columns (containing the same entries) and merge linearly dependent columns. These operations will create a matrix of full rank. The changes made to the columns are reflected in the column headers such that the columns can still be interpreted if the matrix is used in e.g. a linear model fit. 27 | 28 | ## Installation 29 | 30 | You can install `fullRankMatrix` directly from CRAN. Just paste the following snippet into your R console: 31 | 32 | ```r 33 | install.packages("fullRankMatrix") 34 | ``` 35 | 36 | 37 | You can install the development version of `fullRankMatrix` from [GitHub](https://github.com/Pweidemueller/fullRankMatrix) with: 38 | 39 | ```r 40 | # install.packages("devtools") 41 | devtools::install_github("Pweidemueller/fullRankMatrix") 42 | ``` 43 | 44 | ## Citation 45 | 46 | If you want to cite this package in a publication, you can run the following 47 | command in your R console: 48 | 49 | ```{r citation} 50 | citation("fullRankMatrix") 51 | ``` 52 | 53 | ## Linearly dependent columns span a space of a certain dimension 54 | In order to visualize it, let's look at a very simple example. Say we have a matrix with three columns, each with three entries. These columns can be visualized as vectors in a coordinate system with 3 axes (s. image). The first vector points into the plane spanned by the first and third axis. The second and third vectors lie in the plane spanned by the first and second axis. Since this is a very simple example, we immediately spot that the third column is a multiple of the second column. Their corresponding vectors lie perfectly on top of each other. 
This means instead of the two columns spanning a 2D space they just occupy a line, i.e. a 1D space. This is identified by `fullRankMatrix`, which replaces these two linearly dependent columns with one vector that describes the 1D space in which column 2 and column 3 used to lie. The resulting matrix is now full rank with no linearly dependent columns. 55 | 56 | 57 | ```{r setup} 58 | library(fullRankMatrix) 59 | ``` 60 | 61 | ```{r} 62 | c1 <- c(1, 0, 1) 63 | c2 <- c(1, 2, 0) 64 | c3 <- c(2, 4, 0) 65 | 66 | mat <- cbind(c1, c2, c3) 67 | 68 | make_full_rank_matrix(mat) 69 | ``` 70 | ```{r, out.width="100%", fig.cap="Visualisation of identifying and replacing linearly dependent columns."} 71 | knitr::include_graphics("man/figures/example_vectors.png") 72 | ``` 73 | 74 | ## Worked through example 75 | 76 | Above was a rather abstract example that was easy to visualize, let's now walk through the utilities of `fullRankMatrix` when applied to a more realistic design matrix. 77 | 78 | When using linear models you should check if any of the columns in your design matrix are linearly dependent. If there are, this will alter the interpretation of the fit. Here is a rather constructed example where we are interested in identifying which ingredients contribute mostly to the sweetness of fruit salads. 
79 | ```{r} 80 | # let's say we have 10 fruit salads and indicate which ingredients are present in each salad 81 | strawberry <- c(1,1,1,1,0,0,0,0,0,0) 82 | poppyseed <- c(0,0,0,0,1,1,1,0,0,0) 83 | orange <- c(1,1,1,1,1,1,1,0,0,0) 84 | pear <- c(0,0,0,1,0,0,0,1,1,1) 85 | mint <- c(1,1,0,0,0,0,0,0,0,0) 86 | apple <- c(0,0,0,0,0,0,1,1,1,1) 87 | 88 | # let's pretend we know how each fruit influences the sweetness of a fruit salad 89 | # in this case we say that strawberries and oranges have the biggest influence on sweetness 90 | set.seed(30) 91 | strawberry_sweet <- strawberry * rnorm(10, 4) 92 | poppyseed_sweet <- poppyseed * rnorm(10, 0.1) 93 | orange_sweet <- orange * rnorm(10, 5) 94 | pear_sweet <- pear * rnorm(10, 0.5) 95 | mint_sweet <- mint * rnorm(10, 1) 96 | apple_sweet <- apple * rnorm(10, 2) 97 | 98 | sweetness <- strawberry_sweet + poppyseed_sweet+ orange_sweet + pear_sweet + 99 | mint_sweet + apple_sweet 100 | 101 | mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple) 102 | 103 | fit <- lm(sweetness ~ mat + 0) 104 | print(summary(fit)) 105 | ``` 106 | As you can see `lm` realizes that "1 [column] not defined because of singularities" (`matorange` is not defined) but it doesn't indicate what columns it is linearly dependent with. 107 | 108 | So if you would just look at the columns and not consider the `NA` further, you would interpret that `strawberry` and `poppyseed` are the biggest contributors to the sweetness of fruit salads. 109 | 110 | However, when you look at the model matrix you can see that the `orange` column is a linear combination of the `strawberry` and `poppyseed` columns (or vice versa). So truly any of the three factors could contribute to the sweetness of a fruit salad, the linear model has no way of recovering which one given these 10 examples. And since we constructed this example we know that `orange` and `strawberry` are the sweetest and `poppyseed` contributes least to the sweetness. 
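The dependency described in the paragraph above can be confirmed with base R alone. The following is a minimal sketch and not part of `fullRankMatrix`; it repeats the ingredient vectors from the example so that it runs on its own, and it uses `qr()` to compute the numerical rank of the design matrix. The names `orange_is_sum` and `mat_rank` are introduced here only for illustration.

```r
# Ingredient indicators, repeated from the example so this chunk is self-contained
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed  <- c(0,0,0,0,1,1,1,0,0,0)
orange     <- c(1,1,1,1,1,1,1,0,0,0)
pear       <- c(0,0,0,1,0,0,0,1,1,1)
mint       <- c(1,1,0,0,0,0,0,0,0,0)
apple      <- c(0,0,0,0,0,0,1,1,1,1)

mat <- cbind(strawberry, poppyseed, orange, pear, mint, apple)

# orange is exactly the sum of the strawberry and poppyseed columns
orange_is_sum <- all(strawberry + poppyseed == orange)

# the numerical rank (from a QR decomposition) is smaller than the number of columns
mat_rank <- qr(mat)$rank
c(rank = mat_rank, columns = ncol(mat))
```

A rank below the column count is the signal that at least one column can be written as a combination of the others, which is why `lm` cannot estimate all six coefficients.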
111 | ```{r}
112 | mat
113 | ```
114 | To make such cases more obvious and to still be able to interpret the linear model fit correctly, we wrote `fullRankMatrix`. Using the `make_full_rank_matrix()` function, it removes linearly dependent columns and renames the remaining columns to make the dependencies clear.
115 | 
116 | ```{r}
117 | library(fullRankMatrix)
118 | result <- make_full_rank_matrix(mat)
119 | mat_fr <- result$matrix
120 | space_list <- result$space_list
121 | mat_fr
122 | ```
123 | 
124 | ```{r}
125 | fit <- lm(sweetness ~ mat_fr + 0)
126 | print(summary(fit))
127 | ```
128 | You can see that there are no more undefined columns. The columns `strawberry`, `orange` and `poppyseed` were removed and replaced with two columns (`SPACE_1_AXIS1`, `SPACE_1_AXIS2`). These are linearly independent (orthogonal) vectors that span the space in which the linearly dependent columns `strawberry`, `orange` and `poppyseed` lay.
129 | 
130 | The original columns that are contained within a space can be viewed in the returned `space_list`:
131 | ```{r}
132 | space_list
133 | ```
134 | 
135 | The individual axes of the constructed spaces are difficult to interpret, but we see that the axes of the space of `strawberry`, `orange` and `poppyseed` show a significant association with the sweetness of fruit salads. Resolving which of the three terms is most strongly associated with `sweetness` is not possible with the given observations, but there is definitely an association of `sweetness` with the space spanned by the three terms.
136 | 
137 | If only a subset of the axes of a space shows a significant association in the linear model fit, this could indicate that only a subset of the linearly dependent columns, namely those that lie within the space spanned by the significantly associated axes, drives this association. This would require a more detailed investigation by the user that is specific to the use case.
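As a complementary check that does not rely on `fullRankMatrix`, the dependency in the fruit salad design matrix can also be confirmed with base R. The following is a minimal sketch, not part of the package: it rebuilds the matrix from the example, uses `qr()` to obtain the numerical rank, and tests directly that `orange` equals `strawberry + poppyseed`. The names `mat_rank`, `is_rank_deficient` and `orange_is_sum` are introduced here purely for illustration.

```r
# Base-R sketch: confirm the dependency among the design-matrix columns.
# Column definitions are copied from the fruit salad example.
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed  <- c(0,0,0,0,1,1,1,0,0,0)
orange     <- c(1,1,1,1,1,1,1,0,0,0)
pear       <- c(0,0,0,1,0,0,0,1,1,1)
mint       <- c(1,1,0,0,0,0,0,0,0,0)
apple      <- c(0,0,0,0,0,0,1,1,1,1)
mat <- cbind(strawberry, poppyseed, orange, pear, mint, apple)

# Numerical rank from a pivoted QR decomposition (illustrative helper name).
mat_rank <- qr(mat)$rank                    # 5, although mat has 6 columns

# The matrix is rank deficient if its rank is below the number of columns.
is_rank_deficient <- mat_rank < ncol(mat)   # TRUE

# Check the specific dependency: orange is the sum of strawberry and poppyseed.
orange_is_sum <- all(orange == strawberry + poppyseed)  # TRUE
```

A rank below `ncol(mat)` only shows that some dependency exists. It does not say which columns are involved, which is the information that `make_full_rank_matrix()` returns in `space_list`.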
138 | 
--------------------------------------------------------------------------------
/vignettes/man/figures/example_vectors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pweidemueller/fullRankMatrix/ad4bc2dc58642eb7e639ee3e3d79e5dca28f4c4c/vignettes/man/figures/example_vectors.png
--------------------------------------------------------------------------------
/vignettes/man/figures/fullRankMatrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pweidemueller/fullRankMatrix/ad4bc2dc58642eb7e639ee3e3d79e5dca28f4c4c/vignettes/man/figures/fullRankMatrix.png
--------------------------------------------------------------------------------