├── codecov.yml ├── data └── security_logs.rda ├── tests ├── testthat.R └── testthat │ ├── test_mc_adjust.R │ ├── test_inspect_block.R │ ├── test_get_all_factors.R │ ├── test_kaisers_index.R │ ├── test_horns_curve.R │ ├── test_bd_row.R │ ├── test_mahalanobis_distance.R │ ├── test_hmat.R │ ├── test_principal_components.R │ ├── test_factor_analysis.R │ ├── test_tabulate_state_vector.R │ ├── test_factor_analysis_results.R │ └── test_principal_components_result.R ├── .gitignore ├── docs ├── reference │ ├── horns_curve-1.png │ ├── figures │ │ └── anomalyDetection-logo.png │ ├── anomalyDetection.html │ ├── security_logs.html │ ├── get_all_factors.html │ ├── horns_curve.html │ ├── index.html │ ├── mc_adjust.html │ ├── bd_row.html │ └── kaisers_index.html ├── pkgdown.yml ├── articles │ ├── Introduction_files │ │ └── figure-html │ │ │ ├── unnamed-chunk-12-1.png │ │ │ ├── unnamed-chunk-13-1.png │ │ │ └── unnamed-chunk-6-1.png │ └── index.html ├── link.svg ├── docsearch.js ├── jquery.sticky-kit.min.js ├── pkgdown.js ├── pkgdown.css ├── authors.html ├── news │ └── index.html └── index.html ├── tools └── anomalyDetection-logo.png ├── man ├── figures │ └── anomalyDetection-logo.png ├── anomalyDetection.Rd ├── get_all_factors.Rd ├── pipe.Rd ├── inspect_block.Rd ├── security_logs.Rd ├── horns_curve.Rd ├── bd_row.Rd ├── kaisers_index.Rd ├── factor_analysis.Rd ├── factor_analysis_results.Rd ├── mc_adjust.Rd ├── tabulate_state_vector.Rd ├── principal_components_result.Rd ├── mahalanobis_distance.Rd ├── principal_components.Rd └── hmat.Rd ├── src ├── Makevars ├── Makevars.win ├── bottlenecks.cpp └── RcppExports.cpp ├── .Rbuildignore ├── R ├── anomalyDetection.R ├── RcppExports.R ├── pipe.R ├── data.R ├── get_all_factors.R ├── block_inspect.R ├── kaisers_index.R ├── horns_curve.R ├── bd_row.R ├── mahalanobis_distance.R ├── mc_adjust.R ├── factor_analysis.R ├── pca.R ├── hmat.R └── tabulate_state_vector.R ├── .travis.yml ├── anomalyDetection.Rproj ├── NAMESPACE ├── cran-comments.md ├── 
appveyor.yml ├── inst └── CITATION ├── NEWS.md ├── DESCRIPTION ├── README.md └── README.Rmd /codecov.yml: -------------------------------------------------------------------------------- 1 | comment: false 2 | -------------------------------------------------------------------------------- /data/security_logs.rda: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/koalaverse/anomalyDetection/HEAD/data/security_logs.rda -------------------------------------------------------------------------------- /tests/testthat.R: -------------------------------------------------------------------------------- 1 | library(testthat) 2 | library(anomalyDetection) 3 | 4 | test_check("anomalyDetection") 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | inst/doc 6 | src/*.o 7 | src/*.so 8 | src/*.dll 9 | -------------------------------------------------------------------------------- /docs/reference/horns_curve-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/koalaverse/anomalyDetection/HEAD/docs/reference/horns_curve-1.png -------------------------------------------------------------------------------- /tools/anomalyDetection-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/koalaverse/anomalyDetection/HEAD/tools/anomalyDetection-logo.png -------------------------------------------------------------------------------- /docs/pkgdown.yml: -------------------------------------------------------------------------------- 1 | pandoc: 2.3.1 2 | pkgdown: 1.1.0 3 | pkgdown_sha: ~ 4 | articles: 5 | Introduction: Introduction.html 6 | 7 | 
-------------------------------------------------------------------------------- /man/figures/anomalyDetection-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/koalaverse/anomalyDetection/HEAD/man/figures/anomalyDetection-logo.png -------------------------------------------------------------------------------- /docs/reference/figures/anomalyDetection-logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/koalaverse/anomalyDetection/HEAD/docs/reference/figures/anomalyDetection-logo.png -------------------------------------------------------------------------------- /src/Makevars: -------------------------------------------------------------------------------- 1 | 2 | ## optional 3 | #CXX_STD = CXX11 4 | 5 | PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) 6 | PKG_LIBS = $(SHLIB_OPENMP_CFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) 7 | -------------------------------------------------------------------------------- /src/Makevars.win: -------------------------------------------------------------------------------- 1 | 2 | ## optional 3 | #CXX_STD = CXX11 4 | 5 | PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) 6 | PKG_LIBS = $(SHLIB_OPENMP_CFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) 7 | -------------------------------------------------------------------------------- /.Rbuildignore: -------------------------------------------------------------------------------- 1 | ^.*\.Rproj$ 2 | ^\.Rproj\.user$ 3 | ^README\.Rmd$ 4 | ^README-.*\.png$ 5 | ^\.travis\.yml$ 6 | ^appveyor\.yml$ 7 | ^codecov\.yml$ 8 | ^cran-comments\.md$ 9 | ^docs$ 10 | -------------------------------------------------------------------------------- /docs/articles/Introduction_files/figure-html/unnamed-chunk-12-1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/koalaverse/anomalyDetection/HEAD/docs/articles/Introduction_files/figure-html/unnamed-chunk-12-1.png -------------------------------------------------------------------------------- /docs/articles/Introduction_files/figure-html/unnamed-chunk-13-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/koalaverse/anomalyDetection/HEAD/docs/articles/Introduction_files/figure-html/unnamed-chunk-13-1.png -------------------------------------------------------------------------------- /docs/articles/Introduction_files/figure-html/unnamed-chunk-6-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/koalaverse/anomalyDetection/HEAD/docs/articles/Introduction_files/figure-html/unnamed-chunk-6-1.png -------------------------------------------------------------------------------- /R/anomalyDetection.R: -------------------------------------------------------------------------------- 1 | #' anomalyDetection: An R package for implementing augmented network log anomaly 2 | #' detection procedures.
3 | #' 4 | #' @importFrom Rcpp evalCpp 5 | #' @useDynLib anomalyDetection, .registration = TRUE 6 | #' @docType package 7 | #' @name anomalyDetection 8 | NULL 9 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | # R for travis: see documentation at https://docs.travis-ci.com/user/languages/r 2 | 3 | language: R 4 | sudo: required 5 | cache: packages 6 | before_install: 7 | - sudo apt-get install libgmp3-dev 8 | 9 | r_packages: 10 | - covr 11 | 12 | after_success: 13 | - Rscript -e 'library(covr); codecov()' 14 | -------------------------------------------------------------------------------- /tests/testthat/test_mc_adjust.R: -------------------------------------------------------------------------------- 1 | context("mc_adjust") 2 | 3 | set.seed(123) 4 | x <- matrix(runif(100), ncol = 10) 5 | 6 | test_that("mc_adjust provides proper messages and warnings", { 7 | 8 | expect_that(x %>% mc_adjust(min_var = .5), throws_error()) 9 | expect_false(x %>% mc_adjust(max_cor = .8) %>% colnames() %>% is.null()) 10 | 11 | }) 12 | -------------------------------------------------------------------------------- /tests/testthat/test_inspect_block.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | test_that("inspect_block provides proper messages and warnings", { 4 | 5 | expect_that(inspect_block(data = letters, 30), throws_error()) 6 | expect_that(inspect_block(security_logs, "30"), throws_error()) 7 | 8 | }) 9 | 10 | test_that("inspect_block provides proper output", { 11 | 12 | expect_true(is.list(inspect_block(security_logs, 30))) 13 | expect_equal(inspect_block(security_logs, 30) %>% length(), 10) 14 | 15 | }) 16 | -------------------------------------------------------------------------------- /man/anomalyDetection.Rd: -------------------------------------------------------------------------------- 1 | % 
Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/anomalyDetection.R 3 | \docType{package} 4 | \name{anomalyDetection} 5 | \alias{anomalyDetection} 6 | \alias{anomalyDetection-package} 7 | \title{anomalyDetection: An R package for implementing augmented network log anomaly 8 | detection procedures.} 9 | \description{ 10 | anomalyDetection: An R package for implementing augmented network log anomaly 11 | detection procedures. 12 | } 13 | -------------------------------------------------------------------------------- /tests/testthat/test_get_all_factors.R: -------------------------------------------------------------------------------- 1 | context("get_all_factors") 2 | 3 | test_that("get_all_factors provides proper messages and warnings", { 4 | 5 | expect_that(get_all_factors(letters), throws_error()) 6 | 7 | }) 8 | 9 | test_that("get_all_factors output is a list", { 10 | 11 | expect_true(get_all_factors(1:10) %>% is.list) 12 | 13 | }) 14 | 15 | test_that("get_all_factors computes appropriately", { 16 | 17 | expect_equal(get_all_factors(27) %>% .[[1]], c(1, 3, 9, 27)) 18 | 19 | }) 20 | -------------------------------------------------------------------------------- /anomalyDetection.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | 18 | BuildType: Package 19 | PackageUseDevtools: Yes 20 | PackageInstallArgs: --no-multiarch --with-keep.source 21 | PackageRoxygenize: rd,collate,namespace 22 | -------------------------------------------------------------------------------- /R/RcppExports.R: -------------------------------------------------------------------------------- 1 | #
Generated by using Rcpp::compileAttributes() -> do not edit by hand 2 | # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 3 | 4 | compute_md <- function(x) { 5 | .Call(`_anomalyDetection_compute_md`, x) 6 | } 7 | 8 | compute_bd <- function(x, normalize) { 9 | .Call(`_anomalyDetection_compute_bd`, x, normalize) 10 | } 11 | 12 | compute_md_and_bd <- function(x, normalize) { 13 | .Call(`_anomalyDetection_compute_md_and_bd`, x, normalize) 14 | } 15 | 16 | compute_hc <- function(n, p, nsim) { 17 | .Call(`_anomalyDetection_compute_hc`, n, p, nsim) 18 | } 19 | 20 | -------------------------------------------------------------------------------- /NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | S3method(mahalanobis_distance,data.frame) 4 | S3method(mahalanobis_distance,matrix) 5 | export("%>%") 6 | export(bd_row) 7 | export(factor_analysis) 8 | export(factor_analysis_results) 9 | export(get_all_factors) 10 | export(hmat) 11 | export(horns_curve) 12 | export(inspect_block) 13 | export(kaisers_index) 14 | export(mahalanobis_distance) 15 | export(mc_adjust) 16 | export(principal_components) 17 | export(principal_components_result) 18 | export(tabulate_state_vector) 19 | importFrom(Rcpp,evalCpp) 20 | importFrom(magrittr,"%>%") 21 | useDynLib(anomalyDetection, .registration = TRUE) 22 | -------------------------------------------------------------------------------- /cran-comments.md: -------------------------------------------------------------------------------- 1 | ## Changes 2 | 3 | This is a resubmission. 
In this version I have: 4 | 5 | - Tested and fixed any issues with new tibble version dependency 6 | - Updated vignette to provide consistent tibble output 7 | 8 | ## Test environments 9 | 10 | * local Windows install, R 3.5.1 11 | * Ubuntu 12.04 (on travis-ci), R 3.5.1 12 | * Windows (on AppVeyor) 13 | * win-builder (devel and release) 14 | 15 | ## R CMD check results 16 | 17 | 0 errors | 0 warnings | 1 note 18 | 19 | R Under development (unstable) (2018-12-17 r75857) 20 | 21 | 22 | ## Reverse dependencies 23 | 24 | There are currently no downstream dependencies for this package. 25 | 26 | -------------------------------------------------------------------------------- /man/get_all_factors.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/get_all_factors.R 3 | \name{get_all_factors} 4 | \alias{get_all_factors} 5 | \title{Find All Factors} 6 | \source{ 7 | \url{http://stackoverflow.com/a/6425597/3851274} 8 | } 9 | \usage{ 10 | get_all_factors(n) 11 | } 12 | \arguments{ 13 | \item{n}{number to be factored} 14 | } 15 | \value{ 16 | A list containing the integer vector(s) containing all factors for the given 17 | \code{n} inputs. 18 | } 19 | \description{ 20 | \code{get_all_factors} finds all factor pairs for a given integer (i.e. a number 21 | that divides evenly into another number). 22 | } 23 | \examples{ 24 | 25 | # Find all the factors of 39304 26 | get_all_factors(39304) 27 | 28 | } 29 | -------------------------------------------------------------------------------- /R/pipe.R: -------------------------------------------------------------------------------- 1 | #' Pipe functions 2 | #' 3 | #' Like dplyr, anomalyDetection also uses the pipe function, \code{\%>\%}, to turn 4 | #' function composition into a series of imperative statements.
5 | #' 6 | #' @importFrom magrittr %>% 7 | #' @name %>% 8 | #' @rdname pipe 9 | #' @export 10 | #' @param lhs,rhs An R object and a function to apply to it 11 | #' @examples 12 | #' 13 | #' x <- matrix(rnorm(200*3), ncol = 10) 14 | #' N <- nrow(x) 15 | #' p <- ncol(x) 16 | #' 17 | #' # Instead of 18 | #' hc <- horns_curve(x) 19 | #' fa <- factor_analysis(x, hc_points = hc) 20 | #' factor_analysis_results(fa, fa_scores_rotated) 21 | #' 22 | #' # You can write 23 | #' horns_curve(x) %>% 24 | #' factor_analysis(x, hc_points = .) %>% 25 | #' factor_analysis_results(fa_scores_rotated) 26 | NULL 27 | -------------------------------------------------------------------------------- /R/data.R: -------------------------------------------------------------------------------- 1 | #' Security Log Data 2 | #' 3 | #' A mock dataset containing common information that appears in security logs. 4 | #' 5 | #' @format A data frame with 300 rows and 10 variables: 6 | #' \describe{ 7 | #' \item{Device_Vendor}{Company who made the device} 8 | #' \item{Device_Product}{Name of the security device} 9 | #' \item{Device_Action}{Outcome result of access} 10 | #' \item{Src_IP}{IP address of the source} 11 | #' \item{Dst_IP}{IP address of the destination} 12 | #' \item{Src_Port}{Port identifier of the source} 13 | #' \item{Dst_Port}{Port identifier of the destination} 14 | #' \item{Protocol}{Transport protocol used} 15 | #' \item{Country_Src}{Country of the source} 16 | #' \item{Bytes_TRF}{Number of bytes transferred} 17 | #' } 18 | "security_logs" 19 | -------------------------------------------------------------------------------- /tests/testthat/test_kaisers_index.R: -------------------------------------------------------------------------------- 1 | context("kaisers_index") 2 | 3 | set.seed(123) 4 | x <- matrix(rnorm(200*3), ncol = 10) 5 | 6 | x %>% 7 | horns_curve() %>% 8 | factor_analysis(x, hc_points = .) 
%>% 9 | factor_analysis_results(fa_loadings_rotated) 10 | 11 | 12 | test_that("kaisers_index provides proper messages and warnings", { 13 | 14 | expect_that(kaisers_index(letters), throws_error()) 15 | 16 | }) 17 | 18 | test_that("kaisers_index output is correct object and dimensions", { 19 | 20 | expect_true(kaisers_index(x) %>% is.atomic) 21 | expect_true(kaisers_index(x) %>% is.numeric) 22 | expect_equal(kaisers_index(x) %>% length(), 1) 23 | 24 | }) 25 | 26 | test_that("kaisers_index computes appropriately", { 27 | 28 | expect_equal(kaisers_index(x) %>% round(3), 0.397) 29 | 30 | }) 31 | -------------------------------------------------------------------------------- /tests/testthat/test_horns_curve.R: -------------------------------------------------------------------------------- 1 | context("horns_curve") 2 | 3 | # Simulate some data for tests 4 | set.seed(123) 5 | x <- matrix(runif(100), ncol = 10) 6 | 7 | test_that("horns_curve provides proper messages and warnings", { 8 | expect_that(horns_curve(n = 10), throws_error()) 9 | expect_that(horns_curve(p = 10), throws_error()) 10 | expect_that(horns_curve(n = "a"), throws_error()) 11 | expect_that(horns_curve(p = "a"), throws_error()) 12 | expect_that(horns_curve(n = 1:3, p = 1:3), throws_error()) 13 | }) 14 | 15 | test_that("horns_curve output is correct dimensions", { 16 | expect_equal(length(horns_curve(x)), 10) 17 | expect_equal(length(horns_curve(n = 5, p = 8)), 8) 18 | expect_true(is.numeric(horns_curve(n = 5, p = 8))) 19 | expect_true(is.atomic(horns_curve(n = 5, p = 8))) 20 | }) 21 | -------------------------------------------------------------------------------- /man/pipe.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/pipe.R 3 | \name{\%>\%} 4 | \alias{\%>\%} 5 | \title{Pipe functions} 6 | \arguments{ 7 | \item{lhs, rhs}{An R object and a function to apply to it} 8 | } 9 | 
\description{ 10 | Like dplyr, anomalyDetection also uses the pipe function, \code{\%>\%} to turn 11 | function composition into a series of imperative statements. 12 | } 13 | \examples{ 14 | 15 | x <- matrix(rnorm(200*3), ncol = 10) 16 | N <- nrow(x) 17 | p <- ncol(x) 18 | 19 | # Instead of 20 | hc <- horns_curve(x) 21 | fa <- factor_analysis(x, hc_points = hc) 22 | factor_analysis_results(fa, fa_scores_rotated) 23 | 24 | # You can write 25 | horns_curve(x) \%>\% 26 | factor_analysis(x, hc_points = .) \%>\% 27 | factor_analysis_results(fa_scores_rotated) 28 | } 29 | -------------------------------------------------------------------------------- /docs/link.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 8 | 12 | 13 | -------------------------------------------------------------------------------- /man/inspect_block.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/block_inspect.R 3 | \name{inspect_block} 4 | \alias{inspect_block} 5 | \title{Block Inspection} 6 | \usage{ 7 | inspect_block(data, block_length) 8 | } 9 | \arguments{ 10 | \item{data}{data} 11 | 12 | \item{block_length}{integer value to divide data} 13 | } 14 | \value{ 15 | A list where each item is a data frame that contains the original data 16 | for each block denoted in the state vector. 17 | } 18 | \description{ 19 | \code{inspect_block} creates a list where the original data has been divided 20 | into blocks denoted in the state vector. Streamlines the process of inspecting 21 | specific blocks of interest. 22 | } 23 | \examples{ 24 | 25 | inspect_block(security_logs, 30) 26 | 27 | } 28 | \seealso{ 29 | \code{\link{tabulate_state_vector}} for creating the state vector matrix based 30 | on desired blocks. 
31 | } 32 | -------------------------------------------------------------------------------- /man/security_logs.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/data.R 3 | \docType{data} 4 | \name{security_logs} 5 | \alias{security_logs} 6 | \title{Security Log Data} 7 | \format{A data frame with 300 rows and 10 variables: 8 | \describe{ 9 | \item{Device_Vendor}{Company who made the device} 10 | \item{Device_Product}{Name of the security device} 11 | \item{Device_Action}{Outcome result of access} 12 | \item{Src_IP}{IP address of the source} 13 | \item{Dst_IP}{IP address of the destination} 14 | \item{Src_Port}{Port identifier of the source} 15 | \item{Dst_Port}{Port identifier of the destination} 16 | \item{Protocol}{Transport protocol used} 17 | \item{Country_Src}{Country of the source} 18 | \item{Bytes_TRF}{Number of bytes transferred} 19 | }} 20 | \usage{ 21 | security_logs 22 | } 23 | \description{ 24 | A mock dataset containing common information that appears in security logs. 
25 | } 26 | \keyword{datasets} 27 | -------------------------------------------------------------------------------- /appveyor.yml: -------------------------------------------------------------------------------- 1 | # DO NOT CHANGE the "init" and "install" sections below 2 | 3 | # Download script file from GitHub 4 | init: 5 | ps: | 6 | $ErrorActionPreference = "Stop" 7 | Invoke-WebRequest http://raw.github.com/krlmlr/r-appveyor/master/scripts/appveyor-tool.ps1 -OutFile "..\appveyor-tool.ps1" 8 | Import-Module '..\appveyor-tool.ps1' 9 | 10 | install: 11 | ps: Bootstrap 12 | 13 | # Adapt as necessary starting from here 14 | 15 | build_script: 16 | - travis-tool.sh install_deps 17 | 18 | test_script: 19 | - travis-tool.sh run_tests 20 | 21 | on_failure: 22 | - 7z a failure.zip *.Rcheck\* 23 | - appveyor PushArtifact failure.zip 24 | 25 | artifacts: 26 | - path: '*.Rcheck\**\*.log' 27 | name: Logs 28 | 29 | - path: '*.Rcheck\**\*.out' 30 | name: Logs 31 | 32 | - path: '*.Rcheck\**\*.fail' 33 | name: Logs 34 | 35 | - path: '*.Rcheck\**\*.Rout' 36 | name: Logs 37 | 38 | - path: '\*_*.tar.gz' 39 | name: Bits 40 | 41 | - path: '\*_*.zip' 42 | name: Bits 43 | -------------------------------------------------------------------------------- /tests/testthat/test_bd_row.R: -------------------------------------------------------------------------------- 1 | context("bd_row") 2 | 3 | set.seed(123) 4 | x = matrix(rnorm(200*3), ncol = 10) 5 | colnames(x) = paste0("C", 1:ncol(x)) 6 | 7 | m1 <- x %>% mahalanobis_distance("bd", normalize = TRUE) 8 | 9 | test_that("bd_row provides proper messages and warnings", { 10 | 11 | skip_on_cran() 12 | 13 | expect_that(bd_row(m1), throws_error()) 14 | expect_that(bd_row(m1, 250), throws_error()) 15 | expect_that(bd_row(m1, 5, n = 20), throws_error()) 16 | expect_equal(tryCatch(bd_row(m1, 1:5), warning = function(w) "warning"), "warning") 17 | 18 | }) 19 | 20 | test_that("bd_row output is correct dimensions", { 21 | 22 | expect_equal(bd_row(m1, 
5, n = 5) %>% length(), 5) 23 | expect_equal(bd_row(m1, 5) %>% length(), 10) 24 | expect_true(bd_row(m1, 5) %>% is.numeric()) 25 | expect_true(bd_row(m1, 5) %>% is.atomic()) 26 | 27 | }) 28 | 29 | test_that("bd_row computes correctly", { 30 | 31 | expect_equal(bd_row(m1, 5, 1) %>% round(2) %>% .[[1]], 1.58) 32 | expect_equal(bd_row(m1, 36, 1) %>% round(2) %>% .[[1]], 3.02) 33 | 34 | }) 35 | -------------------------------------------------------------------------------- /inst/CITATION: -------------------------------------------------------------------------------- 1 | citHeader("To cite anomalyDetection in publications use:") 2 | 3 | citEntry(entry = "Article", 4 | title = "anomalyDetection: Implementation of augmented network log anomaly detection procedures.", 5 | author = personList(as.person("Robert J. Gutierrez"), 6 | as.person("Bradley C. Boehmke"), 7 | as.person("Kenneth W. Bauer"), 8 | as.person("Cade M. Saie"), 9 | as.person("Trevor J. Bihl")), 10 | journal = "The R Journal", 11 | year = "2017", 12 | volume = "9", 13 | number = "2", 14 | pages = "354--365", 15 | url = "https://journal.r-project.org/archive/2017/RJ-2017-039/index.html", 16 | 17 | textVersion = 18 | paste("Robert J. Gutierrez, Bradley C. Boehmke, Kenneth W. Bauer, Cade M. Saie, Trevor J. 
Bihl (2017).", 19 | "anomalyDetection: Implementation of augmented network log anomaly detection procedures.", 20 | "The R Journal, 9(2), 354--365.", 21 | "URL https://journal.r-project.org/archive/2017/RJ-2017-039/index.html.") 22 | ) 23 | -------------------------------------------------------------------------------- /tests/testthat/test_mahalanobis_distance.R: -------------------------------------------------------------------------------- 1 | context("mahalanobis_distance") 2 | 3 | set.seed(123) 4 | x <- data.frame(C1 = rnorm(100), C2 = rnorm(100), C3 = rnorm(100)) 5 | 6 | 7 | test_that("mahalanobis_distance output has correct dimensions", { 8 | 9 | expect_equal(x %>% dplyr::mutate(MD = mahalanobis_distance(x)) %>% dim(), c(100, 4)) 10 | expect_true(x %>% dplyr::mutate(MD = mahalanobis_distance(x)) %>% is.data.frame()) 11 | expect_equal(x %>% mahalanobis_distance(output = "bd") %>% dim(), c(100, 3)) 12 | expect_true(x %>% mahalanobis_distance(output = "bd") %>% is.matrix()) 13 | expect_equal(x %>% mahalanobis_distance(output = "both") %>% dim(), c(100, 4)) 14 | expect_true(x %>% mahalanobis_distance(output = "both") %>% is.matrix()) 15 | 16 | }) 17 | 18 | test_that("mahalanobis_distance computes correctly", { 19 | 20 | expect_equal(x %>% mahalanobis_distance() %>% .[1] %>% round(2), 5.48) 21 | expect_equal(x %>% mahalanobis_distance(output = "bd") %>% .[1, 1] %>% round(2) %>% .[[1]], 0.71) 22 | expect_equal(x %>% mahalanobis_distance(output = "bd", normalize = TRUE) %>% .[1, 1] %>% round(2) %>% .[[1]], 0.08) 23 | 24 | }) 25 | -------------------------------------------------------------------------------- /NEWS.md: -------------------------------------------------------------------------------- 1 | 2 | # anomalyDetection 0.2.6 3 | 4 | * Tested and fixed any issues with new tibble version dependency 5 | 6 | # anomalyDetection 0.2.5 7 | 8 | * Tested and fixed any issues with new tidyr version dependency 9 | * Updated maintainership and active authors 10 | * 
Updated URL for the new GitHub organization holding this package 11 | * Updated package loading in the vignette to resolve 1 NOTE 12 | * Added reference in the Description field 13 | 14 | # anomalyDetection 0.2.4 15 | 16 | * Added `NEWS` file. 17 | * Better tolerance in `mahalanobis_distance` when inverting covariance matrices. 18 | * `mahalanobis_distance` and `horns_curve` have been rewritten in C++ using the `RcppArmadillo` package. This greatly improved the speed (and accuracy) of these functions. 19 | * `tabulate_state_vector` has been rewritten using the `dplyr` package, greatly improving the speed of this function. Greater traceability is now also present for missing values and numeric variables. 20 | * Producing a histogram matrix to visually display anomalous blocks is now easier with the addition of the `hmat` function. 21 | * Properly registered native routines and disabled symbol search. 22 | -------------------------------------------------------------------------------- /tests/testthat/test_hmat.R: -------------------------------------------------------------------------------- 1 | context("hmat") 2 | 3 | test_that("hmat produces a ggplot object", { 4 | expect_is(hmat(security_logs,block_length = 5),"ggplot") 5 | expect_is(security_logs %>% tabulate_state_vector(5) %>% 6 | hmat(input = "SV"),"ggplot") 7 | expect_is(security_logs %>% tabulate_state_vector(5) %>% 8 | mc_adjust() %>% 9 | mahalanobis_distance("both") %>% 10 | hmat(input = "MD"),"ggplot") 11 | expect_is(hmat(security_logs,block_length = 5, order = "anomaly"),"ggplot") 12 | expect_is(security_logs %>% tabulate_state_vector(5) %>% 13 | hmat(input = "SV", order = "anomaly"),"ggplot") 14 | expect_is(security_logs %>% tabulate_state_vector(5) %>% 15 | mc_adjust() %>% 16 | mahalanobis_distance("both") %>% 17 | hmat(input = "MD", order = "anomaly"),"ggplot") 18 | 19 | }) 20 | 21 | test_that("hmat error messages", { 22 | expect_error(hmat(NULL)) 23 | expect_error(hmat(security_logs,input = 
"SVP",block_length=5)) 24 | expect_error(hmat(6)) 25 | expect_error(hmat(security_logs)) 26 | expect_error(hmat(security_logs,6, order = "SV")) 27 | }) 28 | -------------------------------------------------------------------------------- /tests/testthat/test_principal_components.R: -------------------------------------------------------------------------------- 1 | context("principal_components") 2 | 3 | set.seed(123) 4 | x <- matrix(rnorm(200*3), ncol = 10) 5 | 6 | 7 | test_that("principal_components output has correct dimensions", { 8 | 9 | expect_equal(principal_components(x) %>% length(), 5) 10 | expect_true(principal_components(x) %>% is.list) 11 | expect_equal(principal_components(x) %>% .[[1]] %>% length(), 10) 12 | expect_equal(principal_components(x) %>% .[[2]] %>% dim(), c(10, 10)) 13 | expect_equal(principal_components(x) %>% .[[3]] %>% dim(), c(60, 10)) 14 | expect_equal(principal_components(x) %>% .[[4]] %>% length(), 10) 15 | expect_equal(principal_components(x) %>% .[[5]] %>% length(), 1) 16 | 17 | }) 18 | 19 | test_that("principal_components computes correctly", { 20 | 21 | expect_equal(principal_components(x) %>% .[[1]] %>% .[1] %>% round(3), 1.303) 22 | expect_equal(principal_components(x) %>% .[[2]] %>% .[[1,1]] %>% round(2), 0.31) 23 | expect_equal(principal_components(x) %>% .[[3]] %>% .[[1,1]] %>% round(2), -1.33) 24 | expect_equal(principal_components(x) %>% .[[4]] %>% .[1] %>% round(2), 0.07) 25 | expect_true(principal_components(x) %>% .[[5]] %>% !.) 26 | 27 | }) 28 | -------------------------------------------------------------------------------- /R/get_all_factors.R: -------------------------------------------------------------------------------- 1 | #' @title Find All Factors 2 | #' 3 | #' @description 4 | #' \code{get_all_factors} finds all factor pairs for a given integer (i.e. a number 5 | #' that divides evenly into another number). 
6 | #' 7 | #' @param n number to be factored 8 | #' 9 | #' @return 10 | #' 11 | #' A list containing the integer vector(s) containing all factors for the given 12 | #' \code{n} inputs. 13 | #' 14 | #' @source 15 | #' 16 | #' \url{http://stackoverflow.com/a/6425597/3851274} 17 | #' 18 | #' @examples 19 | #' 20 | #' # Find all the factors of 39304 21 | #' get_all_factors(39304) 22 | #' 23 | #' @export 24 | 25 | get_all_factors <- function(n) { 26 | 27 | # check input type 28 | if(!is.numeric(n)) { 29 | stop("Invalid n argument: see ?get_all_factors for details", call. = FALSE) 30 | } 31 | 32 | prime_factor_tables <- lapply( 33 | stats::setNames(n, n), 34 | function(i) 35 | { 36 | if(i == 1) return(data.frame(x = 1L, freq = 1L)) 37 | plyr::count(as.integer(gmp::factorize(i))) 38 | } 39 | ) 40 | lapply( 41 | prime_factor_tables, 42 | function(pft) { 43 | powers <- plyr::alply(pft, 1, function(row) row$x ^ seq.int(0L, row$freq)) 44 | power_grid <- do.call(expand.grid, powers) 45 | sort(unique(apply(power_grid, 1, prod))) 46 | } 47 | ) 48 | } 49 | -------------------------------------------------------------------------------- /man/horns_curve.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/horns_curve.R 3 | \name{horns_curve} 4 | \alias{horns_curve} 5 | \title{Horn's Parallel Analysis} 6 | \usage{ 7 | horns_curve(data, n, p, nsim = 1000L) 8 | } 9 | \arguments{ 10 | \item{data}{A matrix or data frame.} 11 | 12 | \item{n}{Integer specifying the number of rows.} 13 | 14 | \item{p}{Integer specifying the number of columns.} 15 | 16 | \item{nsim}{Integer specifying the number of Monte Carlo simulations to run. 17 | Default is \code{1000}.} 18 | } 19 | \value{ 20 | A vector of length \code{p} containing the averaged eigenvalues. 
The 21 | values can then be plotted or compared to the true eigenvalues from a dataset 22 | for a dimensionality reduction assessment. 23 | } 24 | \description{ 25 | Computes the average eigenvalues produced by a Monte Carlo simulation that 26 | randomly generates a large number of \code{n}x\code{p} matrices of standard 27 | normal deviates. 28 | } 29 | \examples{ 30 | # Perform Horn's Parallel analysis with matrix n x p dimensions 31 | x <- matrix(rnorm(200 * 10), ncol = 10) 32 | horns_curve(x) 33 | horns_curve(n = 200, p = 10) 34 | plot(horns_curve(x)) # scree plot 35 | } 36 | \references{ 37 | J. L. Horn, "A rationale and test for the number of factors in factor 38 | analysis," Psychometrika, vol. 30, no. 2, pp. 179-185, 1965. 39 | } 40 | -------------------------------------------------------------------------------- /tests/testthat/test_factor_analysis.R: -------------------------------------------------------------------------------- 1 | context("factor_analysis") 2 | 3 | set.seed(123) 4 | x = matrix(rnorm(200*3), ncol = 10) 5 | 6 | hc_points <- x %>% horns_curve() 7 | 8 | test_that("factor_analysis provides proper messages and warnings", { 9 | 10 | expect_that(factor_analysis(x), throws_error()) 11 | expect_that(factor_analysis(hc_points), throws_error()) 12 | 13 | }) 14 | 15 | test_that("factor_analysis output has correct dimensions", { 16 | 17 | expect_equal(factor_analysis(x, hc_points) %>% length(), 5) 18 | expect_true(factor_analysis(x, hc_points) %>% is.list) 19 | expect_equal(factor_analysis(x, hc_points) %>% .[[1]] %>% dim(), c(10, 6)) 20 | expect_equal(factor_analysis(x, hc_points) %>% .[[2]] %>% dim(), c(60, 6)) 21 | expect_equal(factor_analysis(x, hc_points) %>% .[[3]] %>% dim(), c(10, 6)) 22 | expect_equal(factor_analysis(x, hc_points) %>% .[[4]] %>% dim(), c(60, 6)) 23 | expect_equal(factor_analysis(x, hc_points) %>% .[[5]] %>% length(), 1) 24 | 25 | }) 26 | 27 | test_that("factor_analysis computes correctly", { 28 | 29 | skip_on_cran() 30 | 31 
| expect_equal(factor_analysis(x, hc_points) %>% .[[1]] %>% .[1,1] %>% round(3), -0.499) 32 | expect_equal(factor_analysis(x, hc_points) %>% .[[2]] %>% .[1,1] %>% round(2), 0.97) 33 | expect_equal(factor_analysis(x, hc_points) %>% .[[3]] %>% .[1,1] %>% round(2), 0.02) 34 | expect_equal(factor_analysis(x, hc_points) %>% .[[4]] %>% .[1,1] %>% round(2), 0.9) 35 | expect_equal(factor_analysis(x, hc_points) %>% .[[5]], 6) 36 | 37 | }) 38 | -------------------------------------------------------------------------------- /tests/testthat/test_tabulate_state_vector.R: -------------------------------------------------------------------------------- 1 | context("tabulate_state_vector") 2 | 3 | test_that("tabulate_state_vector output has correct dimensions", { 4 | 5 | expect_true(tabulate_state_vector(security_logs, 30) %>% is.data.frame()) 6 | expect_equal(tabulate_state_vector(security_logs, 30) %>% dim(), c(10, 54)) 7 | expect_equal(tabulate_state_vector(security_logs, 5, 4, 2) %>% dim(), c(60, 25)) 8 | 9 | }) 10 | 11 | test_that("tabulate_state_vector computes correctly", { 12 | 13 | expect_equal(tabulate_state_vector(security_logs, 30) %>% .[[1, 1]], 4) 14 | expect_equal(tabulate_state_vector(security_logs, 30) %>% .[[4,2]], 11) 15 | expect_equal(tabulate_state_vector(security_logs, 30, 4, 2) %>% .[[4, 6]], 9) 16 | expect_equal(tabulate_state_vector(security_logs, 7,partial_block = F) %>% nrow,42) 17 | expect_equal(tabulate_state_vector(security_logs, 7,partial_block = T) %>% nrow,43) 18 | expect_equal(tabulate_state_vector(security_logs, 30, na.rm = T) %>% .[[8,5]],2) 19 | 20 | }) 21 | 22 | test_that("tabulate_state_vector throws correct errors", { 23 | 24 | expect_error(tabulate_state_vector(security_logs, "A")) 25 | expect_error(tabulate_state_vector(security_logs, 30, "A")) 26 | expect_error(tabulate_state_vector(security_logs, 30, 40, "A")) 27 | expect_error(tabulate_state_vector(security_logs, 7,partial_block = F) %>% .[[43,1]]) 28 | 
expect_error(tabulate_state_vector(security_logs)) 29 | expect_error(tabulate_state_vector()) 30 | expect_error(tabulate_state_vector(4)) 31 | 32 | }) 33 | -------------------------------------------------------------------------------- /man/bd_row.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/bd_row.R 3 | \name{bd_row} 4 | \alias{bd_row} 5 | \title{Breakdown for Mahalanobis Distance} 6 | \usage{ 7 | bd_row(data, row, n = NULL) 8 | } 9 | \arguments{ 10 | \item{data}{numeric data} 11 | 12 | \item{row}{row of interest} 13 | 14 | \item{n}{number of values to return. By default, will return all variables 15 | (columns) with their respective differences. However, you can choose to view 16 | only the top \code{n} variables by setting the \code{n} value.} 17 | } 18 | \value{ 19 | Returns a vector indicating the variables in \code{data} that are driving the 20 | Mahalanobis distance for the respective row. 21 | } 22 | \description{ 23 | \code{bd_row} indicates which variables in data are driving the Mahalanobis 24 | distance for a specific row \code{row}, relative to the mean vector of the data. 
25 | } 26 | \examples{ 27 | \dontrun{ 28 | x = matrix(rnorm(200*3), ncol = 10) 29 | colnames(x) = paste0("C", 1:ncol(x)) 30 | 31 | # compute the relative differences for row 5 and return all variables 32 | x \%>\% 33 | mahalanobis_distance("bd", normalize = TRUE) \%>\% 34 | bd_row(5) 35 | 36 | # compute the relative differences for row 5 and return the top 3 variables 37 | # that are influencing the Mahalanobis Distance the most 38 | x \%>\% 39 | mahalanobis_distance("bd", normalize = TRUE) \%>\% 40 | bd_row(5, 3) 41 | } 42 | 43 | } 44 | \seealso{ 45 | \code{\link{mahalanobis_distance}} for computing the Mahalanobis Distance values 46 | } 47 | -------------------------------------------------------------------------------- /man/kaisers_index.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/kaisers_index.R 3 | \name{kaisers_index} 4 | \alias{kaisers_index} 5 | \title{Kaiser's Index of Factorial Simplicity} 6 | \usage{ 7 | kaisers_index(loadings) 8 | } 9 | \arguments{ 10 | \item{loadings}{numerical matrix of the factor loadings} 11 | } 12 | \value{ 13 | Vector containing the computed score 14 | } 15 | \description{ 16 | \code{kaisers_index} computes scores designed to assess the quality of a factor 17 | analysis solution. It measures the tendency towards unifactoriality for both 18 | a given row and the entire matrix as a whole. Kaiser proposed the evaluations 19 | of the score shown below: 20 | 21 | \enumerate{ 22 | \item In the .90s: Marvelous 23 | \item In the .80s: Meritorious 24 | \item In the .70s: Middling 25 | \item In the .60s: Mediocre 26 | \item In the .50s: Miserable 27 | \item < .50: Unacceptable 28 | } 29 | 30 | Use as basis for selecting original or rotated loadings/scores in 31 | \code{factor_analysis}. 
32 | } 33 | \examples{ 34 | # Perform Factor Analysis with matrix \\code{x} 35 | x <- matrix(rnorm(200*3), ncol = 10) 36 | 37 | x \%>\% 38 | horns_curve() \%>\% 39 | factor_analysis(x, hc_points = .) \%>\% 40 | factor_analysis_results(fa_loadings_rotated) \%>\% 41 | kaisers_index() 42 | 43 | } 44 | \references{ 45 | H. F. Kaiser, "An index of factorial simplicity," Psychometrika, vol. 39, no. 1, pp. 31-36, 1974. 46 | } 47 | \seealso{ 48 | \code{\link{factor_analysis}} for computing the factor analysis loadings 49 | } 50 | -------------------------------------------------------------------------------- /man/factor_analysis.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/factor_analysis.R 3 | \name{factor_analysis} 4 | \alias{factor_analysis} 5 | \title{Factor Analysis with Varimax Rotation} 6 | \usage{ 7 | factor_analysis(data, hc_points) 8 | } 9 | \arguments{ 10 | \item{data}{numeric data} 11 | 12 | \item{hc_points}{vector of eigenvalues [designed to use output from \code{\link{horns_curve}}]} 13 | } 14 | \value{ 15 | A list containing: 16 | \enumerate{ 17 | \item \code{fa_loadings}: numerical matrix with the original factor loadings 18 | \item \code{fa_scores}: numerical matrix with the row scores for each factor 19 | \item \code{fa_loadings_rotated}: numerical matrix with the varimax rotated factor loadings 20 | \item \code{fa_scores_rotated}: numerical matrix with the row scores for each varimax rotated factor 21 | \item \code{num_factors}: numeric vector identifying the number of factors 22 | } 23 | } 24 | \description{ 25 | \code{factor_analysis} reduces the structure of the data by relating the 26 | correlation between variables to a set of factors, using the eigen-decomposition 27 | of the correlation matrix. 
28 | } 29 | \examples{ 30 | 31 | # Perform Factor Analysis with matrix \\code{x} 32 | x <- matrix(rnorm(200*3), ncol = 10) 33 | 34 | x \%>\% 35 | horns_curve() \%>\% 36 | factor_analysis(x, hc_points = .) 37 | 38 | } 39 | \references{ 40 | H. F. Kaiser, "The Application of Electronic Computers to Factor Analysis," 41 | Educational and Psychological Measurement, 1960. 42 | } 43 | \seealso{ 44 | \code{\link{horns_curve}} for computing the average eigenvalues used for \code{hc_points} argument 45 | } 46 | -------------------------------------------------------------------------------- /man/factor_analysis_results.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/factor_analysis.R 3 | \name{factor_analysis_results} 4 | \alias{factor_analysis_results} 5 | \title{Easy Access to Factor Analysis Results} 6 | \usage{ 7 | factor_analysis_results(data, results = 1) 8 | } 9 | \arguments{ 10 | \item{data}{list output from \code{factor_analysis}} 11 | 12 | \item{results}{factor analysis results to extract. Can use either results 13 | name or number (i.e. 
fa_scores or 2): 14 | \enumerate{ 15 | \item \code{fa_loadings} (default) 16 | \item \code{fa_scores} 17 | \item \code{fa_loadings_rotated} 18 | \item \code{fa_scores_rotated} 19 | \item \code{num_factors} 20 | }} 21 | } 22 | \value{ 23 | Returns one of the selected results: 24 | \enumerate{ 25 | \item \code{fa_loadings}: numerical matrix with the original factor loadings 26 | \item \code{fa_scores}: numerical matrix with the row scores for each factor 27 | \item \code{fa_loadings_rotated}: numerical matrix with the varimax rotated factor loadings 28 | \item \code{fa_scores_rotated}: numerical matrix with the row scores for each varimax rotated factor 29 | \item \code{num_factors}: numeric vector identifying the number of factors 30 | } 31 | } 32 | \description{ 33 | \code{factor_analysis_results} provides easy access to factor analysis results 34 | } 35 | \examples{ 36 | 37 | # An efficient means for getting factor analysis results 38 | x <- matrix(rnorm(200*3), ncol = 10) 39 | N <- nrow(x) 40 | p <- ncol(x) 41 | 42 | x \%>\% 43 | horns_curve() \%>\% 44 | factor_analysis(x, hc_points = .)
\%>\% 45 | factor_analysis_results(fa_scores_rotated) 46 | 47 | } 48 | \seealso{ 49 | \code{\link{factor_analysis}} for computing the factor analysis results 50 | } 51 | -------------------------------------------------------------------------------- /man/mc_adjust.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mc_adjust.R 3 | \name{mc_adjust} 4 | \alias{mc_adjust} 5 | \title{Multi-Collinearity Adjustment} 6 | \usage{ 7 | mc_adjust(data, min_var = 0.1, max_cor = 0.9, action = "exclude") 8 | } 9 | \arguments{ 10 | \item{data}{named numeric data object (either data frame or matrix)} 11 | 12 | \item{min_var}{numeric value between 0-1 for the minimum acceptable variance (default = 0.1)} 13 | 14 | \item{max_cor}{numeric value between 0-1 for the maximum acceptable correlation (default = 0.9)} 15 | 16 | \item{action}{select action for handling columns causing multi-collinearity issues 17 | \enumerate{ 18 | \item \code{exclude}: exclude all columns causing multi-collinearity issues (default) 19 | \item \code{select}: identify the columns causing multi-collinearity issues 20 | and allow the user to interactively select those columns to remove 21 | }} 22 | } 23 | \value{ 24 | \code{mc_adjust} returns the numeric data object supplied minus variables 25 | violating the minimum acceptable variance (\code{min_var}) and the 26 | maximum acceptable correlation (\code{max_cor}) levels. 27 | } 28 | \description{ 29 | \code{mc_adjust} handles issues with multi-collinearity. 30 | } 31 | \details{ 32 | \code{mc_adjust} handles issues with multi-collinearity by first removing 33 | any columns whose variance is close to or less than \code{min_var}. Then, it 34 | removes linearly dependent columns. Finally, it removes any columns that have 35 | a high absolute correlation value equal to or greater than \code{max_cor}. 
36 | } 37 | \examples{ 38 | \dontrun{ 39 | x <- matrix(runif(100), ncol = 10) 40 | x \%>\% 41 | mc_adjust() 42 | 43 | x \%>\% 44 | mc_adjust(min_var = .15, max_cor = .75, action = "select") 45 | } 46 | 47 | } 48 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: anomalyDetection 2 | Type: Package 3 | Title: Implementation of Augmented Network Log Anomaly Detection Procedures 4 | Version: 0.2.6 5 | Authors@R: c( 6 | person("Bradley", "Boehmke", 7 | email = "bradleyboehmke@gmail.com", 8 | role = c("aut", "cre"), 9 | comment = c(ORCID = "0000-0002-3611-8516")), 10 | person("Brandon", "Greenwell", 11 | email = "greenwell.brandon@gmail.com", 12 | role = c("aut"), 13 | comment = c(ORCID = "0000-0002-8120-0084")), 14 | person("Jason", "Freels", 15 | email = "auburngrads@live.com", 16 | role = c("aut")), 17 | person("Robert", "Gutierrez", 18 | email = "rjgutierrez2015@gmail.com", 19 | role = c("aut")) 20 | ) 21 | Maintainer: Bradley Boehmke <bradleyboehmke@gmail.com> 22 | Date: 2018-12-17 23 | Description: Implements procedures developed by Gutierrez et al. (2017, ) 24 | to aid in detecting network log anomalies. By combining various multivariate 25 | analytic approaches relevant to network anomaly detection, it provides cyber 26 | analysts efficient means to detect suspected anomalies requiring further evaluation. 
27 | URL: https://github.com/koalaverse/anomalyDetection 28 | BugReports: https://github.com/koalaverse/anomalyDetection/issues 29 | License: GPL (>= 2) 30 | Encoding: UTF-8 31 | LazyData: true 32 | Depends: 33 | R (>= 2.10) 34 | Imports: 35 | caret, 36 | dplyr, 37 | ggplot2, 38 | gmp, 39 | magrittr, 40 | MASS, 41 | plyr, 42 | purrr, 43 | Rcpp (>= 0.12.11), 44 | stats, 45 | tibble, 46 | tidyr 47 | RoxygenNote: 6.0.1 48 | Suggests: 49 | gplots, 50 | knitr, 51 | RColorBrewer, 52 | rmarkdown, 53 | testthat 54 | LinkingTo: Rcpp, RcppArmadillo 55 | VignetteBuilder: knitr 56 | -------------------------------------------------------------------------------- /R/block_inspect.R: -------------------------------------------------------------------------------- 1 | #' @title Block Inspection 2 | #' 3 | #' @description 4 | #' \code{inspect_block} creates a list where the original data has been divided 5 | #' into blocks denoted in the state vector. Streamlines the process of inspecting 6 | #' specific blocks of interest. 7 | #' 8 | #' @param data data 9 | #' @param block_length integer value to divide data by 10 | #' 11 | #' @return A list where each item is a data frame that contains the original data 12 | #' for each block denoted in the state vector. 13 | #' 14 | #' @seealso 15 | #' 16 | #' \code{\link{tabulate_state_vector}} for creating the state vector matrix based 17 | #' on desired blocks. 18 | #' 19 | #' @examples 20 | #' 21 | #' inspect_block(security_logs, 30) 22 | #' 23 | #' @export 24 | #' 25 | inspect_block <- function(data, block_length){ 26 | 27 | # return error if parameters are missing 28 | if(missing(data)) { 29 | stop("Missing argument: data argument", call. = FALSE) 30 | } 31 | if(missing(block_length)) { 32 | stop("Missing argument: block_length argument", call. = FALSE) 33 | } 34 | 35 | # return error if arguments are wrong type 36 | if(!is.data.frame(data) && !is.matrix(data)) { 37 | stop("data must be a data frame or matrix", call.
= FALSE) 38 | } 39 | if(is.null(nrow(data)) || isTRUE(nrow(data) < block_length)) { 40 | stop("Your data input does not have sufficient number of rows", call. = FALSE) 41 | } 42 | if(!is.numeric(block_length)) { 43 | stop("block_length must be a numeric input", call. = FALSE) 44 | } 45 | 46 | num_rows <- nrow(data) 47 | num_blocks <- as.numeric(floor(num_rows / block_length)) 48 | 49 | Blocks <- vector("list", num_blocks) 50 | 51 | i <- 1 52 | start <- 1 53 | for (i in seq_len(num_blocks)) { 54 | stopp <- block_length*i 55 | Blocks[[i]] <- data[start:stopp,] 56 | start <- stopp + 1 57 | } 58 | 59 | return(Blocks) 60 | } 61 | -------------------------------------------------------------------------------- /R/kaisers_index.R: -------------------------------------------------------------------------------- 1 | #' @title Kaiser's Index of Factorial Simplicity 2 | #' 3 | #' @description 4 | #' \code{kaisers_index} computes scores designed to assess the quality of a factor 5 | #' analysis solution. It measures the tendency towards unifactoriality for both 6 | #' a given row and the entire matrix as a whole. Kaiser proposed the evaluations 7 | #' of the score shown below: 8 | #' 9 | #' \enumerate{ 10 | #' \item In the .90s: Marvelous 11 | #' \item In the .80s: Meritorious 12 | #' \item In the .70s: Middling 13 | #' \item In the .60s: Mediocre 14 | #' \item In the .50s: Miserable 15 | #' \item < .50: Unacceptable 16 | #' } 17 | #' 18 | #' Use as basis for selecting original or rotated loadings/scores in 19 | #' \code{factor_analysis}. 20 | #' 21 | #' @param loadings numerical matrix of the factor loadings 22 | #' 23 | #' @return Vector containing the computed score 24 | #' 25 | #' @references 26 | #' 27 | #' H. F. Kaiser, "An index of factorial simplicity," Psychometrika, vol. 39, no. 1, pp. 31-36, 1974. 
28 | #' 29 | #' @seealso 30 | #' 31 | #' \code{\link{factor_analysis}} for computing the factor analysis loadings 32 | #' 33 | #' 34 | #' @examples 35 | #' # Perform Factor Analysis with matrix \code{x} 36 | #' x <- matrix(rnorm(200*3), ncol = 10) 37 | #' 38 | #' x %>% 39 | #' horns_curve() %>% 40 | #' factor_analysis(x, hc_points = .) %>% 41 | #' factor_analysis_results(fa_loadings_rotated) %>% 42 | #' kaisers_index() 43 | #' 44 | #'@export 45 | 46 | kaisers_index <- function(loadings) { 47 | 48 | if(!is.numeric(loadings)) { 49 | stop("loadings must be numeric") 50 | } 51 | 52 | N <- nrow(loadings) 53 | M <- ncol(loadings) 54 | 55 | t2 <- rowSums(loadings ^ 2) 56 | t4 <- rowSums(loadings ^ 4) 57 | 58 | sum1 <- sum(M * t4[seq_len(M)] - t2[seq_len(M)] ^ 2) 59 | sum2 <- sum((M - 1) * t2[seq_len(M)] ^ 2) 60 | 61 | result <- sqrt(sum1/sum2) 62 | return(result) 63 | 64 | } 65 | -------------------------------------------------------------------------------- /man/tabulate_state_vector.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/tabulate_state_vector.R 3 | \name{tabulate_state_vector} 4 | \alias{tabulate_state_vector} 5 | \title{Tabulate State Vector} 6 | \usage{ 7 | tabulate_state_vector(data, block_length, level_limit = 50L, 8 | level_keep = 10L, partial_block = FALSE, na.rm = FALSE) 9 | } 10 | \arguments{ 11 | \item{data}{data} 12 | 13 | \item{block_length}{integer value to divide data by} 14 | 15 | \item{level_limit}{integer value to determine the cutoff for the number of 16 | factors in a column to display before being reduced to show the number of 17 | levels to keep (default is 50)} 18 | 19 | \item{level_keep}{integer value indicating the top number of factor levels to 20 | retain if a column has more than the level limit (default is 10)} 21 | 22 | \item{partial_block}{a logical which determines whether incomplete blocks are kept in 23 | 
the analysis in the case where the number of log entries isn't evenly 24 | divisible by the \code{block_length}} 25 | 26 | \item{na.rm}{whether to keep track of missing values as part of the analysis or 27 | ignore them} 28 | } 29 | \value{ 30 | A data frame where each row represents one block and the columns count 31 | the number of occurrences that character/factor level occurred in that block 32 | } 33 | \description{ 34 | \code{tabulate_state_vector} employs a tabulated vector approach to transform 35 | security log data into unique counts of data attributes based on time blocks. 36 | Taking a contingency table approach, this function separates variables of type 37 | character or factor into their unique levels and counts the number of occurrences 38 | for those levels within each block. Due to the large number of unique IP addresses, 39 | this function allows for the user to determine how many IP addresses they would 40 | like to investigate. The function tabulates the most popular IP addresses. 41 | } 42 | \examples{ 43 | tabulate_state_vector(security_logs, 30) 44 | 45 | } 46 | -------------------------------------------------------------------------------- /R/horns_curve.R: -------------------------------------------------------------------------------- 1 | #' Horn's Parallel Analysis 2 | #' 3 | #' Computes the average eigenvalues produced by a Monte Carlo simulation that 4 | #' randomly generates a large number of \code{n}x\code{p} matrices of standard 5 | #' normal deviates. 6 | #' 7 | #' @param data A matrix or data frame. 8 | #' 9 | #' @param n Integer specifying the number of rows. 10 | #' 11 | #' @param p Integer specifying the number of columns. 12 | #' 13 | #' @param nsim Integer specifying the number of Monte Carlo simulations to run. 14 | #' Default is \code{1000}. 15 | #' 16 | #' @return A vector of length \code{p} containing the averaged eigenvalues. 
The 17 | #' values can then be plotted or compared to the true eigenvalues from a dataset 18 | #' for a dimensionality reduction assessment. 19 | #' 20 | #' @references 21 | #' J. L. Horn, "A rationale and test for the number of factors in factor 22 | #' analysis," Psychometrika, vol. 30, no. 2, pp. 179-185, 1965. 23 | #' 24 | #' @rdname horns_curve 25 | #' 26 | #' @export 27 | #' 28 | #' @examples 29 | #' # Perform Horn's Parallel analysis with matrix n x p dimensions 30 | #' x <- matrix(rnorm(200 * 10), ncol = 10) 31 | #' horns_curve(x) 32 | #' horns_curve(n = 200, p = 10) 33 | #' plot(horns_curve(x)) # scree plot 34 | horns_curve <- function(data, n, p, nsim = 1000L) { 35 | if (!missing(data)) { 36 | if (!inherits(data, c("data.frame", "matrix"))) { 37 | stop("data must be a data frame or matrix") 38 | } 39 | compute_hc(n = nrow(data), p = ncol(data), nsim = nsim)[, , drop = TRUE] 40 | } else { 41 | if (missing(n)) { 42 | stop("missing n; please supply the number of rows") 43 | } 44 | if (missing(p)) { 45 | stop("missing p; please supply the number of columns") 46 | } 47 | if (!(n %% 2 %in% c(0, 1)) || n < 1) { 48 | stop("n should be a positive integer") 49 | } 50 | if (!(p %% 2 %in% c(0, 1)) || p < 1) { 51 | stop("p should be a positive integer") 52 | } 53 | compute_hc(n = n, p = p, nsim = nsim)[, , drop = TRUE] 54 | } 55 | } 56 | 57 | -------------------------------------------------------------------------------- /man/principal_components_result.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/pca.R 3 | \name{principal_components_result} 4 | \alias{principal_components_result} 5 | \title{Easy Access to Principal Component Analysis Results} 6 | \usage{ 7 | principal_components_result(data, results = 2) 8 | } 9 | \arguments{ 10 | \item{data}{list output from \code{principal_components}} 11 | 12 | \item{results}{principal component analysis 
results to extract. Can use either 13 | results name or number (i.e. pca_loadings or 2): 14 | \enumerate{ 15 | \item \code{pca_sdev} 16 | \item \code{pca_loadings} (default) 17 | \item \code{pca_rotated} 18 | \item \code{pca_center} 19 | \item \code{pca_scale} 20 | }} 21 | } 22 | \value{ 23 | Returns one of the selected results: 24 | \enumerate{ 25 | \item \code{pca_sdev}: the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the correlation matrix, though the calculation is actually done with the singular values of the data matrix). 26 | \item \code{pca_loadings}: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). 27 | \item \code{pca_rotated}: if \code{retx} is \code{TRUE} the value of the rotated data (the centred (and scaled if requested) data multiplied by the rotation matrix) is returned. Hence, \code{cov(x)} is the diagonal matrix \code{diag(sdev^2)}. 28 | \item \code{pca_center}: the centering used 29 | \item \code{pca_scale}: whether scaling was used 30 | } 31 | } 32 | \description{ 33 | \code{principal_components_result} Provides easy access to principal 34 | component analysis results 35 | } 36 | \examples{ 37 | 38 | # An efficient means for getting principal component analysis results 39 | x <- matrix(rnorm(200 * 3), ncol = 10) 40 | 41 | principal_components(x) \%>\% 42 | principal_components_result(pca_loadings) 43 | 44 | } 45 | \seealso{ 46 | \code{\link{principal_components}} for computing the principal components results 47 | } 48 | -------------------------------------------------------------------------------- /tests/testthat/test_factor_analysis_results.R: -------------------------------------------------------------------------------- 1 | context("factor_analysis_result") 2 | 3 | set.seed(123) 4 | x = matrix(rnorm(200*3), ncol = 10) 5 | 6 | fa <- x %>% 7 | horns_curve() %>% 8 | factor_analysis(x, hc_points = .) 
9 | 10 | test_that("factor_analysis_result provides proper messages and warnings", { 11 | 12 | expect_that(factor_analysis_results(), throws_error()) 13 | expect_that(factor_analysis_results(fa, results = 6), throws_error()) 14 | expect_that(factor_analysis_results(fa, results = "wrong"), throws_error()) 15 | 16 | }) 17 | 18 | test_that("factor_analysis_result output has correct dimensions", { 19 | 20 | expect_equal(factor_analysis_results(fa) %>% dim(), c(10, 6)) 21 | expect_equal(factor_analysis_results(fa, 1) %>% dim(), c(10, 6)) 22 | expect_equal(factor_analysis_results(fa, 2) %>% dim(), c(60, 6)) 23 | expect_equal(factor_analysis_results(fa, 3) %>% dim(), c(10, 6)) 24 | expect_equal(factor_analysis_results(fa, 4) %>% dim(), c(60, 6)) 25 | expect_equal(factor_analysis_results(fa, 5) %>% length(), 1) 26 | 27 | expect_equal(factor_analysis_results(fa, fa_loadings) %>% dim(), c(10, 6)) 28 | expect_equal(factor_analysis_results(fa, fa_scores) %>% dim(), c(60, 6)) 29 | expect_equal(factor_analysis_results(fa, fa_loadings_rotated) %>% dim(), c(10, 6)) 30 | expect_equal(factor_analysis_results(fa, fa_scores_rotated) %>% dim(), c(60, 6)) 31 | expect_equal(factor_analysis_results(fa, num_factors) %>% length(), 1) 32 | 33 | }) 34 | 35 | test_that("factor_analysis_result gets the right output", { 36 | 37 | skip_on_cran() 38 | 39 | expect_equal(factor_analysis_results(fa) %>% .[1,1] %>% round(3), -0.499) 40 | expect_equal(factor_analysis_results(fa, 1) %>% .[1,1] %>% round(3), -0.499) 41 | expect_equal(factor_analysis_results(fa, 2) %>% .[1,1] %>% round(2), 0.97) 42 | expect_equal(factor_analysis_results(fa, 3) %>% .[1,1] %>% round(2), 0.02) 43 | expect_equal(factor_analysis_results(fa, 4) %>% .[1,1] %>% round(2), 0.9) 44 | expect_equal(factor_analysis_results(fa, 5), 6) 45 | 46 | }) 47 | -------------------------------------------------------------------------------- /tests/testthat/test_principal_components_result.R: 
-------------------------------------------------------------------------------- 1 | context("principal_components_result") 2 | 3 | set.seed(123) 4 | x <- matrix(rnorm(200*3), ncol = 10) 5 | 6 | pca <- principal_components(x) 7 | 8 | test_that("principal_components_result provides proper messages and warnings", { 9 | 10 | expect_that(principal_components_result(), throws_error()) 11 | expect_that(principal_components_result(pca, results = 6), throws_error()) 12 | expect_that(principal_components_result(pca, results = "wrong"), throws_error()) 13 | 14 | }) 15 | 16 | test_that("principal_components_result output has correct dimensions", { 17 | 18 | expect_equal(principal_components_result(pca) %>% length, 100) 19 | expect_equal(principal_components_result(pca, 1) %>% length(), 10) 20 | expect_equal(principal_components_result(pca, 2) %>% dim(), c(10, 10)) 21 | expect_equal(principal_components_result(pca, 3) %>% dim(), c(60, 10)) 22 | expect_equal(principal_components_result(pca, 4) %>% length(), 10) 23 | expect_equal(principal_components_result(pca, 5) %>% length(), 1) 24 | 25 | expect_equal(principal_components_result(pca, pca_sdev) %>% length(), 10) 26 | expect_equal(principal_components_result(pca, pca_loadings) %>% dim(), c(10, 10)) 27 | expect_equal(principal_components_result(pca, pca_rotated) %>% dim(), c(60, 10)) 28 | expect_equal(principal_components_result(pca, pca_center) %>% length(), 10) 29 | expect_equal(principal_components_result(pca, pca_scale) %>% length(), 1) 30 | 31 | }) 32 | 33 | test_that("principal_components_result gets the right output", { 34 | 35 | expect_equal(principal_components_result(pca) %>% .[1] %>% round(3), 0.315) 36 | expect_equal(principal_components_result(pca, 1) %>% .[1] %>% round(3), 1.303) 37 | expect_equal(principal_components_result(pca, 2) %>% .[[1,1]] %>% round(2), 0.31) 38 | expect_equal(principal_components_result(pca, 3) %>% .[[1,1]] %>% round(2), -1.33) 39 | expect_equal(principal_components_result(pca, 4) %>% .[[1]] 
%>% round(2), 0.07) 40 | expect_true(principal_components_result(pca, 5) %>% !.) 41 | 42 | }) 43 | -------------------------------------------------------------------------------- /src/bottlenecks.cpp: -------------------------------------------------------------------------------- 1 | // -*- mode: C++; c-indent-level: 4; c-basic-offset: 4; indent-tabs-mode: nil; -*- 2 | 3 | // we only include RcppArmadillo.h which pulls Rcpp.h in for us 4 | #include "RcppArmadillo.h" 5 | 6 | // via the depends attribute we tell Rcpp to create hooks for 7 | // RcppArmadillo so that the build process will know what to do 8 | // 9 | // [[Rcpp::depends(RcppArmadillo)]] 10 | 11 | 12 | // [[Rcpp::export]] 13 | arma::mat compute_md(arma::mat x) { 14 | arma::rowvec colmeans = arma::mean(x, 0); 15 | arma::mat xsweep = x.each_row() - colmeans; 16 | return sum(xsweep * arma::pinv(arma::cov(x)) % xsweep, 1); 17 | } 18 | 19 | 20 | // [[Rcpp::export]] 21 | arma::mat compute_bd(arma::mat x, bool normalize) { 22 | arma::mat covx = arma::cov(x); 23 | arma::vec sqrt_diag_covx = arma::sqrt(covx.diag()); 24 | arma::rowvec colmeans = arma::mean(x, 0); 25 | arma::mat xsweep = x.each_row() - colmeans; 26 | arma::mat bd = arma::abs(xsweep.each_row() / sqrt_diag_covx.t()); 27 | if (normalize) { 28 | bd = bd * arma::diagmat(1 / arma::sum(x, 0)); 29 | } 30 | return bd; 31 | } 32 | 33 | 34 | // [[Rcpp::export]] 35 | arma::mat compute_md_and_bd(arma::mat x, bool normalize) { 36 | arma::mat covx = arma::cov(x); 37 | arma::vec sqrt_diag_covx = arma::sqrt(covx.diag()); 38 | arma::rowvec colmeans = arma::mean(x, 0); 39 | arma::mat xsweep = x.each_row() - colmeans; 40 | arma::vec md = sum(xsweep * arma::pinv(arma::cov(x)) % xsweep, 1); 41 | arma::mat out = arma::abs(xsweep.each_row() / sqrt_diag_covx.t()); 42 | if (normalize) { 43 | out = out * arma::diagmat(1 / arma::sum(x, 0)); 44 | } 45 | out.insert_cols(0, md); 46 | return out; 47 | } 48 | 49 | 50 | // [[Rcpp::export]] 51 | arma::mat compute_hc(int n, int p, 
int nsim) { 52 | arma::vec eig_vals(p); 53 | arma::mat eig_vecs(nsim, p); 54 | for(int i = 0; i < nsim; ++i) { 55 | if (i % 100 == 0) { 56 | Rcpp::checkUserInterrupt(); // check for user interruption 57 | } 58 | eig_vals = arma::eig_sym(arma::cov(arma::randn(n, p))); 59 | eig_vecs.row(i) = arma::sort(eig_vals, "descend").t(); 60 | } 61 | return arma::mean(eig_vecs, 0).t(); 62 | } 63 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | [![Lifecycle Status](https://img.shields.io/badge/lifecycle-archived-brightgreen.svg)](https://img.shields.io/badge/lifecycle-archived-brightgreen.svg) 4 | [![Travis-CI Build Status](https://travis-ci.org/koalaverse/anomalyDetection.svg?branch=master)](https://travis-ci.org/koalaverse/anomalyDetection) 5 | [![codecov](https://codecov.io/gh/koalaverse/anomalyDetection/branch/master/graph/badge.svg)](https://codecov.io/gh/koalaverse/anomalyDetection) 6 | [![Total Downloads](http://cranlogs.r-pkg.org/badges/grand-total/anomalyDetection)](http://cranlogs.r-pkg.org/badges/grand-total/anomalyDetection) 7 | 8 | anomalyDetection 9 | ===================================================================================================== 10 | 11 | `anomalyDetection` implements procedures to aid in detecting network log anomalies. By combining various multivariate analytic approaches relevant to network anomaly detection, it provides cyber analysts an efficient means to detect suspected anomalies requiring further evaluation. 12 | 13 | Installation 14 | ------------ 15 | 16 | You can install `anomalyDetection` two ways. 
17 | 18 | - Using the latest released version from CRAN: 19 | 20 | 21 | 22 | install.packages("anomalyDetection") 23 | 24 | - Using the latest development version from GitHub: 25 | 26 | 27 | 28 | if (packageVersion("devtools") < 1.6) { 29 | install.packages("devtools") 30 | } 31 | 32 | devtools::install_github("koalaverse/anomalyDetection", build_vignettes = TRUE) 33 | 34 | Learning 35 | -------- 36 | 37 | To get started with `anomalyDetection`, read the intro [vignette](https://cran.r-project.org/web/packages/anomalyDetection/vignettes/Introduction.html): `vignette("Introduction", package = "anomalyDetection")`. This will provide a thorough introduction to the functions provided in the package. 38 | 39 | References 40 | ---------- 41 | 42 | Gutierrez, R.J., Boehmke, B.C., Bauer, K.W., Saie, C.M. & Bihl, T.J. (2017) "`anomalyDetection`: Implementation of augmented network log anomaly detection procedures." The R Journal, 9(2), 354-365. [link](https://journal.r-project.org/archive/2017/RJ-2017-039/index.html) 43 | -------------------------------------------------------------------------------- /man/mahalanobis_distance.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/mahalanobis_distance.R 3 | \name{mahalanobis_distance} 4 | \alias{mahalanobis_distance} 5 | \alias{mahalanobis_distance.matrix} 6 | \alias{mahalanobis_distance.data.frame} 7 | \title{Mahalanobis Distance} 8 | \usage{ 9 | mahalanobis_distance(data, output = c("md", "bd", "both"), 10 | normalize = FALSE) 11 | 12 | \method{mahalanobis_distance}{matrix}(data, output = c("md", "bd", "both"), 13 | normalize = FALSE) 14 | 15 | \method{mahalanobis_distance}{data.frame}(data, output = c("md", "bd", 16 | "both"), normalize = FALSE) 17 | } 18 | \arguments{ 19 | \item{data}{A matrix or data frame. 
Data frames will be converted to matrices 20 | via \code{data.matrix}.} 21 | 22 | \item{output}{Character string specifying which distance metric(s) to 23 | compute. Current options include: \code{"md"} for Mahalanobis distance 24 | (default); \code{"bd"} for absolute breakdown distance (used to see which 25 | columns drive the Mahalanobis distance); and \code{"both"} to return both 26 | distance metrics.} 27 | 28 | \item{normalize}{Logical indicating whether or not to normalize the breakdown 29 | distances within each column (so that breakdown distances across columns can 30 | be compared).} 31 | } 32 | \value{ 33 | If \code{output = "md"}, then a vector containing the Mahalanobis 34 | distances is returned. Otherwise, a matrix. 35 | } 36 | \description{ 37 | Calculates the distance between the elements in a data set and the mean 38 | vector of the data for outlier detection. Values are independent of the scale 39 | between variables. 40 | } 41 | \examples{ 42 | \dontrun{ 43 | # Simulate some data 44 | x <- data.frame(C1 = rnorm(100), C2 = rnorm(100), C3 = rnorm(100)) 45 | 46 | # Add Mahalanobis distances 47 | x \%>\% dplyr::mutate(MD = mahalanobis_distance(x)) 48 | 49 | # Add Mahalanobis and breakdown distances 50 | x \%>\% cbind(mahalanobis_distance(x, output = "both")) 51 | 52 | # Add Mahalanobis and normalized breakdown distances 53 | x \%>\% cbind(mahalanobis_distance(x, output = "both", normalize = TRUE)) 54 | } 55 | } 56 | \references{ 57 | W. Wang and R. Battiti, "Identifying Intrusions in Computer Networks with 58 | Principal Component Analysis," in First International Conference on 59 | Availability, Reliability and Security, 2006. 60 | } 61 | -------------------------------------------------------------------------------- /docs/docsearch.js: -------------------------------------------------------------------------------- 1 | $(function() { 2 | 3 | // register a handler to move the focus to the search bar 4 | // upon pressing shift + "/" (i.e. 
"?") 5 | $(document).on('keydown', function(e) { 6 | if (e.shiftKey && e.keyCode == 191) { 7 | e.preventDefault(); 8 | $("#search-input").focus(); 9 | } 10 | }); 11 | 12 | $(document).ready(function() { 13 | // do keyword highlighting 14 | /* modified from https://jsfiddle.net/julmot/bL6bb5oo/ */ 15 | var mark = function() { 16 | 17 | var referrer = document.URL ; 18 | var paramKey = "q" ; 19 | 20 | if (referrer.indexOf("?") !== -1) { 21 | var qs = referrer.substr(referrer.indexOf('?') + 1); 22 | var qs_noanchor = qs.split('#')[0]; 23 | var qsa = qs_noanchor.split('&'); 24 | var keyword = ""; 25 | 26 | for (var i = 0; i < qsa.length; i++) { 27 | var currentParam = qsa[i].split('='); 28 | 29 | if (currentParam.length !== 2) { 30 | continue; 31 | } 32 | 33 | if (currentParam[0] == paramKey) { 34 | keyword = decodeURIComponent(currentParam[1].replace(/\+/g, "%20")); 35 | } 36 | } 37 | 38 | if (keyword !== "") { 39 | $(".contents").unmark({ 40 | done: function() { 41 | $(".contents").mark(keyword); 42 | } 43 | }); 44 | } 45 | } 46 | }; 47 | 48 | mark(); 49 | }); 50 | }); 51 | 52 | /* Search term highlighting ------------------------------*/ 53 | 54 | function matchedWords(hit) { 55 | var words = []; 56 | 57 | var hierarchy = hit._highlightResult.hierarchy; 58 | // loop to fetch from lvl0, lvl1, etc. 
59 | for (var idx in hierarchy) { 60 | words = words.concat(hierarchy[idx].matchedWords); 61 | } 62 | 63 | var content = hit._highlightResult.content; 64 | if (content) { 65 | words = words.concat(content.matchedWords); 66 | } 67 | 68 | // return unique words 69 | var words_uniq = [...new Set(words)]; 70 | return words_uniq; 71 | } 72 | 73 | function updateHitURL(hit) { 74 | 75 | var words = matchedWords(hit); 76 | var url = ""; 77 | 78 | if (hit.anchor) { 79 | url = hit.url_without_anchor + '?q=' + escape(words.join(" ")) + '#' + hit.anchor; 80 | } else { 81 | url = hit.url + '?q=' + escape(words.join(" ")); 82 | } 83 | 84 | return url; 85 | } 86 | -------------------------------------------------------------------------------- /R/bd_row.R: -------------------------------------------------------------------------------- 1 | #' @title Breakdown for Mahalanobis Distance 2 | #' 3 | #' @description 4 | #' \code{bd_row} indicates which variables in data are driving the Mahalanobis 5 | #' distance for a specific row \code{r}, relative to the mean vector of the data. 6 | #' 7 | #' @param data numeric data 8 | #' @param row row of interest 9 | #' @param n number of values to return. By default, will return all variables 10 | #' (columns) with their respective differences. However, you can choose to view 11 | #' only the top \code{n} variables by setting the \code{n} value. 12 | #' 13 | #' @return 14 | #' 15 | #' Returns a vector indicating the variables in \code{data} that are driving the 16 | #' Mahalanobis distance for the respective row. 
17 | #' 18 | #' @seealso 19 | #' 20 | #' \code{\link{mahalanobis_distance}} for computing the Mahalanobis Distance values 21 | #' 22 | #' @examples 23 | #' \dontrun{ 24 | #' x <- matrix(rnorm(200*3), ncol = 10) 25 | #' colnames(x) <- paste0("C", 1:ncol(x)) 26 | #' 27 | #' # compute the relative differences for row 5 and return all variables 28 | #' x %>% 29 | #' mahalanobis_distance("bd", normalize = TRUE) %>% 30 | #' bd_row(5) 31 | #' 32 | #' # compute the relative differences for row 5 and return the top 3 variables 33 | #' # that are influencing the Mahalanobis Distance the most 34 | #' x %>% 35 | #' mahalanobis_distance("bd", normalize = TRUE) %>% 36 | #' bd_row(5, 3) 37 | #' } 38 | #' 39 | #' @export 40 | 41 | bd_row <- function(data, row, n = NULL) { 42 | 43 | # return error if parameters are missing or invalid 44 | if(missing(data)) { 45 | stop("Missing data argument", call. = FALSE) 46 | } 47 | # check length before membership so a non-scalar row gives a clear error 48 | if(length(row) != 1) { 49 | stop("row value must be a single integer", call. = FALSE) 50 | } 51 | if(! row %in% seq_len(nrow(data))) { 52 | stop("Invalid row value", call. = FALSE) 53 | } 54 | if(!isTRUE(n %in% seq_len(ncol(data))) && !is.null(n)) { 55 | stop("Invalid n value", call.
= FALSE) 55 | } 56 | 57 | C <- stats::cov(data) 58 | CM <- as.matrix(colMeans(data)) 59 | D <- (data[row,] - t(CM)) 60 | bd <- D /sqrt(diag(C)) 61 | bd_abs <- abs(bd) 62 | tmp <- sort(bd_abs, decreasing = TRUE, index.return = TRUE) 63 | bd_sort <- tmp$x 64 | bd_index <- tmp$ix 65 | 66 | output <- bd_sort 67 | names(output) <- colnames(data)[bd_index] 68 | 69 | if(!is.null(n)) { 70 | output <- output[seq_len(n)] 71 | } 72 | 73 | return(output) 74 | 75 | } 76 | 77 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | output: github_document 3 | --- 4 | 5 | 6 | 7 | ```{r, echo = FALSE} 8 | knitr::opts_chunk$set( 9 | collapse = TRUE, 10 | comment = "#>", 11 | fig.path = "README-" 12 | ) 13 | ``` 14 | 15 | [![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/anomalyDetection)](https://cran.r-project.org/package=anomalyDetection) 16 | [![Travis-CI Build Status](https://travis-ci.org/koalaverse/anomalyDetection.svg?branch=master)](https://travis-ci.org/koalaverse/anomalyDetection) 17 | [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/bradleyboehmke/anomalyDetection?branch=master&svg=true)](https://ci.appveyor.com/project/bradleyboehmke/anomalyDetection) 18 | [![codecov](https://codecov.io/gh/koalaverse/anomalyDetection/branch/master/graph/badge.svg)](https://codecov.io/gh/koalaverse/anomalyDetection) 19 | [![Downloads](http://cranlogs.r-pkg.org/badges/anomalyDetection)](http://cranlogs.r-pkg.org/badges/anomalyDetection) 20 | [![Total Downloads](http://cranlogs.r-pkg.org/badges/grand-total/anomalyDetection)](http://cranlogs.r-pkg.org/badges/grand-total/anomalyDetection) 21 | 22 | # anomalyDetection 23 | 24 | `anomalyDetection` implements procedures to aid in detecting network log anomalies. 
By combining various multivariate analytic approaches relevant to network anomaly detection, it provides cyber analysts an efficient means to detect suspected anomalies requiring further evaluation. 25 | 26 | 27 | ## Installation 28 | 29 | You can install `anomalyDetection` two ways. 30 | 31 | - Using the latest released version from CRAN: 32 | 33 | ``` 34 | install.packages("anomalyDetection") 35 | ``` 36 | 37 | - Using the latest development version from GitHub: 38 | 39 | ``` 40 | if (packageVersion("devtools") < 1.6) { 41 | install.packages("devtools") 42 | } 43 | 44 | devtools::install_github("koalaverse/anomalyDetection", build_vignettes = TRUE) 45 | ``` 46 | 47 | ## Learning 48 | 49 | To get started with `anomalyDetection`, read the intro [vignette](https://cran.r-project.org/web/packages/anomalyDetection/vignettes/Introduction.html): `vignette("Introduction", package = "anomalyDetection")`. This will provide a thorough introduction to the functions provided in the package. 50 | 51 | ## References 52 | 53 | Gutierrez, R.J., Boehmke, B.C., Bauer, K.W., Saie, C.M. & Bihl, T.J. (2017) "`anomalyDetection`: Implementation of augmented network log anomaly detection procedures." The R Journal, 9(2), 354-365. 
[link](https://journal.r-project.org/archive/2017/RJ-2017-039/index.html) 54 | -------------------------------------------------------------------------------- /docs/jquery.sticky-kit.min.js: -------------------------------------------------------------------------------- 1 | /* 2 | Sticky-kit v1.1.2 | WTFPL | Leaf Corcoran 2015 | http://leafo.net 3 | */ 4 | (function(){var b,f;b=this.jQuery||window.jQuery;f=b(window);b.fn.stick_in_parent=function(d){var A,w,J,n,B,K,p,q,k,E,t;null==d&&(d={});t=d.sticky_class;B=d.inner_scrolling;E=d.recalc_every;k=d.parent;q=d.offset_top;p=d.spacer;w=d.bottoming;null==q&&(q=0);null==k&&(k=void 0);null==B&&(B=!0);null==t&&(t="is_stuck");A=b(document);null==w&&(w=!0);J=function(a,d,n,C,F,u,r,G){var v,H,m,D,I,c,g,x,y,z,h,l;if(!a.data("sticky_kit")){a.data("sticky_kit",!0);I=A.height();g=a.parent();null!=k&&(g=g.closest(k)); 5 | if(!g.length)throw"failed to find stick parent";v=m=!1;(h=null!=p?p&&a.closest(p):b("
"))&&h.css("position",a.css("position"));x=function(){var c,f,e;if(!G&&(I=A.height(),c=parseInt(g.css("border-top-width"),10),f=parseInt(g.css("padding-top"),10),d=parseInt(g.css("padding-bottom"),10),n=g.offset().top+c+f,C=g.height(),m&&(v=m=!1,null==p&&(a.insertAfter(h),h.detach()),a.css({position:"",top:"",width:"",bottom:""}).removeClass(t),e=!0),F=a.offset().top-(parseInt(a.css("margin-top"),10)||0)-q, 6 | u=a.outerHeight(!0),r=a.css("float"),h&&h.css({width:a.outerWidth(!0),height:u,display:a.css("display"),"vertical-align":a.css("vertical-align"),"float":r}),e))return l()};x();if(u!==C)return D=void 0,c=q,z=E,l=function(){var b,l,e,k;if(!G&&(e=!1,null!=z&&(--z,0>=z&&(z=E,x(),e=!0)),e||A.height()===I||x(),e=f.scrollTop(),null!=D&&(l=e-D),D=e,m?(w&&(k=e+u+c>C+n,v&&!k&&(v=!1,a.css({position:"fixed",bottom:"",top:c}).trigger("sticky_kit:unbottom"))),eb&&!v&&(c-=l,c=Math.max(b-u,c),c=Math.min(q,c),m&&a.css({top:c+"px"})))):e>F&&(m=!0,b={position:"fixed",top:c},b.width="border-box"===a.css("box-sizing")?a.outerWidth()+"px":a.width()+"px",a.css(b).addClass(t),null==p&&(a.after(h),"left"!==r&&"right"!==r||h.append(a)),a.trigger("sticky_kit:stick")),m&&w&&(null==k&&(k=e+u+c>C+n),!v&&k)))return v=!0,"static"===g.css("position")&&g.css({position:"relative"}), 8 | a.css({position:"absolute",bottom:d,top:"auto"}).trigger("sticky_kit:bottom")},y=function(){x();return l()},H=function(){G=!0;f.off("touchmove",l);f.off("scroll",l);f.off("resize",y);b(document.body).off("sticky_kit:recalc",y);a.off("sticky_kit:detach",H);a.removeData("sticky_kit");a.css({position:"",bottom:"",top:"",width:""});g.position("position","");if(m)return null==p&&("left"!==r&&"right"!==r||a.insertAfter(h),h.remove()),a.removeClass(t)},f.on("touchmove",l),f.on("scroll",l),f.on("resize", 9 | y),b(document.body).on("sticky_kit:recalc",y),a.on("sticky_kit:detach",H),setTimeout(l,0)}};n=0;for(K=this.length;n do not edit by hand 2 | // Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393 3 | 4 | #include 
5 | #include <Rcpp.h> 6 | 7 | using namespace Rcpp; 8 | 9 | // compute_md 10 | arma::mat compute_md(arma::mat x); 11 | RcppExport SEXP _anomalyDetection_compute_md(SEXP xSEXP) { 12 | BEGIN_RCPP 13 | Rcpp::RObject rcpp_result_gen; 14 | Rcpp::RNGScope rcpp_rngScope_gen; 15 | Rcpp::traits::input_parameter< arma::mat >::type x(xSEXP); 16 | rcpp_result_gen = Rcpp::wrap(compute_md(x)); 17 | return rcpp_result_gen; 18 | END_RCPP 19 | } 20 | // compute_bd 21 | arma::mat compute_bd(arma::mat x, bool normalize); 22 | RcppExport SEXP _anomalyDetection_compute_bd(SEXP xSEXP, SEXP normalizeSEXP) { 23 | BEGIN_RCPP 24 | Rcpp::RObject rcpp_result_gen; 25 | Rcpp::RNGScope rcpp_rngScope_gen; 26 | Rcpp::traits::input_parameter< arma::mat >::type x(xSEXP); 27 | Rcpp::traits::input_parameter< bool >::type normalize(normalizeSEXP); 28 | rcpp_result_gen = Rcpp::wrap(compute_bd(x, normalize)); 29 | return rcpp_result_gen; 30 | END_RCPP 31 | } 32 | // compute_md_and_bd 33 | arma::mat compute_md_and_bd(arma::mat x, bool normalize); 34 | RcppExport SEXP _anomalyDetection_compute_md_and_bd(SEXP xSEXP, SEXP normalizeSEXP) { 35 | BEGIN_RCPP 36 | Rcpp::RObject rcpp_result_gen; 37 | Rcpp::RNGScope rcpp_rngScope_gen; 38 | Rcpp::traits::input_parameter< arma::mat >::type x(xSEXP); 39 | Rcpp::traits::input_parameter< bool >::type normalize(normalizeSEXP); 40 | rcpp_result_gen = Rcpp::wrap(compute_md_and_bd(x, normalize)); 41 | return rcpp_result_gen; 42 | END_RCPP 43 | } 44 | // compute_hc 45 | arma::mat compute_hc(int n, int p, int nsim); 46 | RcppExport SEXP _anomalyDetection_compute_hc(SEXP nSEXP, SEXP pSEXP, SEXP nsimSEXP) { 47 | BEGIN_RCPP 48 | Rcpp::RObject rcpp_result_gen; 49 | Rcpp::RNGScope rcpp_rngScope_gen; 50 | Rcpp::traits::input_parameter< int >::type n(nSEXP); 51 | Rcpp::traits::input_parameter< int >::type p(pSEXP); 52 | Rcpp::traits::input_parameter< int >::type nsim(nsimSEXP); 53 | rcpp_result_gen = Rcpp::wrap(compute_hc(n, p, nsim)); 54 | return rcpp_result_gen; 55 | END_RCPP 56 | } 57 | 
58 | static const R_CallMethodDef CallEntries[] = { 59 | {"_anomalyDetection_compute_md", (DL_FUNC) &_anomalyDetection_compute_md, 1}, 60 | {"_anomalyDetection_compute_bd", (DL_FUNC) &_anomalyDetection_compute_bd, 2}, 61 | {"_anomalyDetection_compute_md_and_bd", (DL_FUNC) &_anomalyDetection_compute_md_and_bd, 2}, 62 | {"_anomalyDetection_compute_hc", (DL_FUNC) &_anomalyDetection_compute_hc, 3}, 63 | {NULL, NULL, 0} 64 | }; 65 | 66 | RcppExport void R_init_anomalyDetection(DllInfo *dll) { 67 | R_registerRoutines(dll, NULL, CallEntries, NULL, NULL); 68 | R_useDynamicSymbols(dll, FALSE); 69 | } 70 | -------------------------------------------------------------------------------- /man/principal_components.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/pca.R 3 | \name{principal_components} 4 | \alias{principal_components} 5 | \title{Principal Component Analysis} 6 | \usage{ 7 | principal_components(data, retx = TRUE, center = TRUE, scale. = FALSE, 8 | tol = NULL, ...) 9 | } 10 | \arguments{ 11 | \item{data}{numeric data.} 12 | 13 | \item{retx}{a logical value indicating whether the rotated variables should be returned.} 14 | 15 | \item{center}{a logical value indicating whether the variables should be shifted to be 16 | zero centered. Alternately, a vector of length equal the number of columns of x can 17 | be supplied. The value is passed to scale.} 18 | 19 | \item{scale.}{a logical value indicating whether the variables should be scaled to have 20 | unit variance before the analysis takes place. The default is FALSE for consistency 21 | with S, but in general scaling is advisable. Alternatively, a vector of length equal 22 | the number of columns of \code{data} can be supplied. The value is passed to scale.} 23 | 24 | \item{tol}{a value indicating the magnitude below which components should be omitted. 
25 | (Components are omitted if their standard deviations are less than or equal to tol 26 | times the standard deviation of the first component.) With the default null setting, 27 | no components are omitted. Other settings for tol could be \code{tol = 0} or 28 | \code{tol = sqrt(.Machine$double.eps)}, which would omit essentially constant components.} 29 | 30 | \item{...}{arguments passed to or from other methods.} 31 | } 32 | \value{ 33 | \code{principal_components} returns a list containing the following components: 34 | \enumerate{ 35 | \item \code{pca_sdev}: the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the correlation matrix, though the calculation is actually done with the singular values of the data matrix). 36 | \item \code{pca_loadings}: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). 37 | \item \code{pca_rotated}: if \code{retx} is \code{TRUE} the value of the rotated data (the centred (and scaled if requested) data multiplied by the rotation matrix) is returned. Hence, \code{cov(x)} is the diagonal matrix \code{diag(sdev^2)}. 38 | \item \code{pca_center}: the centering used 39 | \item \code{pca_scale}: whether scaling was used 40 | } 41 | } 42 | \description{ 43 | \code{principal_components} relates the data to a set of a components through 44 | the eigen-decomposition of the correlation matrix, where each component explains 45 | some variance of the data and returns the results as an object of class prcomp. 46 | } 47 | \details{ 48 | The calculation is done by a singular value decomposition of the (centered and 49 | possibly scaled) data matrix, not by using eigen on the covariance matrix. 
50 | This is generally the preferred method for numerical accuracy 51 | } 52 | \examples{ 53 | x <- matrix(rnorm(200 * 3), ncol = 10) 54 | principal_components(x) 55 | principal_components(x, scale = TRUE) 56 | 57 | } 58 | \seealso{ 59 | \code{\link{prcomp}}, \code{\link{biplot.prcomp}}, \code{\link{screeplot}}, \code{\link{cor}}, 60 | \code{\link{cov}}, \code{\link{svd}}, \code{\link{eigen}} 61 | } 62 | -------------------------------------------------------------------------------- /R/mahalanobis_distance.R: -------------------------------------------------------------------------------- 1 | #' Mahalanobis Distance 2 | #' 3 | #' Calculates the distance between the elements in a data set and the mean 4 | #' vector of the data for outlier detection. Values are independent of the scale 5 | #' between variables. 6 | #' 7 | #' @param data A matrix or data frame. Data frames will be converted to matrices 8 | #' via \code{data.matrix}. 9 | #' 10 | #' @param output Character string specifying which distance metric(s) to 11 | #' compute. Current options include: \code{"md"} for Mahalanobis distance 12 | #' (default); \code{"bd"} for absolute breakdown distance (used to see which 13 | #' columns drive the Mahalanobis distance); and \code{"both"} to return both 14 | #' distance metrics. 15 | #' 16 | #' @param normalize Logical indicating whether or not to normalize the breakdown 17 | #' distances within each column (so that breakdown distances across columns can 18 | #' be compared). 19 | #' 20 | #' @return If \code{output = "md"}, then a vector containing the Mahalanobis 21 | #' distances is returned. Otherwise, a matrix. 22 | #' 23 | #' @references 24 | #' W. Wang and R. Battiti, "Identifying Intrusions in Computer Networks with 25 | #' Principal Component Analysis," in First International Conference on 26 | #' Availability, Reliability and Security, 2006. 
27 | #' 28 | #' @rdname mahalanobis_distance 29 | #' 30 | #' @export 31 | #' 32 | #' @examples 33 | #' \dontrun{ 34 | #' # Simulate some data 35 | #' x <- data.frame(C1 = rnorm(100), C2 = rnorm(100), C3 = rnorm(100)) 36 | #' 37 | #' # Add Mahalanobis distances 38 | #' x %>% dplyr::mutate(MD = mahalanobis_distance(x)) 39 | #' 40 | #' # Add Mahalanobis and breakdown distances 41 | #' x %>% cbind(mahalanobis_distance(x, output = "both")) 42 | #' 43 | #' # Add Mahalanobis and normalized breakdown distances 44 | #' x %>% cbind(mahalanobis_distance(x, output = "both", normalize = TRUE)) 45 | #' } 46 | mahalanobis_distance <- function(data, output = c("md", "bd", "both"), 47 | normalize = FALSE) { 48 | UseMethod("mahalanobis_distance") 49 | } 50 | 51 | 52 | #' @rdname mahalanobis_distance 53 | #' @export 54 | mahalanobis_distance.matrix <- function(data, output = c("md", "bd", "both"), 55 | normalize = FALSE) { 56 | 57 | # Check params 58 | output <- match.arg(output) 59 | 60 | # Check column names 61 | if (is.null(colnames(data))) { 62 | colnames(data) <- seq_len(ncol(data)) 63 | } 64 | 65 | # Compute distances 66 | if (output == "md") { # mahalanobis distance only 67 | compute_md(data)[, , drop = TRUE] 68 | } else if (output == "bd") { # absolute breakdown distance only 69 | out <- compute_bd(data, normalize = normalize) 70 | colnames(out) <- paste0(colnames(data), "_BD") 71 | out 72 | } else { # both 73 | out <- compute_md_and_bd(data, normalize = normalize) 74 | colnames(out) <- c("MD", paste0(colnames(data), "_BD")) 75 | out 76 | } 77 | 78 | } 79 | 80 | 81 | #' @rdname mahalanobis_distance 82 | #' @export 83 | mahalanobis_distance.data.frame <- function(data, 84 | output = c("md", "bd", "both"), 85 | normalize = FALSE) { 86 | mahalanobis_distance.matrix(data.matrix(data), output = output, 87 | normalize = normalize) 88 | } 89 | -------------------------------------------------------------------------------- /man/hmat.Rd: 
-------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/hmat.R 3 | \name{hmat} 4 | \alias{hmat} 5 | \title{Plot a Histogram Matrix} 6 | \usage{ 7 | hmat(data, input = "data", top = 20, order = "numeric", 8 | block_length = NULL, level_limit = 50, level_keep = 10, 9 | partial_block = TRUE, na.rm = FALSE, min_var = 0.1, max_cor = 0.9, 10 | action = "exclude", output = "both", normalize = FALSE) 11 | } 12 | \arguments{ 13 | \item{data}{the data set (data frame or matrix)} 14 | 15 | \item{input}{the type of input data being passed to the function. \code{data} for 16 | a raw categorical data set, \code{SV} for a state vector input, and \code{MD} if the 17 | input has already had the Mahalanobis distances calculated} 18 | 19 | \item{top}{how many of the most anomalous blocks you would like to display 20 | (default 20)} 21 | 22 | \item{order}{whether to show the anomalous blocks in numeric order or in order of 23 | most anomalous to least anomalous (default is "numeric", other choice is "anomaly")} 24 | 25 | \item{block_length}{argument fed into \code{tabulate_state_vector}, necessary if 26 | \code{input = data}} 27 | 28 | \item{level_limit}{argument fed into \code{tabulate_state_vector}, if the 29 | number of unique categories for a variable exceeds this number, only keep 30 | a limited number of the most popular values (default 50)} 31 | 32 | \item{level_keep}{argument fed into \code{tabulate_state_vector}, if \code{level_limit} 33 | is exceeded, keep this many of the most popular values (default 10)} 34 | 35 | \item{partial_block}{argument fed into \code{tabulate_state_vector}, if the number of 36 | entries is not divisible by the \code{block_length}, this logical decides 37 | whether to keep the smaller last block (default \code{TRUE})} 38 | 39 | \item{na.rm}{whether to keep track of missing values as part of the analysis or 40 | ignore them (default 
\code{FALSE})} 41 | 42 | \item{min_var}{argument fed into \code{mc_adjust}, if a column in the state 43 | vector has variance less than this value, remove it (default 0.1)} 44 | 45 | \item{max_cor}{argument fed into \code{mc_adjust}, if a column in the state 46 | vector has correlation greater than this value, remove it (default 0.9)} 47 | 48 | \item{action}{argument fed into \code{mc_adjust}, if a column does not fall in 49 | the specified range, determine what to do with it (default "exclude")} 50 | 51 | \item{output}{argument fed into \code{mahalanobis_distance} that decides 52 | whether to add a column for the Mahalanobis Distance ('MD'), the breakdown 53 | distances ('BD') or both (default "both")} 54 | 55 | \item{normalize}{argument fed into \code{mahalanobis_distance} that decides 56 | whether to normalize the values by column (default = FALSE)} 57 | } 58 | \description{ 59 | Display a histogram matrix for visual inspection of anomalous 60 | observation detection. The color of the blocks represents how anomalous each 61 | block is, where a lighter blue represents a more anomalous block. The size 62 | of the points indicate which values are driving the anomaly, with larger 63 | blocks representing more anomalous values. 
64 | } 65 | \examples{ 66 | \dontrun{ 67 | # Data set input 68 | hmat(security_logs,block_length = 8) 69 | 70 | # Data Set input with top 10 blocks displayed 71 | hmat(security_logs, top = 10, block_length = 5) 72 | 73 | # State Vector Input 74 | tabulate_state_vector(security_logs, block_length = 6, level_limit = 20) \%>\% 75 | hmat(input = "SV") 76 | } 77 | 78 | } 79 | -------------------------------------------------------------------------------- /docs/pkgdown.js: -------------------------------------------------------------------------------- 1 | /* http://gregfranko.com/blog/jquery-best-practices/ */ 2 | (function($) { 3 | $(function() { 4 | 5 | $("#sidebar") 6 | .stick_in_parent({offset_top: 40}) 7 | .on('sticky_kit:bottom', function(e) { 8 | $(this).parent().css('position', 'static'); 9 | }) 10 | .on('sticky_kit:unbottom', function(e) { 11 | $(this).parent().css('position', 'relative'); 12 | }); 13 | 14 | $('body').scrollspy({ 15 | target: '#sidebar', 16 | offset: 60 17 | }); 18 | 19 | $('[data-toggle="tooltip"]').tooltip(); 20 | 21 | var cur_path = paths(location.pathname); 22 | var links = $("#navbar ul li a"); 23 | var max_length = -1; 24 | var pos = -1; 25 | for (var i = 0; i < links.length; i++) { 26 | if (links[i].getAttribute("href") === "#") 27 | continue; 28 | var path = paths(links[i].pathname); 29 | 30 | var length = prefix_length(cur_path, path); 31 | if (length > max_length) { 32 | max_length = length; 33 | pos = i; 34 | } 35 | } 36 | 37 | // Add class to parent
<li>, and enclosing
<li> if in dropdown 38 | if (pos >= 0) { 39 | var menu_anchor = $(links[pos]); 40 | menu_anchor.parent().addClass("active"); 41 | menu_anchor.closest("li.dropdown").addClass("active"); 42 | } 43 | }); 44 | 45 | function paths(pathname) { 46 | var pieces = pathname.split("/"); 47 | pieces.shift(); // always starts with / 48 | 49 | var end = pieces[pieces.length - 1]; 50 | if (end === "index.html" || end === "") 51 | pieces.pop(); 52 | return(pieces); 53 | } 54 | 55 | function prefix_length(needle, haystack) { 56 | if (needle.length > haystack.length) 57 | return(0); 58 | 59 | // Special case for length-0 haystack, since for loop won't run 60 | if (haystack.length === 0) { 61 | return(needle.length === 0 ? 1 : 0); 62 | } 63 | 64 | for (var i = 0; i < haystack.length; i++) { 65 | if (needle[i] != haystack[i]) 66 | return(i); 67 | } 68 | 69 | return(haystack.length); 70 | } 71 | 72 | /* Clipboard --------------------------*/ 73 | 74 | function changeTooltipMessage(element, msg) { 75 | var tooltipOriginalTitle=element.getAttribute('data-original-title'); 76 | element.setAttribute('data-original-title', msg); 77 | $(element).tooltip('show'); 78 | element.setAttribute('data-original-title', tooltipOriginalTitle); 79 | } 80 | 81 | if(Clipboard.isSupported()) { 82 | $(document).ready(function() { 83 | var copyButton = ""; 84 | 85 | $(".examples, div.sourceCode").addClass("hasCopyButton"); 86 | 87 | // Insert copy buttons: 88 | $(copyButton).prependTo(".hasCopyButton"); 89 | 90 | // Initialize tooltips: 91 | $('.btn-copy-ex').tooltip({container: 'body'}); 92 | 93 | // Initialize clipboard: 94 | var clipboardBtnCopies = new Clipboard('[data-clipboard-copy]', { 95 | text: function(trigger) { 96 | return trigger.parentNode.textContent; 97 | } 98 | }); 99 | 100 | clipboardBtnCopies.on('success', function(e) { 101 | changeTooltipMessage(e.trigger, 'Copied!'); 102 | e.clearSelection(); 103 | }); 104 | 105 | clipboardBtnCopies.on('error', function(e) { 106 | 
changeTooltipMessage(e.trigger,'Press Ctrl+C or Command+C to copy'); 107 | }); 108 | }); 109 | } 110 | })(window.jQuery || window.$) 111 | -------------------------------------------------------------------------------- /R/mc_adjust.R: -------------------------------------------------------------------------------- 1 | #' @title Multi-Collinearity Adjustment 2 | #' 3 | #' @description 4 | #' \code{mc_adjust} handles issues with multi-collinearity. 5 | #' 6 | #' @param data named numeric data object (either data frame or matrix) 7 | #' @param min_var numeric value between 0-1 for the minimum acceptable variance (default = 0.1) 8 | #' @param max_cor numeric value between 0-1 for the maximum acceptable correlation (default = 0.9) 9 | #' @param action select action for handling columns causing multi-collinearity issues 10 | #' \enumerate{ 11 | #' \item \code{exclude}: exclude all columns causing multi-collinearity issues (default) 12 | #' \item \code{select}: identify the columns causing multi-collinearity issues 13 | #' and allow the user to interactively select those columns to remove 14 | #' } 15 | #' 16 | #' @details 17 | #' 18 | #' \code{mc_adjust} handles issues with multi-collinearity by first removing 19 | #' any columns whose variance is close to or less than \code{min_var}. Then, it 20 | #' removes linearly dependent columns. Finally, it removes any columns that have 21 | #' a high absolute correlation value equal to or greater than \code{max_cor}. 22 | #' 23 | #' @return 24 | #' \code{mc_adjust} returns the numeric data object supplied minus variables 25 | #' violating the minimum acceptable variance (\code{min_var}) and the 26 | #' maximum acceptable correlation (\code{max_cor}) levels. 
27 | #' 28 | #' 29 | #' @examples 30 | #' \dontrun{ 31 | #' x <- matrix(runif(100), ncol = 10) 32 | #' x %>% 33 | #' mc_adjust() 34 | #' 35 | #' x %>% 36 | #' mc_adjust(min_var = .15, max_cor = .75, action = "select") 37 | #'} 38 | #' 39 | #' @export 40 | 41 | mc_adjust <- function(data, min_var = 0.1, max_cor = 0.9, action = "exclude") { 42 | 43 | if(!is.numeric(min_var) || !is.numeric(max_cor)) { 44 | stop("min_var and max_cor must be numeric inputs") 45 | } 46 | 47 | if(action != "exclude" && action != "select") { 48 | stop("The action argument must be either 'exclude' or 'select'") 49 | } 50 | 51 | # add argument validation 52 | if(is.null(colnames(data))) { 53 | colnames(data) <- paste0("var", seq_len(ncol(data))) 54 | } 55 | 56 | # treat data as a data frame 57 | data <- as.data.frame(data) 58 | 59 | # Remove columns with minimal variance 60 | low_var <- names(data)[diag(stats::var(data)) > min_var] 61 | # throw error if all columns are removed 62 | if(length(low_var) == 0) { 63 | stop("All variables have been removed based on the min_var\n", 64 | "level. 
Consider adjusting minimum acceptable variance\n", 65 | "levels to allow for some variables to be retained.") 66 | } else { 67 | if (action == "exclude") {data <- data[,low_var, drop = FALSE]} 68 | } 69 | 70 | # Remove highly correlated columns (but only if there are at least 2 columns) 71 | if (ncol(data) > 1) { 72 | cor_mat <- stats::cor(data) 73 | cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0 74 | high_cor <- names(data[,sapply(as.data.frame(cor_mat), function(x) max(abs(x)) < max_cor)]) 75 | if (action == "exclude") {data <- data[,high_cor, drop = FALSE]; return(tibble::as_tibble(data))} 76 | 77 | if (action == "select" && ncol(data) != length(intersect(low_var, high_cor))) { 78 | col2rmv <- setdiff(names(data), intersect(low_var, high_cor)) 79 | message("The following variables are set to be removed:") 80 | message(paste(col2rmv,"\n")) 81 | keep <- unlist(strsplit(readline("Which of these variables would you like to keep? "), split = " ")) 82 | if (all(keep %in% col2rmv)) { 83 | data <- data[,union(intersect(low_var, high_cor), keep), drop = FALSE] 84 | return(tibble::as_tibble(data)) 85 | } else if (length(keep) == 0) { 86 | data <- data[,intersect(low_var, high_cor), drop = FALSE] 87 | } else { 88 | stop("One or more of the variables entered is not an option.") 89 | } 90 | } 91 | } else { 92 | return(tibble::as_tibble(data)) 93 | } 94 | 95 | tibble::as_tibble(data) 96 | } 97 | -------------------------------------------------------------------------------- /docs/articles/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Articles • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 44 | 45 | 46 | 47 | 48 | 49 |
    50 |
    51 | 105 | 106 | 107 |
    108 | 109 |
    110 |
    111 | 114 | 115 |
    116 |

    All vignettes

    117 |

    118 | 119 | 122 |
    123 |
    124 |
    125 | 126 |
    127 | 130 | 131 |
    132 |

    Site built with pkgdown.

    133 |
    134 | 135 |
    136 |
    137 | 138 | 139 | 140 | 141 | 142 | 143 | -------------------------------------------------------------------------------- /docs/pkgdown.css: -------------------------------------------------------------------------------- 1 | /* Sticky footer */ 2 | 3 | /** 4 | * Basic idea: https://philipwalton.github.io/solved-by-flexbox/demos/sticky-footer/ 5 | * Details: https://github.com/philipwalton/solved-by-flexbox/blob/master/assets/css/components/site.css 6 | * 7 | * .Site -> body > .container 8 | * .Site-content -> body > .container .row 9 | * .footer -> footer 10 | * 11 | * Key idea seems to be to ensure that .container and __all its parents__ 12 | * have height set to 100% 13 | * 14 | */ 15 | 16 | html, body { 17 | height: 100%; 18 | } 19 | 20 | body > .container { 21 | display: flex; 22 | height: 100%; 23 | flex-direction: column; 24 | 25 | padding-top: 60px; 26 | } 27 | 28 | body > .container .row { 29 | flex: 1 0 auto; 30 | } 31 | 32 | footer { 33 | margin-top: 45px; 34 | padding: 35px 0 36px; 35 | border-top: 1px solid #e5e5e5; 36 | color: #666; 37 | display: flex; 38 | flex-shrink: 0; 39 | } 40 | footer p { 41 | margin-bottom: 0; 42 | } 43 | footer div { 44 | flex: 1; 45 | } 46 | footer .pkgdown { 47 | text-align: right; 48 | } 49 | footer p { 50 | margin-bottom: 0; 51 | } 52 | 53 | img.icon { 54 | float: right; 55 | } 56 | 57 | img { 58 | max-width: 100%; 59 | } 60 | 61 | /* Typographic tweaking ---------------------------------*/ 62 | 63 | .contents h1.page-header { 64 | margin-top: calc(-60px + 1em); 65 | } 66 | 67 | /* Section anchors ---------------------------------*/ 68 | 69 | a.anchor { 70 | margin-left: -30px; 71 | display:inline-block; 72 | width: 30px; 73 | height: 30px; 74 | visibility: hidden; 75 | 76 | background-image: url(./link.svg); 77 | background-repeat: no-repeat; 78 | background-size: 20px 20px; 79 | background-position: center center; 80 | } 81 | 82 | .hasAnchor:hover a.anchor { 83 | visibility: visible; 84 | } 85 | 86 | @media 
(max-width: 767px) { 87 | .hasAnchor:hover a.anchor { 88 | visibility: hidden; 89 | } 90 | } 91 | 92 | 93 | /* Fixes for fixed navbar --------------------------*/ 94 | 95 | .contents h1, .contents h2, .contents h3, .contents h4 { 96 | padding-top: 60px; 97 | margin-top: -40px; 98 | } 99 | 100 | /* Static header placement on mobile devices */ 101 | @media (max-width: 767px) { 102 | .navbar-fixed-top { 103 | position: absolute; 104 | } 105 | .navbar { 106 | padding: 0; 107 | } 108 | } 109 | 110 | 111 | /* Sidebar --------------------------*/ 112 | 113 | #sidebar { 114 | margin-top: 30px; 115 | } 116 | #sidebar h2 { 117 | font-size: 1.5em; 118 | margin-top: 1em; 119 | } 120 | 121 | #sidebar h2:first-child { 122 | margin-top: 0; 123 | } 124 | 125 | #sidebar .list-unstyled li { 126 | margin-bottom: 0.5em; 127 | } 128 | 129 | .orcid { 130 | height: 16px; 131 | vertical-align: middle; 132 | } 133 | 134 | /* Reference index & topics ----------------------------------------------- */ 135 | 136 | .ref-index th {font-weight: normal;} 137 | 138 | .ref-index td {vertical-align: top;} 139 | .ref-index .alias {width: 40%;} 140 | .ref-index .title {width: 60%;} 141 | 142 | .ref-index .alias {width: 40%;} 143 | .ref-index .title {width: 60%;} 144 | 145 | .ref-arguments th {text-align: right; padding-right: 10px;} 146 | .ref-arguments th, .ref-arguments td {vertical-align: top;} 147 | .ref-arguments .name {width: 20%;} 148 | .ref-arguments .desc {width: 80%;} 149 | 150 | /* Nice scrolling for wide elements --------------------------------------- */ 151 | 152 | table { 153 | display: block; 154 | overflow: auto; 155 | } 156 | 157 | /* Syntax highlighting ---------------------------------------------------- */ 158 | 159 | pre { 160 | word-wrap: normal; 161 | word-break: normal; 162 | border: 1px solid #eee; 163 | } 164 | 165 | pre, code { 166 | background-color: #f8f8f8; 167 | color: #333; 168 | } 169 | 170 | pre code { 171 | overflow: auto; 172 | word-wrap: normal; 173 | white-space: 
pre; 174 | } 175 | 176 | pre .img { 177 | margin: 5px 0; 178 | } 179 | 180 | pre .img img { 181 | background-color: #fff; 182 | display: block; 183 | height: auto; 184 | } 185 | 186 | code a, pre a { 187 | color: #375f84; 188 | } 189 | 190 | a.sourceLine:hover { 191 | text-decoration: none; 192 | } 193 | 194 | .fl {color: #1514b5;} 195 | .fu {color: #000000;} /* function */ 196 | .ch,.st {color: #036a07;} /* string */ 197 | .kw {color: #264D66;} /* keyword */ 198 | .co {color: #888888;} /* comment */ 199 | 200 | .message { color: black; font-weight: bolder;} 201 | .error { color: orange; font-weight: bolder;} 202 | .warning { color: #6A0366; font-weight: bolder;} 203 | 204 | /* Clipboard --------------------------*/ 205 | 206 | .hasCopyButton { 207 | position: relative; 208 | } 209 | 210 | .btn-copy-ex { 211 | position: absolute; 212 | right: 0; 213 | top: 0; 214 | visibility: hidden; 215 | } 216 | 217 | .hasCopyButton:hover button.btn-copy-ex { 218 | visibility: visible; 219 | } 220 | 221 | /* mark.js ----------------------------*/ 222 | 223 | mark { 224 | background-color: rgba(255, 255, 51, 0.5); 225 | border-bottom: 2px solid rgba(255, 153, 51, 0.3); 226 | padding: 1px; 227 | } 228 | 229 | /* vertical spacing after htmlwidgets */ 230 | .html-widget { 231 | margin-bottom: 10px; 232 | } 233 | -------------------------------------------------------------------------------- /R/factor_analysis.R: -------------------------------------------------------------------------------- 1 | #' @title Factor Analysis with Varimax Rotation 2 | #' 3 | #' @description 4 | #' \code{factor_analysis} reduces the structure of the data by relating the 5 | #' correlation between variables to a set of factors, using the eigen-decomposition 6 | #' of the correlation matrix. 
7 | #' 8 | #' @param data numeric data 9 | #' @param hc_points vector of eigenvalues [designed to use output from \code{\link{horns_curve}}] 10 | #' 11 | #' @return A list containing: 12 | #' \enumerate{ 13 | #' \item \code{fa_loadings}: numerical matrix with the original factor loadings 14 | #' \item \code{fa_scores}: numerical matrix with the row scores for each factor 15 | #' \item \code{fa_loadings_rotated}: numerical matrix with the varimax rotated factor loadings 16 | #' \item \code{fa_scores_rotated}: numerical matrix with the row scores for each varimax rotated factor 17 | #' \item \code{num_factors}: numeric vector identifying the number of factors 18 | #' } 19 | #' 20 | #' @references 21 | #' 22 | #' H. F. Kaiser, "The Application of Electronic Computers to Factor Analysis," 23 | #' Educational and Psychological Measurement, 1960. 24 | #' 25 | #' @seealso 26 | #' 27 | #' \code{\link{horns_curve}} for computing the average eigenvalues used for \code{hc_points} argument 28 | #' 29 | #' @examples 30 | #' 31 | #' # Perform Factor Analysis with matrix \code{x} 32 | #' x <- matrix(rnorm(200*3), ncol = 10) 33 | #' 34 | #' x %>% 35 | #' horns_curve() %>% 36 | #' factor_analysis(x, hc_points = .) 37 | #' 38 | #' @export 39 | 40 | factor_analysis <- function(data, hc_points) { 41 | 42 | # return error if parameters are missing 43 | if(missing(data)) { 44 | stop("Missing argument: data argument", call. = FALSE) 45 | } 46 | if(missing(hc_points)) { 47 | stop("Missing argument: hc_points argument", call. 
= FALSE) 48 | } 49 | 50 | data <- as.matrix(data) 51 | N <- nrow(data) 52 | M <- ncol(data) 53 | R <- stats::cor(data) 54 | tmp <- eigen(R) 55 | tmp2 <- sort(tmp$values, decreasing = TRUE, index.return = TRUE) 56 | eigval <- tmp2$x 57 | eigvec <- tmp$vectors[, order(tmp2$ix)] 58 | 59 | # dimensionality assessment - finds the number of factors needed to account for 60 | # the variance in the data 61 | num_factors <- sum(eigval >= hc_points) 62 | 63 | xbar <- colMeans(data) 64 | Xd <- data - matrix(1, N, 1) %*% t(xbar) 65 | v <- suppressWarnings(diag(M) * diag(1 / sqrt(stats::var(data)))) 66 | Xs <- Xd %*% v 67 | 68 | eigval2 <- eigval[seq_len(num_factors)] 69 | eigvec2 <- eigvec[,seq_len(num_factors)] 70 | 71 | lambda_mat <- vapply(seq_len(num_factors), function(i) sqrt(eigval2[i]) %*% eigvec2[ , i], numeric(M)) 72 | 73 | # generalized inverse is necessary to avoid matrix close to singularity 74 | fa_scores <- Xs %*% MASS::ginv(R) %*% lambda_mat 75 | 76 | rotation <- stats::varimax(lambda_mat) 77 | B <- lambda_mat %*% rotation$rotmat 78 | fa_scores_rotated <- Xs %*% MASS::ginv(R) %*% B 79 | 80 | output <- list(fa_loadings = lambda_mat, 81 | fa_scores = fa_scores, 82 | fa_loadings_rotated = B, 83 | fa_scores_rotated = fa_scores_rotated, 84 | num_factors = num_factors) 85 | 86 | return(output) 87 | 88 | } 89 | 90 | 91 | #' Easy Access to Factor Analysis Results 92 | #' 93 | #' \code{factor_analysis_result} Provides easy access to factor analysis results 94 | #' 95 | #' @param data list output from \code{factor_analysis} 96 | #' @param results factor analysis results to extract. Can use either results 97 | #' name or number (i.e. 
fa_scores or 2): 98 | #' \enumerate{ 99 | #' \item \code{fa_loadings} (default) 100 | #' \item \code{fa_scores} 101 | #' \item \code{fa_loadings_rotated} 102 | #' \item \code{fa_scores_rotated} 103 | #' \item \code{num_factors} 104 | #' } 105 | #' 106 | #' 107 | #' @return Returns one of the selected results: 108 | #' \enumerate{ 109 | #' \item \code{fa_loadings}: numerical matrix with the original factor loadings 110 | #' \item \code{fa_scores}: numerical matrix with the row scores for each factor 111 | #' \item \code{fa_loadings_rotated}: numerical matrix with the varimax rotated factor loadings 112 | #' \item \code{fa_scores_rotated}: numerical matrix with the row scores for each varimax rotated factor 113 | #' \item \code{num_factors}: numeric vector identifying the number of factors 114 | #' } 115 | #' 116 | #' @seealso 117 | #' 118 | #' \code{\link{factor_analysis}} for computing the factor analysis results 119 | #' 120 | #' @examples 121 | #' 122 | #' # An efficient means for getting factor analysis results 123 | #' x <- matrix(rnorm(200*3), ncol = 10) 124 | #' N <- nrow(x) 125 | #' p <- ncol(x) 126 | #' 127 | #' x %>% 128 | #' horns_curve() %>% 129 | #' factor_analysis(x, hc_points = .) %>% 130 | #' factor_analysis_results(fa_scores_rotated) 131 | #' 132 | #' @export 133 | 134 | factor_analysis_results <- function(data, results = 1) { 135 | 136 | # return error if the data argument is missing; this check must precede any use of data 137 | if(missing(data)) { 138 | stop("Missing argument: data argument", call. = FALSE) 139 | } 140 | 141 | result_input <- deparse(substitute(results)) 142 | result_options <- names(data) 143 | 147 | if(result_input %in% as.character(1:5)) { 148 | data[[results]] 149 | } else if(result_input %in% result_options) { 150 | data[[result_input]] 151 | } else { 152 | stop("Invalid results argument: see ?factor_analysis_results for options", call. 
= FALSE) 153 | } 154 | } 155 | 156 | 157 | 158 | 159 | -------------------------------------------------------------------------------- /docs/reference/anomalyDetection.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | anomalyDetection: An R package for implementing augmented network log anomoly 10 | detection procedures. — anomalyDetection • anomalyDetection 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 36 | 37 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 50 | 51 | 52 | 53 | 54 | 55 |
    56 |
    57 | 111 | 112 | 113 |
    114 | 115 |
    116 |
    117 | 123 | 124 |
    125 | 126 |

    anomalyDetection: An R package for implementing augmented network log anomaly 127 | detection procedures.

    128 | 129 |
    130 | 131 | 132 | 133 |
    134 | 140 |
    141 | 142 |
    143 | 146 | 147 |
    148 |

    Site built with pkgdown.

    149 |
    150 | 151 |
    152 |
    153 | 154 | 155 | 156 | 157 | 158 | 159 | -------------------------------------------------------------------------------- /R/pca.R: -------------------------------------------------------------------------------- 1 | #' @title Principal Component Analysis 2 | #' 3 | #' @description 4 | #' \code{principal_components} relates the data to a set of a components through 5 | #' the eigen-decomposition of the correlation matrix, where each component explains 6 | #' some variance of the data and returns the results as an object of class prcomp. 7 | #' 8 | #' @param data numeric data. 9 | #' @param retx a logical value indicating whether the rotated variables should be returned. 10 | #' @param center a logical value indicating whether the variables should be shifted to be 11 | #' zero centered. Alternately, a vector of length equal the number of columns of x can 12 | #' be supplied. The value is passed to scale. 13 | #' @param scale. a logical value indicating whether the variables should be scaled to have 14 | #' unit variance before the analysis takes place. The default is FALSE for consistency 15 | #' with S, but in general scaling is advisable. Alternatively, a vector of length equal 16 | #' the number of columns of \code{data} can be supplied. The value is passed to scale. 17 | #' @param tol a value indicating the magnitude below which components should be omitted. 18 | #' (Components are omitted if their standard deviations are less than or equal to tol 19 | #' times the standard deviation of the first component.) With the default null setting, 20 | #' no components are omitted. Other settings for tol could be \code{tol = 0} or 21 | #' \code{tol = sqrt(.Machine$double.eps)}, which would omit essentially constant components. 22 | #' @param ... arguments passed to or from other methods. 
23 | #' 24 | #' @details 25 | #' 26 | #' The calculation is done by a singular value decomposition of the (centered and 27 | #' possibly scaled) data matrix, not by using eigen on the covariance matrix. 28 | #' This is generally the preferred method for numerical accuracy. 29 | #' 30 | #' @return \code{principal_components} returns a list containing the following components: 31 | #' \enumerate{ 32 | #' \item \code{pca_sdev}: the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the correlation matrix, though the calculation is actually done with the singular values of the data matrix). 33 | #' \item \code{pca_loadings}: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). 34 | #' \item \code{pca_rotated}: if \code{retx} is \code{TRUE} the value of the rotated data (the centred (and scaled if requested) data multiplied by the rotation matrix) is returned. Hence, \code{cov(x)} is the diagonal matrix \code{diag(sdev^2)}. 35 | #' \item \code{pca_center}: the centering used 36 | #' \item \code{pca_scale}: whether scaling was used 37 | #' } 38 | #' 39 | #' @seealso 40 | #' 41 | #' \code{\link{prcomp}}, \code{\link{biplot.prcomp}}, \code{\link{screeplot}}, \code{\link{cor}}, 42 | #' \code{\link{cov}}, \code{\link{svd}}, \code{\link{eigen}} 43 | #' 44 | #' 45 | #' @examples 46 | #' x <- matrix(rnorm(200 * 3), ncol = 10) 47 | #' principal_components(x) 48 | #' principal_components(x, scale. = TRUE) 49 | #' 50 | #' @export 51 | 52 | principal_components <- function(data, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...) { 53 | 54 | # pass the user-supplied arguments through to prcomp rather than hardcoding the defaults 55 | pca <- stats::prcomp(data, retx = retx, center = center, scale. = scale., tol = tol, ...) 
55 | list( 56 | pca_sdev = pca[[1]], 57 | pca_loadings = pca[[2]], 58 | pca_rotated = pca[[5]], 59 | pca_center = pca[[3]], 60 | pca_scale = pca[[4]] 61 | ) 62 | 63 | } 64 | 65 | #' Easy Access to Principal Component Analysis Results 66 | #' 67 | #' \code{principal_components_result} Provides easy access to principal 68 | #' component analysis results 69 | #' 70 | #' @param data list output from \code{principal_components} 71 | #' @param results principal component analysis results to extract. Can use either 72 | #' results name or number (i.e. pca_loadings or 2): 73 | #' \enumerate{ 74 | #' \item \code{pca_sdev} 75 | #' \item \code{pca_loadings} (default) 76 | #' \item \code{pca_rotated} 77 | #' \item \code{pca_center} 78 | #' \item \code{pca_scale} 79 | #' } 80 | #' 81 | #' 82 | #' @return Returns one of the selected results: 83 | #' \enumerate{ 84 | #' \item \code{pca_sdev}: the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the correlation matrix, though the calculation is actually done with the singular values of the data matrix). 85 | #' \item \code{pca_loadings}: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors). 86 | #' \item \code{pca_rotated}: if \code{retx} is \code{TRUE} the value of the rotated data (the centred (and scaled if requested) data multiplied by the rotation matrix) is returned. Hence, \code{cov(x)} is the diagonal matrix \code{diag(sdev^2)}. 
87 | #' \item \code{pca_center}: the centering used 88 | #' \item \code{pca_scale}: whether scaling was used 89 | #' } 90 | #' 91 | #' @seealso 92 | #' 93 | #' \code{\link{principal_components}} for computing the principal components results 94 | #' 95 | #' @examples 96 | #' 97 | #' # An efficient means for getting principal component analysis results 98 | #' x <- matrix(rnorm(200 * 3), ncol = 10) 99 | #' 100 | #' principal_components(x) %>% 101 | #' principal_components_result(pca_loadings) 102 | #' 103 | #' @export 104 | 105 | principal_components_result <- function(data, results = 2) { 106 | 107 | # return error if the data argument is missing; this check must precede any use of data 108 | if(missing(data)) { 109 | stop("Missing argument: data argument", call. = FALSE) 110 | } 111 | 112 | result_input <- deparse(substitute(results)) 113 | result_options <- names(data) 114 | 118 | if(result_input %in% as.character(1:5)) { 119 | data[[results]] 120 | } else if(result_input %in% result_options) { 121 | data[[result_input]] 122 | } else { 123 | stop("Invalid results argument: see ?principal_components_result for options", call. = FALSE) 124 | } 125 | } 126 | -------------------------------------------------------------------------------- /docs/authors.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Citation and Authors • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 44 | 45 | 46 | 47 | 48 | 49 |
    50 |
    51 | 105 | 106 | 107 |
    108 | 109 |
    110 |
    111 | 114 | 115 |

    Gutierrez RJ, Boehmke BC, Bauer KW, Saie CM, Bihl TJ (2017). 116 | “anomalyDetection: Implementation of augmented network log anomaly detection procedures.” 117 | The R Journal, 9(2), 354–365. 118 | https://journal.r-project.org/archive/2017/RJ-2017-039/index.html. 119 |

    120 |
    @Article{,
    121 |   title = {anomalyDetection: Implementation of augmented network log anomaly detection procedures.},
    122 |   author = {Robert J. Gutierrez and Bradley C. Boehmke and Kenneth W. Bauer and Cade M. Saie and Trevor J. Bihl},
    123 |   journal = {The R Journal},
    124 |   year = {2017},
    125 |   volume = {9},
    126 |   number = {2},
    127 |   pages = {354--365},
    128 |   url = {https://journal.r-project.org/archive/2017/RJ-2017-039/index.html},
    129 | }
    130 | 133 | 134 |
      135 |
    • 136 |

      Bradley Boehmke. Author, maintainer. 137 |

      138 |
    • 139 |
    • 140 |

      Brandon Greenwell. Author. 141 |

      142 |
    • 143 |
    • 144 |

      Jason Freels. Author. 145 |

      146 |
    • 147 |
    • 148 |

      Robert Gutierrez. Author. 149 |

      150 |
    • 151 |
    152 | 153 |
    154 | 155 |
    156 | 157 | 158 |
    159 | 162 | 163 |
    164 |

    Site built with pkgdown.

    165 |
    166 | 167 |
    168 |
    169 | 170 | 171 | 172 | 173 | 174 | 175 | -------------------------------------------------------------------------------- /docs/reference/security_logs.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Security Log Data — security_logs • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 47 | 48 | 49 | 50 | 51 | 52 |
    53 |
    54 | 108 | 109 | 110 |
    111 | 112 |
    113 |
    114 | 119 | 120 |
    121 | 122 |

    A mock dataset containing common information that appears in security logs.

    123 | 124 |
    125 | 126 |
    security_logs
    127 | 128 |

    Format

    129 | 130 |

    A data frame with 300 rows and 10 variables:

    131 |
    Device_Vendor

    Company who made the device

    132 |
    Device_Product

    Name of the security device

    133 |
    Device_Action

    Outcome result of access

    134 |
    Src_IP

    IP address of the source

    135 |
    Dst_IP

    IP address of the destination

    136 |
    Src_Port

    Port identifier of the source

    137 |
    Dst_Port

    Port identifier of the destination

    138 |
    Protocol

    Transport protocol used

    139 |
    Country_Src

    Country of the source

    140 |
    Bytes_TRF

    Number of bytes transferred

    141 |
    142 | 143 | 144 |
    145 | 153 |
    154 | 155 |
    156 | 159 | 160 |
    161 |

    Site built with pkgdown.

    162 |
    163 | 164 |
    165 |
    166 | 167 | 168 | 169 | 170 | 171 | 172 | -------------------------------------------------------------------------------- /docs/news/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Changelog • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 44 | 45 | 46 | 47 | 48 | 49 |
    50 |
    51 | 105 | 106 | 107 |
    108 | 109 |
    110 |
    111 | 115 | 116 |
    117 |

    118 | anomalyDetection 0.2.6 Unreleased 119 |

    120 |
      121 |
    • Tested and fixed any issues with new tibble version dependency
    • 122 |
    123 |
    124 |
    125 |

    126 | anomalyDetection 0.2.5 2018-03-07 127 |

    128 |
      129 |
    • Tested and fixed any issues with new tidyr version dependency
    • 130 |
    • Updated maintainership and active authors
    • 131 |
    • Updated URL for new Github organization holding this package
    • 132 |
    • Updated vignette package loading in vignette to resolve 1 note
    • 133 |
    • Added reference in the Description field
    • 134 |
    135 |
    136 |
    137 |

    138 | anomalyDetection 0.2.4 2017-09-22 139 |

    140 |
      141 |
    • Added NEWS file.
    • 142 |
    • Better tolerance in mahalanobis_distance when inverting covariance matrices.
    • 143 |
    • 144 | mahalanobis_distance and horns_curve have been rewritten in C++ using the RcppArmadillo package. This greatly improved the speed (and accuracy) of these functions.
    • 145 |
    • 146 | tabulate_state_vector has been rewritten using the dplyr package, greatly improving the speed of this function. Greater traceability is now also present for missing values and numeric variables.
    • 147 |
    • Producing histogram matrix to visually display anomalous blocks made easier with addition of hmat function
    • 148 |
    • Properly registered native routines and disabled symbol search.
    • 149 |
    150 |
    151 |
    152 | 153 | 163 | 164 |
    165 | 166 |
    167 | 170 | 171 |
    172 |

    Site built with pkgdown.

    173 |
    174 | 175 |
    176 |
    177 | 178 | 179 | 180 | 181 | 182 | 183 | -------------------------------------------------------------------------------- /docs/reference/get_all_factors.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Find All Factors — get_all_factors • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 48 | 49 | 50 | 51 | 52 | 53 |
    54 |
    55 | 109 | 110 | 111 |
    112 | 113 |
    114 |
    115 | 120 | 121 |
    122 | 123 |

    get_all_factors finds all factor pairs for a given integer (i.e. a number 124 | that divides evenly into another number).

    125 | 126 |
    127 | 128 |
    get_all_factors(n)
    129 | 130 |

    Arguments

    131 | 132 | 133 | 134 | 135 | 136 | 137 |
    n

    number to be factored

    138 | 139 |

    Source

    140 | 141 |

    http://stackoverflow.com/a/6425597/3851274

    142 | 143 |

    Value

    144 | 145 |

    A list containing the integer vector(s) of all factors for the given 146 | n inputs.

    147 | 148 | 149 |

    Examples

    150 |
    151 | # Find all the factors of 39304 152 | get_all_factors(39304)
    #> $`39304` 153 | #> [1] 1 2 4 8 17 34 68 136 289 578 1156 2312 154 | #> [13] 4913 9826 19652 39304 155 | #>
    156 |
    157 |
    158 | 171 |
    172 | 173 |
    174 | 177 | 178 |
    179 |

    Site built with pkgdown.

    180 |
    181 | 182 |
    183 |
    184 | 185 | 186 | 187 | 188 | 189 | 190 | -------------------------------------------------------------------------------- /R/hmat.R: -------------------------------------------------------------------------------- 1 | #' @title Plot a Histogram Matrix 2 | #' 3 | #' @description Display a histogram matrix for visual inspection of anomalous 4 | #' observation detection. The color of the blocks represents how anomalous each 5 | #' block is, where a lighter blue represents a more anomalous block. The size 6 | #' of the points indicate which values are driving the anomaly, with larger 7 | #' blocks representing more anomalous values. 8 | #' 9 | #' @param data the data set (data frame or matrix) 10 | #' @param input the type of input data being passed to the function. \code{data} for 11 | #' a raw categorical data set, \code{SV} for a state vector input, and \code{MD} if the 12 | #' input has already had the Mahalanobis distances calculated 13 | #' @param top how many of the most anomalous blocks you would like to display 14 | #' (default 20) 15 | #' @param order whether to show the anomalous blocks in numeric order or in order of 16 | #' most anomalous to least anomalous (default is "numeric", other choice is "anomaly") 17 | #' @param block_length argument fed into \code{tabulate_state_vector}, necessary if 18 | #' \code{input = data} 19 | #' @param level_limit argument fed into \code{tabulate_state_vector}, if the 20 | #' number of unique categories for a variable exceeds this number, only keep 21 | #' a limited number of the most popular values (default 50) 22 | #' @param level_keep argument fed into \code{tabulate_state_vector}, if \code{level_limit} 23 | #' is exceeded, keep this many of the most popular values (default 10) 24 | #' @param partial_block argument fed into \code{tabulate_state_vector}, if the number of 25 | #' entries is not divisible by the \code{block_length}, this logical decides 26 | #' whether to keep the smaller last block (default 
\code{TRUE}) 27 | #' @param na.rm whether to keep track of missing values as part of the analysis or 28 | #' ignore them (default \code{FALSE}) 29 | #' @param min_var argument fed into \code{mc_adjust}, if a column in the state 30 | #' vector has variance less than this value, remove it (default 0.1) 31 | #' @param max_cor argument fed into \code{mc_adjust}, if a column in the state 32 | #' vector has correlation greater than this value, remove it (default 0.9) 33 | #' @param action argument fed into \code{mc_adjust}, if a column does not fall in 34 | #' the specified range, determine what to do with it (default "exclude") 35 | #' @param output argument fed into \code{mahalanobis_distance} that decides 36 | #' whether to add a column for the Mahalanobis Distance ('MD'), the breakdown 37 | #' distances ('BD') or both (default "both") 38 | #' @param normalize argument fed into \code{mahalanobis_distance} that decides 39 | #' whether to normalize the values by column (default = FALSE) 40 | #' 41 | #' @examples 42 | #' \dontrun{ 43 | #' # Data set input 44 | #' hmat(security_logs,block_length = 8) 45 | #' 46 | #' # Data Set input with top 10 blocks displayed 47 | #' hmat(security_logs, top = 10, block_length = 5) 48 | #' 49 | #' # State Vector Input 50 | #' tabulate_state_vector(security_logs, block_length = 6, level_limit = 20) %>% 51 | #' hmat(input = "SV") 52 | #' } 53 | #' 54 | #' @export 55 | 56 | hmat <- function(data, input = "data", top = 20, order = "numeric", block_length = NULL, 57 | level_limit = 50, level_keep = 10, partial_block = TRUE, na.rm = FALSE, 58 | min_var = 0.1, max_cor = 0.9, action = "exclude", 59 | output = "both", normalize = FALSE) { 60 | 61 | # if the input is a vector or NULL, throw a warning 62 | if (is.vector(data)) { 63 | stop("data must be a matrix or data frame") 64 | } 65 | if (is.null(data)) { 66 | stop("data is NULL") 67 | } 68 | 69 | # if the order input is not "numeric" or "anomaly", stop 70 | if (order != "numeric" & order != 
"anomaly") { 71 | stop("Invalid Input: argument order must be either 'numeric' or 'anomaly'") 72 | } 73 | 74 | 75 | # if the input is a raw data set, not a state vector 76 | if (input == "data") { 77 | 78 | if(is.null(block_length)) { 79 | stop("Please supply a block_length argument") 80 | } 81 | 82 | suppressWarnings( 83 | suppressMessages( 84 | temp <- dplyr::mutate_( 85 | tibble::as.tibble( 86 | anomalyDetection::mahalanobis_distance( 87 | anomalyDetection::mc_adjust( 88 | anomalyDetection::tabulate_state_vector(data,block_length, 89 | level_limit, 90 | level_keep, 91 | partial_block, 92 | na.rm) 93 | ,min_var, max_cor, action) 94 | ,output, normalize) 95 | ) 96 | ,.dots = list(Block = quote(as.factor(1:n())))) 97 | ) 98 | ) 99 | temp$Ranked <- rank(dplyr::desc(temp$MD), ties.method = "random") 100 | temp <- dplyr::filter_( 101 | tidyr::gather_(temp,"Variable","BD",names(temp)[-c(1,length(names(temp))-1,length(names(temp)))]) 102 | ,.dots = ~ Ranked <= top) 103 | temp$Variable <- substr(temp$Variable,1,nchar(temp$Variable)-3) 104 | if (order == "anomaly") { 105 | temp$Block <- stats::reorder(temp$Block, temp$Ranked) 106 | } 107 | return( 108 | ggplot2::ggplot(temp, 109 | ggplot2::aes_string(x = "Block", y = "Variable", 110 | color = "MD", size = "BD")) + 111 | ggplot2::geom_point() 112 | ) 113 | 114 | 115 | } else if (input == "SV") { 116 | 117 | suppressWarnings( 118 | suppressMessages( 119 | temp <- dplyr::mutate_( 120 | tibble::as.tibble( 121 | anomalyDetection::mahalanobis_distance( 122 | anomalyDetection::mc_adjust(data,min_var, max_cor, action) 123 | ,output, normalize) 124 | ) 125 | ,.dots = list(Block = quote(as.factor(1:n())))) 126 | ) 127 | ) 128 | temp$Ranked <- rank(dplyr::desc(temp$MD), ties.method = "random") 129 | temp <- dplyr::filter_( 130 | tidyr::gather_(temp,"Variable","BD",names(temp)[-c(1,length(names(temp))-1,length(names(temp)))]) 131 | ,.dots = ~ Ranked <= top) 132 | temp$Variable <- substr(temp$Variable,1,nchar(temp$Variable)-3) 133 | 
if (order == "anomaly") { 134 | temp$Block <- stats::reorder(temp$Block, temp$Ranked) 135 | } 136 | return( 137 | ggplot2::ggplot(temp, 138 | ggplot2::aes_string(x = "Block", y = "Variable", 139 | color = "MD", size = "BD")) + 140 | ggplot2::geom_point() 141 | ) 142 | 143 | } else if (input == "MD") { 144 | 145 | suppressWarnings( 146 | suppressMessages( 147 | temp <- dplyr::mutate_( 148 | tibble::as.tibble(data) 149 | ,.dots = list(Block = quote(as.factor(1:n())))) 150 | ) 151 | ) 152 | temp$Ranked <- rank(dplyr::desc(temp$MD), ties.method = "random") 153 | temp <- dplyr::filter_( 154 | tidyr::gather_(temp,"Variable","BD",names(temp)[-c(1,length(names(temp))-1,length(names(temp)))]) 155 | ,.dots = ~ Ranked <= top) 156 | temp$Variable <- substr(temp$Variable,1,nchar(temp$Variable)-3) 157 | if (order == "anomaly") { 158 | temp$Block <- stats::reorder(temp$Block, temp$Ranked) 159 | } 160 | return( 161 | ggplot2::ggplot(temp, 162 | ggplot2::aes_string(x = "Block", y = "Variable", 163 | color = "MD", size = "BD")) + 164 | ggplot2::geom_point() 165 | ) 166 | 167 | } else { 168 | 169 | stop("Invalid Input: input should be set as 'data','SV', or 'MD'.") 170 | 171 | } 172 | } 173 | -------------------------------------------------------------------------------- /R/tabulate_state_vector.R: -------------------------------------------------------------------------------- 1 | #' @title Tabulate State Vector 2 | #' 3 | #' @description 4 | #' \code{tabulate_state_vector} employs a tabulated vector approach to transform 5 | #' security log data into unique counts of data attributes based on time blocks. 6 | #' Taking a contingency table approach, this function separates variables of type 7 | #' character or factor into their unique levels and counts the number of occurrences 8 | #' for those levels within each block. Due to the large number of unique IP addresses, 9 | #' this function allows for the user to determine how many IP addresses they would 10 | #' like to investigate. 
The function tabulates the most popular IP addresses. 11 | #' 12 | #' @param data data 13 | #' @param block_length integer value to divide data by 14 | #' @param level_limit integer value to determine the cutoff for the number of 15 | #' factors in a column to display before being reduced to show the number of 16 | #' levels to keep (default is 50) 17 | #' @param level_keep integer value indicating the top number of factor levels to 18 | #' retain if a column has more than the level limit (default is 10) 19 | #' @param partial_block a logical which determines whether incomplete blocks are kept in 20 | #' the analysis in the case where the number of log entries isn't evenly 21 | #' divisible by the \code{block_length} 22 | #' @param na.rm whether to keep track of missing values as part of the analysis or 23 | #' ignore them 24 | #' 25 | #' @return A data frame where each row represents one block and the columns count 26 | #' the number of occurrences that character/factor level occurred in that block 27 | #' 28 | #' @examples 29 | #' tabulate_state_vector(security_logs, 30) 30 | #' 31 | #' @export 32 | 33 | tabulate_state_vector <- function(data, block_length, level_limit = 50L, 34 | level_keep = 10L, partial_block = FALSE, 35 | na.rm = FALSE) { 36 | 37 | # return error if parameters are missing 38 | if(missing(data)) { 39 | stop("Missing argument: data argument", call. = FALSE) 40 | } 41 | if(missing(block_length)) { 42 | stop("Missing argument: block_length argument", call. = FALSE) 43 | } 44 | 45 | # return error if arguments are wrong type 46 | if(!is.data.frame(data) & !is.matrix(data)) { 47 | stop("data must be a data frame or matrix", call. = FALSE) 48 | } 49 | if(is.null(nrow(data)) | isTRUE(nrow(data) < block_length)) { 50 | stop("Your data input does not have sufficient number of rows", 51 | call. = FALSE) 52 | } 53 | if(!is.numeric(block_length)) { 54 | stop("block_length must be a numeric input", call. 
= FALSE) 55 | } 56 | if(!is.numeric(level_limit)) { 57 | stop("level_limit must be a numeric input", call. = FALSE) 58 | } 59 | if(!is.numeric(level_keep)) { 60 | stop("level_keep must be a numeric input", call. = FALSE) 61 | } 62 | if(!is.logical(partial_block)) { 63 | stop("partial_block must be a logical input", call. = FALSE) 64 | } 65 | if(!is.logical(na.rm)) { 66 | stop("na.rm must be a logical input", call. = FALSE) 67 | } 68 | 69 | # Calculate the number of leftover log entries 70 | left <- nrow(data) %% block_length 71 | 72 | # Calculate number of blocks necessary and throw a warning if 73 | # observations are removed 74 | if (partial_block == TRUE) { 75 | nblocks <- ceiling(nrow(data)/block_length) 76 | if (left != 0) { 77 | message(paste("The last block contains only",as.character(left), 78 | "observations")) 79 | } 80 | } else { 81 | nblocks <- floor(nrow(data)/block_length) 82 | data <- data[1:(nrow(data)-left),] 83 | if (left != 0) { 84 | message(paste("You have removed the last",as.character(left), 85 | "observations from the data set")) 86 | } 87 | } 88 | 89 | # Determine which variables have more categories than level_limit 90 | counts <- purrr::map_dbl(purrr::map(data,unique),length) 91 | overflowvars <- names(subset(counts,counts > level_limit)) 92 | 93 | # Only evaluate if there is an issue with overflow variables 94 | if (length(overflowvars) > 0) { 95 | 96 | # Throw warning indicating there is overflow 97 | message("Some variables contain more than ",as.character(level_limit), 98 | " levels. 
Only the ",as.character(level_keep)," most popular", 99 | " levels of these variables will be tabulated.") 100 | 101 | # Find and store 10 most popular levels of each overflow variable 102 | suppressMessages( 103 | popvars <- dplyr::select_( 104 | tidyr::spread_( 105 | dplyr::mutate_( 106 | dplyr::group_by_( 107 | dplyr::top_n( 108 | dplyr::count_( 109 | dplyr::group_by_( 110 | dplyr::filter_( 111 | stats::na.omit( 112 | tidyr::gather_( 113 | dplyr::select_(data, ~overflowvars) 114 | ,"Variables","Values",overflowvars) 115 | ) 116 | ,.dots = ~ substr(Values,1,1) != "0") 117 | ,"Variables") 118 | ,"Values") 119 | ,level_keep) 120 | ,"Variables") 121 | ,.dots = list(n = quote(1:n()))) 122 | ,"Variables","Values") 123 | , ~overflowvars) 124 | ) 125 | 126 | # replace unpopular categories with "0" 127 | suppressWarnings( 128 | for (i in overflowvars) { 129 | data[[i]][!is.na(data[[i]]) & !(data[[i]] %in% popvars[[i]])] <- "0" 130 | } 131 | ) 132 | } 133 | 134 | 135 | # construct state vector 136 | if (na.rm == TRUE) { 137 | 138 | # Remove NAs 139 | temp <- dplyr::mutate_(data, .dots = list(BLK = quote(as.numeric(floor((1:n()-1)/block_length)+1)))) 140 | temp <- tidyr::spread_( 141 | tibble::as.tibble( 142 | base::table( 143 | dplyr::group_by_( 144 | dplyr::select_( 145 | dplyr::mutate_( 146 | dplyr::filter_( 147 | stats::na.omit( 148 | tidyr::gather_(temp,"Variables","Values",names(temp)[-length(names(temp))]) 149 | ) 150 | ,.dots = ~ Values != "0") 151 | ,Values = ~ dplyr::if_else(grepl("[A-Za-z]", Values), 152 | Values, 153 | paste0(Variables,"_",Values))) 154 | ,~ c(BLK,Values)) 155 | ,"BLK") 156 | ) 157 | ) 158 | ,"Values","n") 159 | empty <- as.data.frame(matrix(0,nrow = (nblocks - nrow(temp)), 160 | ncol = ncol(temp))) 161 | colnames(empty) <- colnames(temp) 162 | empty$BLK <- setdiff(1:nblocks,temp$BLK) 163 | temp <- dplyr::arrange_(rbind(temp,empty),.dots = ~ as.numeric(BLK)) 164 | return( 165 | tibble::as.tibble( 166 | purrr::map_df( 167 | dplyr::select_(temp, 
~ names(temp)[-1]) 168 | ,base::as.numeric) 169 | ) 170 | ) 171 | 172 | 173 | 174 | } else { 175 | 176 | # Do not remove NAs 177 | temp <- dplyr::mutate_(data, .dots = list(BLK = quote(as.numeric(floor((1:n()-1)/block_length)+1)))) 178 | temp <- tidyr::spread_( 179 | tibble::as.tibble( 180 | base::table( 181 | dplyr::group_by_( 182 | dplyr::select_( 183 | dplyr::mutate_( 184 | dplyr::filter_( 185 | dplyr::mutate_( 186 | tidyr::gather_(temp,"Variables","Values",names(temp)[-length(names(temp))]) 187 | , Values = ~ dplyr::if_else(is.na(Values), 188 | paste0(Variables,"_NA"), 189 | Values)) 190 | ,.dots = ~ Values != "0") 191 | ,Values = ~ dplyr::if_else(grepl("[A-Za-z]", Values), 192 | Values, 193 | paste0(Variables,"_",Values))) 194 | ,~ c(BLK,Values)) 195 | ,"BLK") 196 | ) 197 | ) 198 | ,"Values","n") 199 | empty <- as.data.frame(matrix(0,nrow = (nblocks - nrow(temp)), 200 | ncol = ncol(temp))) 201 | colnames(empty) <- colnames(temp) 202 | empty$BLK <- setdiff(1:nblocks,temp$BLK) 203 | temp <- dplyr::arrange_(rbind(temp,empty),.dots = ~ as.numeric(BLK)) 204 | return( 205 | tibble::as.tibble( 206 | purrr::map_df( 207 | dplyr::select_(temp, ~ names(temp)[-1]) 208 | ,base::as.numeric) 209 | ) 210 | ) 211 | 212 | } 213 | 214 | } 215 | -------------------------------------------------------------------------------- /docs/reference/horns_curve.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Horn's Parallel Analysis — horns_curve • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 49 | 50 | 51 | 52 | 53 | 54 |
    55 |
    56 | 110 | 111 | 112 |
    113 | 114 |
    115 |
    116 | 121 | 122 |
    123 | 124 |

Computes the average eigenvalues produced by a Monte Carlo simulation that 125 | randomly generates a large number of n x p matrices of standard 126 | normal deviates.

    127 | 128 |
    129 | 130 |
    horns_curve(data, n, p, nsim = 1000L)
    131 | 132 |

    Arguments

    133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 151 | 152 |
    data

    A matrix or data frame.

    n

    Integer specifying the number of rows.

    p

    Integer specifying the number of columns.

    nsim

    Integer specifying the number of Monte Carlo simulations to run. 150 | Default is 1000.

    153 | 154 |

    Value

    155 | 156 |

    A vector of length p containing the averaged eigenvalues. The 157 | values can then be plotted or compared to the true eigenvalues from a dataset 158 | for a dimensionality reduction assessment.

    159 | 160 |

    References

    161 | 162 |

    J. L. Horn, "A rationale and test for the number of factors in factor 163 | analysis," Psychometrika, vol. 30, no. 2, pp. 179-185, 1965.

    164 | 165 | 166 |

    Examples

    167 |
# Perform Horn's Parallel Analysis on an n x p matrix 168 | x <- matrix(rnorm(200 * 10), ncol = 10) 169 | horns_curve(x)
    #> [1] 1.4101301 1.2833549 1.1824088 1.0973557 1.0211361 0.9482947 0.8806428 170 | #> [8] 0.8098271 0.7381270 0.6545347
    horns_curve(n = 200, p = 10)
    #> [1] 1.4106276 1.2772393 1.1798234 1.0963995 1.0196483 0.9475956 0.8772685 171 | #> [8] 0.8093003 0.7386109 0.6543157
    plot(horns_curve(x)) # scree plot
    172 |
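For readers who want the simulation spelled out, the averaging described above can be sketched in plain R. This is a simplified stand-in for the package's compiled implementation, and `horns_curve_sketch` is a hypothetical name, not part of the package:

```r
# Sketch only: average the eigenvalues of nsim random n x p
# standard-normal matrices (via their correlation matrices).
horns_curve_sketch <- function(n, p, nsim = 100) {
  eigs <- replicate(nsim, {
    m <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
    eigen(stats::cor(m), symmetric = TRUE, only.values = TRUE)$values
  })
  rowMeans(eigs)  # length-p vector of averaged eigenvalues
}
```

With `n = 200, p = 10` this should land close to the `horns_curve()` output shown above, up to simulation noise.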
    173 | 186 |
    187 | 188 |
    189 | 192 | 193 |
    194 |

    Site built with pkgdown.

    195 |
    196 | 197 |
    198 |
    199 | 200 | 201 | 202 | 203 | 204 | 205 | -------------------------------------------------------------------------------- /docs/reference/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Function reference • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 44 | 45 | 46 | 47 | 48 | 49 |
    50 |
    51 | 105 | 106 | 107 |
    108 | 109 |
    110 |
    111 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 129 | 130 | 131 | 132 | 135 | 137 | 138 | 139 | 142 | 143 | 144 | 145 | 148 | 149 | 150 | 151 | 154 | 155 | 156 | 157 | 160 | 161 | 162 | 163 | 166 | 167 | 168 | 169 | 172 | 173 | 174 | 175 | 178 | 179 | 180 | 181 | 184 | 185 | 186 | 187 | 190 | 191 | 192 | 193 | 196 | 197 | 198 | 199 | 202 | 203 | 204 | 205 | 208 | 209 | 210 | 211 | 214 | 215 | 216 | 217 | 220 | 221 | 222 | 223 | 226 | 227 | 228 | 229 |
    126 |

    All functions

    127 |

    128 |
    133 |

    anomalyDetection

    134 |

anomalyDetection: An R package for implementing augmented network log anomaly 136 | detection procedures.

    140 |

    bd_row()

    141 |

    Breakdown for Mahalanobis Distance

    146 |

    factor_analysis()

    147 |

    Factor Analysis with Varimax Rotation

    152 |

    factor_analysis_results()

    153 |

    Easy Access to Factor Analysis Results

    158 |

    get_all_factors()

    159 |

    Find All Factors

    164 |

    hmat()

    165 |

    Plot a Histogram Matrix

    170 |

    horns_curve()

    171 |

    Horn's Parallel Analysis

    176 |

    inspect_block()

    177 |

    Block Inspection

    182 |

    kaisers_index()

    183 |

    Kaiser's Index of Factorial Simplicity

    188 |

    mahalanobis_distance()

    189 |

    Mahalanobis Distance

    194 |

    mc_adjust()

    195 |

    Multi-Collinearity Adjustment

    200 |

    %>%

    201 |

    Pipe functions

    206 |

    principal_components()

    207 |

    Principal Component Analysis

    212 |

    principal_components_result()

    213 |

    Easy Access to Principal Component Analysis Results

    218 |

    security_logs

    219 |

    Security Log Data

    224 |

    tabulate_state_vector()

    225 |

    Tabulate State Vector

    230 |
    231 | 232 | 238 |
    239 | 240 |
    241 | 244 | 245 |
    246 |

    Site built with pkgdown.

    247 |
    248 | 249 |
    250 |
    251 | 252 | 253 | 254 | 255 | 256 | 257 | -------------------------------------------------------------------------------- /docs/reference/mc_adjust.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Multi-Collinearity Adjustment — mc_adjust • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 47 | 48 | 49 | 50 | 51 | 52 |
    53 |
    54 | 108 | 109 | 110 |
    111 | 112 |
    113 |
    114 | 119 | 120 |
    121 | 122 |

    mc_adjust handles issues with multi-collinearity.

    123 | 124 |
    125 | 126 |
    mc_adjust(data, min_var = 0.1, max_cor = 0.9, action = "exclude")
    127 | 128 |

    Arguments

    129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 150 | 151 |
    data

    named numeric data object (either data frame or matrix)

    min_var

    numeric value between 0-1 for the minimum acceptable variance (default = 0.1)

    max_cor

    numeric value between 0-1 for the maximum acceptable correlation (default = 0.9)

    action

    select action for handling columns causing multi-collinearity issues

  146 |
1. exclude: exclude all columns causing multi-collinearity issues (default)

  147 |
2. select: identify the columns causing multi-collinearity issues 148 | and allow the user to interactively select those columns to remove

  149 |
    152 | 153 |

    Value

    154 | 155 |

    mc_adjust returns the numeric data object supplied minus variables 156 | violating the minimum acceptable variance (min_var) and the 157 | maximum acceptable correlation (max_cor) levels.

    158 | 159 |

    Details

    160 | 161 |

    mc_adjust handles issues with multi-collinearity by first removing 162 | any columns whose variance is close to or less than min_var. Then, it 163 | removes linearly dependent columns. Finally, it removes any columns that have 164 | a high absolute correlation value equal to or greater than max_cor.

    165 | 166 | 167 |

    Examples

    168 |
    # NOT RUN {
    169 | x <- matrix(runif(100), ncol = 10)
    170 | x %>%
    171 |   mc_adjust()
    172 | 
    173 | x %>%
    174 |   mc_adjust(min_var = .15, max_cor = .75, action = "select")
    175 | # }
    176 |
    177 |
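As a rough illustration of the "exclude" behaviour described in Details — a hedged sketch, not the package's exact internals, and `mc_adjust_sketch` is a hypothetical name:

```r
# Sketch: drop near-constant columns first, then drop one column
# from each pair whose absolute correlation reaches max_cor.
mc_adjust_sketch <- function(data, min_var = 0.1, max_cor = 0.9) {
  data <- as.data.frame(data)
  data <- data[, vapply(data, stats::var, numeric(1)) > min_var, drop = FALSE]
  cors <- abs(stats::cor(data))
  cors[upper.tri(cors, diag = TRUE)] <- 0  # consider each pair once
  keep <- !apply(cors >= max_cor, 2, any) # flag columns in a high-cor pair
  data[, keep, drop = FALSE]
}
```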
    178 | 191 |
    192 | 193 |
    194 | 197 | 198 |
    199 |

    Site built with pkgdown.

    200 |
    201 | 202 |
    203 |
    204 | 205 | 206 | 207 | 208 | 209 | 210 | -------------------------------------------------------------------------------- /docs/reference/bd_row.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Breakdown for Mahalanobis Distance — bd_row • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 48 | 49 | 50 | 51 | 52 | 53 |
    54 |
    55 | 109 | 110 | 111 |
    112 | 113 |
    114 |
    115 | 120 | 121 |
    122 | 123 |

    bd_row indicates which variables in data are driving the Mahalanobis 124 | distance for a specific row r, relative to the mean vector of the data.

    125 | 126 |
    127 | 128 |
    bd_row(data, row, n = NULL)
    129 | 130 |

    Arguments

    131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 146 | 147 |
    data

    numeric data

    row

    row of interest

    n

    number of values to return. By default, will return all variables 144 | (columns) with their respective differences. However, you can choose to view 145 | only the top n variables by setting the n value.

    148 | 149 |

    Value

    150 | 151 |

    Returns a vector indicating the variables in data that are driving the 152 | Mahalanobis distance for the respective row.

    153 | 154 |

    See also

    155 | 156 |

    mahalanobis_distance for computing the Mahalanobis Distance values

    157 | 158 | 159 |

    Examples

    160 |
    # NOT RUN {
    161 | x = matrix(rnorm(200*3), ncol = 10)
    162 | colnames(x) = paste0("C", 1:ncol(x))
    163 | 
    164 | # compute the relative differences for row 5 and return all variables
    165 | x %>%
    166 |   mahalanobis_distance("bd", normalize = TRUE) %>%
    167 |   bd_row(5)
    168 | 
    169 | # compute the relative differences for row 5 and return the top 3 variables
    170 | # that are influencing the Mahalanobis Distance the most
    171 | x %>%
    172 |   mahalanobis_distance("bd", normalize = TRUE) %>%
    173 |   bd_row(5, 3)
    174 | 
    175 | # }
    176 |
    177 |
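Conceptually — this is a hedged sketch rather than the package's computation, which works on the breakdown-distance columns returned by mahalanobis_distance(), and `bd_row_sketch` is a hypothetical name — the ranking bd_row() reports is in the spirit of:

```r
# Sketch: rank variables by how far row r of a numeric matrix sits
# from the column means, measured in standard deviations.
bd_row_sketch <- function(data, row, n = ncol(data)) {
  dev <- abs(data[row, ] - colMeans(data)) / apply(data, 2, stats::sd)
  sort(dev, decreasing = TRUE)[seq_len(n)]
}
```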
    178 | 191 |
    192 | 193 |
    194 | 197 | 198 |
    199 |

    Site built with pkgdown.

    200 |
    201 | 202 |
    203 |
    204 | 205 | 206 | 207 | 208 | 209 | 210 | -------------------------------------------------------------------------------- /docs/reference/kaisers_index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Kaiser's Index of Factorial Simplicity — kaisers_index • anomalyDetection 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 47 | 48 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 61 | 62 | 63 | 64 | 65 | 66 |
    67 |
    68 | 122 | 123 | 124 |
    125 | 126 |
    127 |
    128 | 133 | 134 |
    135 | 136 |

    kaisers_index computes scores designed to assess the quality of a factor 137 | analysis solution. It measures the tendency towards unifactoriality for both 138 | a given row and the entire matrix as a whole. Kaiser proposed the evaluations 139 | of the score shown below:

    140 |

  141 |
1. In the .90s: Marvelous

  142 |
2. In the .80s: Meritorious

  143 |
3. In the .70s: Middling

  144 |
4. In the .60s: Mediocre

  145 |
5. In the .50s: Miserable

  146 |
6. < .50: Unacceptable

  147 |
    148 | 149 |

    Use as basis for selecting original or rotated loadings/scores in 150 | factor_analysis.

    151 | 152 |
    153 | 154 |
    kaisers_index(loadings)
    155 | 156 |

    Arguments

    157 | 158 | 159 | 160 | 161 | 162 | 163 |
    loadings

    numerical matrix of the factor loadings

    164 | 165 |

    Value

    166 | 167 |

    Vector containing the computed score

    168 | 169 |

    References

    170 | 171 |

    H. F. Kaiser, "An index of factorial simplicity," Psychometrika, vol. 39, no. 1, pp. 31-36, 1974.

    172 | 173 |

    See also

    174 | 175 |

    factor_analysis for computing the factor analysis loadings

    176 | 177 | 178 |

    Examples

    179 |
    # Perform Factor Analysis with matrix \code{x} 180 | x <- matrix(rnorm(200*3), ncol = 10) 181 | 182 | x %>% 183 | horns_curve() %>% 184 | factor_analysis(x, hc_points = .) %>% 185 | factor_analysis_results(fa_loadings_rotated) %>% 186 | kaisers_index()
    #> [1] 0.8162322
    187 |
    188 |
    189 | 204 |
    205 | 206 |
    207 | 210 | 211 |
    212 |

    Site built with pkgdown.

    213 |
    214 | 215 |
    216 |
    217 | 218 | 219 | 220 | 221 | 222 | 223 | -------------------------------------------------------------------------------- /docs/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Implementation of Augmented Network Log Anomaly Detection Procedures • anomalyDetection 9 | 10 | 11 | 12 | 13 | 17 | 18 | 22 | 23 | 24 |
    25 |
    79 | 80 | 81 | 82 |
    83 |
    84 | 85 | 86 | 87 | 88 | 89 | 90 |
    91 | 94 |

    anomalyDetection implements procedures to aid in detecting network log anomalies. By combining various multivariate analytic approaches relevant to network anomaly detection, it provides cyber analysts efficient means to detect suspected anomalies requiring further evaluation.

    95 |
    96 |

    97 | Installation

    98 |

You can install anomalyDetection in two ways.

    99 |
      100 |
    • Using the latest released version from CRAN:
    • 101 |
    102 |
    install.packages("anomalyDetection")
    103 |
      104 |
    • Using the latest development version from GitHub:
    • 105 |
    106 |
    if (packageVersion("devtools") < 1.6) {
    107 |   install.packages("devtools")
    108 | }
    109 | 
    110 | devtools::install_github("koalaverse/anomalyDetection", build_vignettes = TRUE)
    111 |
    112 |
    113 |

    114 | Learning

    115 |

    To get started with anomalyDetection, read the intro vignette: vignette("Introduction", package = "anomalyDetection"). This will provide a thorough introduction to the functions provided in the package.

    116 |
    117 |
    118 |

    119 | References

    120 |

    Gutierrez, R.J., Boehmke, B.C., Bauer, K.W., Saie, C.M. & Bihl, T.J. (2017) “anomalyDetection: Implementation of augmented network log anomaly detection procedures.” The R Journal, 9(2), 354-365. link

    121 |
    122 |
    123 |
    124 | 125 | 171 | 172 |
    173 | 174 | 175 | 184 |
    185 | 186 | 187 | 188 | 189 | 190 | --------------------------------------------------------------------------------