9 | =====================================================================================================
10 |
11 | `anomalyDetection` implements procedures to aid in detecting network log anomalies. By combining various multivariate analytic approaches relevant to network anomaly detection, it gives cyber analysts an efficient means of detecting suspected anomalies that require further evaluation.
12 |
13 | Installation
14 | ------------
15 |
16 | You can install `anomalyDetection` in two ways.
17 |
18 | - Using the latest released version from CRAN:
19 |
20 |
21 |
22 |     install.packages("anomalyDetection")
23 |
24 | - Using the latest development version from GitHub:
25 |
26 |
27 |
28 |     if (!requireNamespace("devtools", quietly = TRUE) || packageVersion("devtools") < "1.6") {
29 |       install.packages("devtools")
30 |     }
31 |
32 |     devtools::install_github("koalaverse/anomalyDetection", build_vignettes = TRUE)
33 |
34 | Learning
35 | --------
36 |
37 | To get started with `anomalyDetection`, read the intro [vignette](https://cran.r-project.org/web/packages/anomalyDetection/vignettes/Introduction.html): `vignette("Introduction", package = "anomalyDetection")`. This will provide a thorough introduction to the functions provided in the package.
38 |
39 | References
40 | ----------
41 |
42 | Gutierrez, R.J., Boehmke, B.C., Bauer, K.W., Saie, C.M. & Bihl, T.J. (2017) "`anomalyDetection`: Implementation of augmented network log anomaly detection procedures." The R Journal, 9(2), 354-365. [link](https://journal.r-project.org/archive/2017/RJ-2017-039/index.html)
43 |
--------------------------------------------------------------------------------
/man/mahalanobis_distance.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/mahalanobis_distance.R
3 | \name{mahalanobis_distance}
4 | \alias{mahalanobis_distance}
5 | \alias{mahalanobis_distance.matrix}
6 | \alias{mahalanobis_distance.data.frame}
7 | \title{Mahalanobis Distance}
8 | \usage{
9 | mahalanobis_distance(data, output = c("md", "bd", "both"),
10 | normalize = FALSE)
11 |
12 | \method{mahalanobis_distance}{matrix}(data, output = c("md", "bd", "both"),
13 | normalize = FALSE)
14 |
15 | \method{mahalanobis_distance}{data.frame}(data, output = c("md", "bd",
16 | "both"), normalize = FALSE)
17 | }
18 | \arguments{
19 | \item{data}{A matrix or data frame. Data frames will be converted to matrices
20 | via \code{data.matrix}.}
21 |
22 | \item{output}{Character string specifying which distance metric(s) to
23 | compute. Current options include: \code{"md"} for Mahalanobis distance
24 | (default); \code{"bd"} for absolute breakdown distance (used to see which
25 | columns drive the Mahalanobis distance); and \code{"both"} to return both
26 | distance metrics.}
27 |
28 | \item{normalize}{Logical indicating whether or not to normalize the breakdown
29 | distances within each column (so that breakdown distances across columns can
30 | be compared).}
31 | }
32 | \value{
33 | If \code{output = "md"}, then a vector containing the Mahalanobis
34 | distances is returned. Otherwise, a matrix.
35 | }
36 | \description{
37 | Calculates the distance between the elements in a data set and the mean
38 | vector of the data for outlier detection. Values are independent of the scale
39 | between variables.
40 | }
41 | \examples{
42 | \dontrun{
43 | # Simulate some data
44 | x <- data.frame(C1 = rnorm(100), C2 = rnorm(100), C3 = rnorm(100))
45 |
46 | # Add Mahalanobis distances
47 | x \%>\% dplyr::mutate(MD = mahalanobis_distance(x))
48 |
49 | # Add Mahalanobis and breakdown distances
50 | x \%>\% cbind(mahalanobis_distance(x, output = "both"))
51 |
52 | # Add Mahalanobis and normalized breakdown distances
53 | x \%>\% cbind(mahalanobis_distance(x, output = "both", normalize = TRUE))
54 | }
55 | }
56 | \references{
57 | W. Wang and R. Battiti, "Identifying Intrusions in Computer Networks with
58 | Principal Component Analysis," in First International Conference on
59 | Availability, Reliability and Security, 2006.
60 | }
61 |
--------------------------------------------------------------------------------
/docs/docsearch.js:
--------------------------------------------------------------------------------
1 | $(function() {
2 |
3 | // register a handler to move the focus to the search bar
4 | // upon pressing shift + "/" (i.e. "?")
5 | $(document).on('keydown', function(e) {
6 | if (e.shiftKey && e.keyCode == 191) {
7 | e.preventDefault();
8 | $("#search-input").focus();
9 | }
10 | });
11 |
12 | $(document).ready(function() {
13 | // do keyword highlighting
14 | /* modified from https://jsfiddle.net/julmot/bL6bb5oo/ */
15 | var mark = function() {
16 |
17 | var referrer = document.URL ;
18 | var paramKey = "q" ;
19 |
20 | if (referrer.indexOf("?") !== -1) {
21 | var qs = referrer.substring(referrer.indexOf('?') + 1);
22 | var qs_noanchor = qs.split('#')[0];
23 | var qsa = qs_noanchor.split('&');
24 | var keyword = "";
25 |
26 | for (var i = 0; i < qsa.length; i++) {
27 | var currentParam = qsa[i].split('=');
28 |
29 | if (currentParam.length !== 2) {
30 | continue;
31 | }
32 |
33 | if (currentParam[0] == paramKey) {
34 | keyword = decodeURIComponent(currentParam[1].replace(/\+/g, "%20"));
35 | }
36 | }
37 |
38 | if (keyword !== "") {
39 | $(".contents").unmark({
40 | done: function() {
41 | $(".contents").mark(keyword);
42 | }
43 | });
44 | }
45 | }
46 | };
47 |
48 | mark();
49 | });
50 | });
51 |
52 | /* Search term highlighting ------------------------------*/
53 |
54 | function matchedWords(hit) {
55 | var words = [];
56 |
57 | var hierarchy = hit._highlightResult.hierarchy;
58 | // loop to fetch from lvl0, lvl1, etc.
59 | for (var idx in hierarchy) {
60 | words = words.concat(hierarchy[idx].matchedWords);
61 | }
62 |
63 | var content = hit._highlightResult.content;
64 | if (content) {
65 | words = words.concat(content.matchedWords);
66 | }
67 |
68 | // return unique words
69 | var words_uniq = [...new Set(words)];
70 | return words_uniq;
71 | }
72 |
73 | function updateHitURL(hit) {
74 |
75 | var words = matchedWords(hit);
76 | var url = "";
77 |
78 | if (hit.anchor) {
79 | url = hit.url_without_anchor + '?q=' + encodeURIComponent(words.join(" ")) + '#' + hit.anchor;
80 | } else {
81 | url = hit.url + '?q=' + encodeURIComponent(words.join(" "));
82 | }
83 |
84 | return url;
85 | }
86 |
--------------------------------------------------------------------------------
/R/bd_row.R:
--------------------------------------------------------------------------------
1 | #' @title Breakdown for Mahalanobis Distance
2 | #'
3 | #' @description
4 | #' \code{bd_row} indicates which variables in data are driving the Mahalanobis
5 | #' distance for a specific row \code{row}, relative to the mean vector of the data.
6 | #'
7 | #' @param data numeric data
8 | #' @param row row of interest
9 | #' @param n number of values to return. By default, will return all variables
10 | #' (columns) with their respective differences. However, you can choose to view
11 | #' only the top \code{n} variables by setting the \code{n} value.
12 | #'
13 | #' @return
14 | #'
15 | #' Returns a vector indicating the variables in \code{data} that are driving the
16 | #' Mahalanobis distance for the respective row.
17 | #'
18 | #' @seealso
19 | #'
20 | #' \code{\link{mahalanobis_distance}} for computing the Mahalanobis Distance values
21 | #'
22 | #' @examples
23 | #' \dontrun{
24 | #' x = matrix(rnorm(200*3), ncol = 10)
25 | #' colnames(x) = paste0("C", 1:ncol(x))
26 | #'
27 | #' # compute the relative differences for row 5 and return all variables
28 | #' x %>%
29 | #' mahalanobis_distance("bd", normalize = TRUE) %>%
30 | #' bd_row(5)
31 | #'
32 | #' # compute the relative differences for row 5 and return the top 3 variables
33 | #' # that are influencing the Mahalanobis Distance the most
34 | #' x %>%
35 | #' mahalanobis_distance("bd", normalize = TRUE) %>%
36 | #' bd_row(5, 3)
37 | #' }
38 | #'
39 | #' @export
40 |
41 | bd_row <- function(data, row, n = NULL) {
42 |
43 | # return error if parameters are missing or invalid
44 | if(missing(data)) {
45 | stop("Missing data argument", call. = FALSE)
46 | }
47 | if(length(row) != 1) {
48 | stop("row value must be a single integer", call. = FALSE)
49 | }
50 | if(! row %in% seq_len(nrow(data))) {
51 | stop("Invalid row value", call. = FALSE)
52 | }
53 | if(!isTRUE(n %in% seq_len(ncol(data))) && !is.null(n)) {
54 | stop("Invalid n value", call. = FALSE)
55 | }
56 |
57 | C <- stats::cov(data)
58 | CM <- as.matrix(colMeans(data))
59 | D <- (data[row,] - t(CM))
60 | bd <- D /sqrt(diag(C))
61 | bd_abs <- abs(bd)
62 | tmp <- sort(bd_abs, decreasing = TRUE, index.return = TRUE)
63 | bd_sort <- tmp$x
64 | bd_index <- tmp$ix
65 |
66 | output <- bd_sort
67 | names(output) <- colnames(data)[bd_index]
68 |
69 | if(!is.null(n)) {
70 | output <- output[seq_len(n)]
71 | }
72 |
73 | return(output)
74 |
75 | }
76 |
77 |
--------------------------------------------------------------------------------
/README.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | output: github_document
3 | ---
4 |
5 |
6 |
7 | ```{r, echo = FALSE}
8 | knitr::opts_chunk$set(
9 | collapse = TRUE,
10 | comment = "#>",
11 | fig.path = "README-"
12 | )
13 | ```
14 |
15 | [](https://cran.r-project.org/package=anomalyDetection)
16 | [](https://travis-ci.org/koalaverse/anomalyDetection)
17 | [](https://ci.appveyor.com/project/bradleyboehmke/anomalyDetection)
18 | [](https://codecov.io/gh/koalaverse/anomalyDetection)
19 | [](http://cranlogs.r-pkg.org/badges/anomalyDetection)
20 | [](http://cranlogs.r-pkg.org/badges/grand-total/anomalyDetection)
21 |
22 | # anomalyDetection
23 |
24 | `anomalyDetection` implements procedures to aid in detecting network log anomalies. By combining various multivariate analytic approaches relevant to network anomaly detection, it gives cyber analysts an efficient means of detecting suspected anomalies that require further evaluation.
25 |
26 |
27 | ## Installation
28 |
29 | You can install `anomalyDetection` in two ways.
30 |
31 | - Using the latest released version from CRAN:
32 |
33 | ```
34 | install.packages("anomalyDetection")
35 | ```
36 |
37 | - Using the latest development version from GitHub:
38 |
39 | ```
40 | if (!requireNamespace("devtools", quietly = TRUE) || packageVersion("devtools") < "1.6") {
41 | install.packages("devtools")
42 | }
43 |
44 | devtools::install_github("koalaverse/anomalyDetection", build_vignettes = TRUE)
45 | ```
46 |
47 | ## Learning
48 |
49 | To get started with `anomalyDetection`, read the intro [vignette](https://cran.r-project.org/web/packages/anomalyDetection/vignettes/Introduction.html): `vignette("Introduction", package = "anomalyDetection")`. This will provide a thorough introduction to the functions provided in the package.
50 |
51 | ## References
52 |
53 | Gutierrez, R.J., Boehmke, B.C., Bauer, K.W., Saie, C.M. & Bihl, T.J. (2017) "`anomalyDetection`: Implementation of augmented network log anomaly detection procedures." The R Journal, 9(2), 354-365. [link](https://journal.r-project.org/archive/2017/RJ-2017-039/index.html)
54 |
--------------------------------------------------------------------------------
/docs/jquery.sticky-kit.min.js:
--------------------------------------------------------------------------------
1 | /*
2 | Sticky-kit v1.1.2 | WTFPL | Leaf Corcoran 2015 | http://leafo.net
3 | */
4 | (function(){var b,f;b=this.jQuery||window.jQuery;f=b(window);b.fn.stick_in_parent=function(d){var A,w,J,n,B,K,p,q,k,E,t;null==d&&(d={});t=d.sticky_class;B=d.inner_scrolling;E=d.recalc_every;k=d.parent;q=d.offset_top;p=d.spacer;w=d.bottoming;null==q&&(q=0);null==k&&(k=void 0);null==B&&(B=!0);null==t&&(t="is_stuck");A=b(document);null==w&&(w=!0);J=function(a,d,n,C,F,u,r,G){var v,H,m,D,I,c,g,x,y,z,h,l;if(!a.data("sticky_kit")){a.data("sticky_kit",!0);I=A.height();g=a.parent();null!=k&&(g=g.closest(k));
5 | if(!g.length)throw"failed to find stick parent";v=m=!1;(h=null!=p?p&&a.closest(p):b(""))&&h.css("position",a.css("position"));x=function(){var c,f,e;if(!G&&(I=A.height(),c=parseInt(g.css("border-top-width"),10),f=parseInt(g.css("padding-top"),10),d=parseInt(g.css("padding-bottom"),10),n=g.offset().top+c+f,C=g.height(),m&&(v=m=!1,null==p&&(a.insertAfter(h),h.detach()),a.css({position:"",top:"",width:"",bottom:""}).removeClass(t),e=!0),F=a.offset().top-(parseInt(a.css("margin-top"),10)||0)-q,
6 | u=a.outerHeight(!0),r=a.css("float"),h&&h.css({width:a.outerWidth(!0),height:u,display:a.css("display"),"vertical-align":a.css("vertical-align"),"float":r}),e))return l()};x();if(u!==C)return D=void 0,c=q,z=E,l=function(){var b,l,e,k;if(!G&&(e=!1,null!=z&&(--z,0>=z&&(z=E,x(),e=!0)),e||A.height()===I||x(),e=f.scrollTop(),null!=D&&(l=e-D),D=e,m?(w&&(k=e+u+c>C+n,v&&!k&&(v=!1,a.css({position:"fixed",bottom:"",top:c}).trigger("sticky_kit:unbottom"))),e

R/anomalyDetection.R, anomalyDetection.Rd

anomalyDetection: An R package for implementing augmented network log anomaly detection procedures.

A mock dataset containing common information that appears in security logs.

    security_logs

A data frame with 300 rows and 10 variables:

- Company that made the device
- Name of the security device
- Outcome result of access
- IP address of the source
- IP address of the destination
- Port identifier of the source
- Port identifier of the destination
- Transport protocol used
- Country of the source
- Number of bytes transferred

NEWS.md

- [...] NEWS file.
- [...] `mahalanobis_distance` when inverting covariance matrices.
- `mahalanobis_distance` and `horns_curve` have been rewritten in C++ using the RcppArmadillo package. This greatly improved the speed (and accuracy) of these functions.
- `tabulate_state_vector` has been rewritten using the dplyr package, greatly improving the speed of this function. Greater traceability is now also present for missing values and numeric variables.
- [...] `hmat` function

`get_all_factors` finds all factor pairs for a given integer (i.e. a number that divides evenly into another number).

Usage
-----

    get_all_factors(n)

Arguments
---------

| Argument | Description |
|---|---|
| n | number to be factored |

Source
------

http://stackoverflow.com/a/6425597/3851274

Value
-----

A list containing the integer vector(s) containing all factors for the given `n` inputs.

Examples
--------

    # Find all the factors of 39304
    get_all_factors(39304)
    #> $`39304`
    #>  [1]     1     2     4     8    17    34    68   136   289   578  1156  2312
    #> [13]  4913  9826 19652 39304
    #>

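The factor search described above can be sketched in a few lines of base R. This is only an illustration, not the package's implementation: `all_factors` is a hypothetical helper that handles a single positive integer and returns a plain vector rather than the named list shown in the documented output.

```r
# Hypothetical helper (not part of anomalyDetection): return every positive
# divisor of one positive integer, in increasing order.
all_factors <- function(n) {
  stopifnot(length(n) == 1, n >= 1, n == round(n))
  small <- seq_len(floor(sqrt(n)))       # candidate divisors up to sqrt(n)
  small <- small[n %% small == 0]        # keep those that divide n evenly
  sort(unique(c(small, n / small)))      # pair each divisor with its cofactor
}

all_factors(39304)
```

Only candidates up to the square root need testing, because every larger divisor is the cofactor of a smaller one.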
Computes the average eigenvalues produced by a Monte Carlo simulation that randomly generates a large number of `n`x`p` matrices of standard normal deviates.

Usage
-----

    horns_curve(data, n, p, nsim = 1000L)

Arguments
---------

| Argument | Description |
|---|---|
| data | A matrix or data frame. |
| n | Integer specifying the number of rows. |
| p | Integer specifying the number of columns. |
| nsim | Integer specifying the number of Monte Carlo simulations to run. Default is `1000`. |

Value
-----

A vector of length `p` containing the averaged eigenvalues. The values can then be plotted or compared to the true eigenvalues from a dataset for a dimensionality reduction assessment.

References
----------

J. L. Horn, "A rationale and test for the number of factors in factor analysis," Psychometrika, vol. 30, no. 2, pp. 179-185, 1965.

Examples
--------

    # Perform Horn's Parallel analysis with matrix n x p dimensions
    x <- matrix(rnorm(200 * 10), ncol = 10)
    horns_curve(x)
    #>  [1] 1.4101301 1.2833549 1.1824088 1.0973557 1.0211361 0.9482947 0.8806428
    #>  [8] 0.8098271 0.7381270 0.6545347
    horns_curve(n = 200, p = 10)
    #>  [1] 1.4106276 1.2772393 1.1798234 1.0963995 1.0196483 0.9475956 0.8772685
    #>  [8] 0.8093003 0.7386109 0.6543157
    plot(horns_curve(x)) # scree plot

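The Monte Carlo procedure described above can be approximated in base R as follows. This is a rough sketch and not the package's C++ implementation: `horn_sketch` is a hypothetical name, and taking eigenvalues from the sample covariance matrix (rather than, say, the correlation matrix) is an assumption.

```r
# Hypothetical sketch of Horn's curve: average the sorted eigenvalues of
# many simulated n x p matrices of standard normal deviates.
horn_sketch <- function(n, p, nsim = 100L) {
  eig <- replicate(nsim, {
    z <- matrix(rnorm(n * p), nrow = n, ncol = p)
    # Assumption: eigenvalues come from the sample covariance matrix.
    eigen(cov(z), symmetric = TRUE, only.values = TRUE)$values
  })
  rowMeans(eig)  # eig is p x nsim, so each row averages one eigenvalue rank
}

set.seed(1)
hc <- horn_sketch(n = 200, p = 10)
hc
```

Eigenvalues of a real dataset that sit above the corresponding averaged value are the ones usually retained in a parallel analysis.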
`mc_adjust` handles issues with multi-collinearity.

Usage
-----

    mc_adjust(data, min_var = 0.1, max_cor = 0.9, action = "exclude")

Arguments
---------

| Argument | Description |
|---|---|
| data | named numeric data object (either data frame or matrix) |
| min_var | numeric value between 0-1 for the minimum acceptable variance (default = 0.1) |
| max_cor | numeric value between 0-1 for the maximum acceptable correlation (default = 0.9) |
| action | select action for handling columns causing multi-collinearity issues |

Value
-----

`mc_adjust` returns the numeric data object supplied minus variables violating the minimum acceptable variance (`min_var`) and the maximum acceptable correlation (`max_cor`) levels.

Details
-------

`mc_adjust` handles issues with multi-collinearity by first removing any columns whose variance is close to or less than `min_var`. Then, it removes linearly dependent columns. Finally, it removes any columns that have a high absolute correlation value equal to or greater than `max_cor`.

Examples
--------

    # NOT RUN {
    x <- matrix(runif(100), ncol = 10)
    x %>%
      mc_adjust()

    x %>%
      mc_adjust(min_var = .15, max_cor = .75, action = "select")
    # }

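The three filtering steps described above can be sketched in base R. This is a simplified illustration rather than the package's code: `mc_sketch` is a hypothetical name, the variance rule is reduced to a plain `< min_var` test, linear dependence is detected with a pivoted QR decomposition of the centered data, and the correlation step greedily drops the later column of each offending pair. All of these choices are assumptions.

```r
# Hypothetical sketch of the filtering steps (not the package's implementation).
mc_sketch <- function(x, min_var = 0.1, max_cor = 0.9) {
  x <- as.matrix(x)
  # Step 1: drop columns whose variance is below min_var.
  x <- x[, apply(x, 2, var) >= min_var, drop = FALSE]
  # Step 2: drop linearly dependent columns by keeping only the pivot
  # columns that contribute to the rank of the centered data.
  q <- qr(scale(x, scale = FALSE))
  x <- x[, sort(q$pivot[seq_len(q$rank)]), drop = FALSE]
  # Step 3: drop a column if its absolute correlation with any earlier
  # column is at or above max_cor (a greedy simplification).
  r <- abs(cor(x))
  r[lower.tri(r, diag = TRUE)] <- 0
  x[, !apply(r >= max_cor, 2, any), drop = FALSE]
}

set.seed(42)
a <- rnorm(100)
b <- rnorm(100)
x <- cbind(a = a, b = b,
           c = a + b,                      # exact linear combination
           d = a + rnorm(100, sd = 0.01),  # near-duplicate of a
           e = rnorm(100, sd = 0.01))      # almost no variance
colnames(mc_sketch(x))
```

Each of the three extra columns is built to trip one of the steps, so the example shows the filters acting in sequence.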
`bd_row` indicates which variables in data are driving the Mahalanobis distance for a specific row `row`, relative to the mean vector of the data.

Usage
-----

    bd_row(data, row, n = NULL)

Arguments
---------

| Argument | Description |
|---|---|
| data | numeric data |
| row | row of interest |
| n | number of values to return. By default, will return all variables (columns) with their respective differences. However, you can choose to view only the top `n` variables by setting the `n` value. |

Value
-----

Returns a vector indicating the variables in `data` that are driving the Mahalanobis distance for the respective row.

See also
--------

`mahalanobis_distance` for computing the Mahalanobis Distance values

Examples
--------

    # NOT RUN {
    x = matrix(rnorm(200*3), ncol = 10)
    colnames(x) = paste0("C", 1:ncol(x))

    # compute the relative differences for row 5 and return all variables
    x %>%
      mahalanobis_distance("bd", normalize = TRUE) %>%
      bd_row(5)

    # compute the relative differences for row 5 and return the top 3 variables
    # that are influencing the Mahalanobis Distance the most
    x %>%
      mahalanobis_distance("bd", normalize = TRUE) %>%
      bd_row(5, 3)

    # }

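The arithmetic behind the breakdown can be reproduced with base R alone. The sketch below follows the steps in the `bd_row` source (deviation from the column means, divided by each column's standard deviation, sorted by absolute size). The simulated data and the planted outlier are assumptions made for illustration, and `stats::mahalanobis` stands in for the package's own distance function; note that it returns squared distances.

```r
# Simulated data with one planted extreme value (illustrative assumption).
set.seed(1)
x <- matrix(rnorm(60 * 4), ncol = 4,
            dimnames = list(NULL, paste0("C", 1:4)))
x[5, "C3"] <- 10

# Squared Mahalanobis distance of every row from the mean vector.
md2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
which.max(md2)                        # the row that stands out most

# Breakdown for row 5: standardized absolute deviation per column.
dev <- x[5, ] - colMeans(x)           # deviation from the mean vector
bd  <- abs(dev / sqrt(diag(cov(x))))  # scale by each column's std. dev.
sort(bd, decreasing = TRUE)           # largest contributors first
```

Sorting the standardized deviations puts the column responsible for the large distance at the front, which is the information `bd_row` is documented to return.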
R/kaisers_index.R, kaisers_index.Rd

`kaisers_index` computes scores designed to assess the quality of a factor analysis solution. It measures the tendency towards unifactoriality for both a given row and the entire matrix as a whole. Kaiser proposed the evaluations of the score shown below:

- In the .90s: Marvelous
- In the .80s: Meritorious
- In the .70s: Middling
- In the .60s: Mediocre
- In the .50s: Miserable
- < .50: Unacceptable

Use as basis for selecting original or rotated loadings/scores in `factor_analysis`.

Usage
-----

    kaisers_index(loadings)

Arguments
---------

| Argument | Description |
|---|---|
| loadings | numerical matrix of the factor loadings |

Value
-----

Vector containing the computed score

References
----------

H. F. Kaiser, "An index of factorial simplicity," Psychometrika, vol. 39, no. 1, pp. 31-36, 1974.

See also
--------

`factor_analysis` for computing the factor analysis loadings

Examples
--------

    # Perform Factor Analysis with matrix x
    x <- matrix(rnorm(200*3), ncol = 10)

    x %>%
      horns_curve() %>%
      factor_analysis(x, hc_points = .) %>%
      factor_analysis_results(fa_loadings_rotated) %>%
      kaisers_index()
    #> [1] 0.8162322
