├── .Rbuildignore
├── .gitignore
├── LICENSE
├── data
│   └── breastcancer.rda
├── OneR.Rproj
├── man
│   ├── is.OneR.Rd
│   ├── print.OneR.Rd
│   ├── plot.OneR.Rd
│   ├── summary.OneR.Rd
│   ├── breastcancer.Rd
│   ├── maxlevels.Rd
│   ├── predict.OneR.Rd
│   ├── eval_model.Rd
│   ├── bin.Rd
│   ├── OneR.Rd
│   └── optbin.Rd
├── NAMESPACE
├── DESCRIPTION
├── R
│   ├── OneR_data.R
│   ├── OneR_internal.R
│   ├── OneR_main.R
│   └── OneR.R
├── README.md
├── vignettes
│   ├── OneR.R
│   ├── OneR.Rmd
│   └── OneR.html
└── NEWS

/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*\.Rproj$
2 | ^\.Rproj\.user$
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | inst/doc
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | YEAR: 2016 - 2017
2 | COPYRIGHT HOLDER: Holger von Jouanne-Diedrich
--------------------------------------------------------------------------------
/data/breastcancer.rda:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vonjd/OneR/HEAD/data/breastcancer.rda
--------------------------------------------------------------------------------
/OneR.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: knitr
13 | LaTeX: pdfLaTeX
14 |
15 | AutoAppendNewline: Yes
16 | StripTrailingWhitespace: Yes
17 |
18 | BuildType: Package
19 | PackageUseDevtools: Yes
20 | PackageInstallArgs: --no-multiarch --with-keep.source
21 | PackageRoxygenize: rd,collate,namespace,vignette
--------------------------------------------------------------------------------
/man/is.OneR.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/OneR.R
3 | \name{is.OneR}
4 | \alias{is.OneR}
5 | \title{Test OneR model objects}
6 | \usage{
7 | is.OneR(x)
8 | }
9 | \arguments{
10 | \item{x}{object to be tested.}
11 | }
12 | \value{
13 | a logical whether object is of class "OneR".
14 | }
15 | \description{
16 | Test if object is a OneR model.
17 | }
18 | \examples{
19 | model <- OneR(iris)
20 | is.OneR(model) # evaluates to TRUE
21 | }
22 | \references{
23 | \url{https://github.com/vonjd/OneR}
24 | }
25 | \author{
26 | Holger von Jouanne-Diedrich
27 | }
28 | \keyword{OneR}
29 | \keyword{model}
--------------------------------------------------------------------------------
/man/print.OneR.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/OneR.R
3 | \name{print.OneR}
4 | \alias{print.OneR}
5 | \title{Print OneR models}
6 | \usage{
7 | \method{print}{OneR}(x, ...)
8 | }
9 | \arguments{
10 | \item{x}{object of class \code{"OneR"}.}
11 |
12 | \item{...}{further arguments passed to or from other methods.}
13 | }
14 | \description{
15 | \code{print} method for class \code{OneR}.
16 | }
17 | \details{
18 | Prints the rules and the accuracy of an OneR model.
19 | }
20 | \examples{
21 | model <- OneR(iris)
22 | print(model)
23 | }
24 | \references{
25 | \url{https://github.com/vonjd/OneR}
26 | }
27 | \seealso{
28 | \code{\link{OneR}}
29 | }
30 | \author{
31 | Holger von Jouanne-Diedrich
32 | }
33 |
--------------------------------------------------------------------------------
/NAMESPACE:
--------------------------------------------------------------------------------
1 | # Generated by roxygen2: do not edit by hand
2 |
3 | S3method(OneR,data.frame)
4 | S3method(OneR,default)
5 | S3method(OneR,formula)
6 | S3method(optbin,data.frame)
7 | S3method(optbin,default)
8 | S3method(optbin,formula)
9 | S3method(plot,OneR)
10 | S3method(predict,OneR)
11 | S3method(print,OneR)
12 | S3method(summary,OneR)
13 | export(OneR)
14 | export(bin)
15 | export(eval_model)
16 | export(is.OneR)
17 | export(maxlevels)
18 | export(optbin)
19 | importFrom(graphics,mosaicplot)
20 | importFrom(stats,addmargins)
21 | importFrom(stats,binom.test)
22 | importFrom(stats,binomial)
23 | importFrom(stats,chisq.test)
24 | importFrom(stats,coef)
25 | importFrom(stats,filter)
26 | importFrom(stats,glm)
27 | importFrom(stats,kmeans)
28 | importFrom(stats,model.frame)
29 | importFrom(stats,na.omit)
30 | importFrom(stats,quantile)
31 |
--------------------------------------------------------------------------------
/man/plot.OneR.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/OneR.R
3 | \name{plot.OneR}
4 | \alias{plot.OneR}
5 | \title{Plot Diagnostics for an OneR object}
6 | \usage{
7 | \method{plot}{OneR}(x, ...)
8 | }
9 | \arguments{
10 | \item{x}{object of class \code{"OneR"}.}
11 |
12 | \item{...}{further arguments passed to or from other methods.}
13 | }
14 | \description{
15 | Plots a mosaic plot for the feature attribute and the target of the OneR model.
16 | }
17 | \details{
18 | If more than 20 levels are present for either the feature attribute or the target the function stops with an error.
19 | }
20 | \examples{
21 | model <- OneR(iris)
22 | plot(model)
23 | }
24 | \references{
25 | \url{https://github.com/vonjd/OneR}
26 | }
27 | \seealso{
28 | \code{\link{OneR}}
29 | }
30 | \author{
31 | Holger von Jouanne-Diedrich
32 | }
33 | \keyword{diagnostics}
34 |
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: OneR
2 | Type: Package
3 | Title: One Rule Machine Learning Classification Algorithm with Enhancements
4 | Version: 2.2
5 | Date: 2017-05-05
6 | Author: Holger von Jouanne-Diedrich
7 | Maintainer: Holger von Jouanne-Diedrich
8 | Depends: R (>= 2.10)
9 | Description: Implements the One Rule (OneR) Machine Learning classification algorithm (Holte, R.C. (1993) <doi:10.1023/A:1022631118932>) with enhancements for sophisticated handling of numeric data and missing values together with extensive diagnostic functions. It is useful as a baseline for machine learning models and the rules are often helpful heuristics.
10 | License: MIT + file LICENSE 11 | URL: https://github.com/vonjd/OneR 12 | BugReports: https://github.com/vonjd/OneR/issues 13 | LazyData: TRUE 14 | RoxygenNote: 6.0.1 15 | Suggests: knitr, 16 | rmarkdown 17 | VignetteBuilder: knitr 18 | -------------------------------------------------------------------------------- /man/summary.OneR.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/OneR.R 3 | \name{summary.OneR} 4 | \alias{summary.OneR} 5 | \title{Summarize OneR models} 6 | \usage{ 7 | \method{summary}{OneR}(object, ...) 8 | } 9 | \arguments{ 10 | \item{object}{object of class \code{"OneR"}.} 11 | 12 | \item{...}{further arguments passed to or from other methods.} 13 | } 14 | \description{ 15 | \code{summary} method for class \code{OneR}. 16 | } 17 | \details{ 18 | Prints the rules of the OneR model, the accuracy, a contingency table of the feature attribute and the target and performs a chi-squared test on this table. 19 | 20 | In the contingency table the maximum values in each column are highlighted by adding a '*', thereby representing the rules of the OneR model. 21 | } 22 | \examples{ 23 | model <- OneR(iris) 24 | summary(model) 25 | } 26 | \references{ 27 | \url{https://github.com/vonjd/OneR} 28 | } 29 | \seealso{ 30 | \code{\link{OneR}} 31 | } 32 | \author{ 33 | Holger von Jouanne-Diedrich 34 | } 35 | \keyword{diagnostics} 36 | -------------------------------------------------------------------------------- /R/OneR_data.R: -------------------------------------------------------------------------------- 1 | #' Breast Cancer Wisconsin Original Data Set 2 | #' 3 | #' Dataset containing the original Wisconsin breast cancer data. 4 | #' 5 | #' \enumerate{ 6 | #' \item Clump Thickness: 1 - 10 7 | #' \item Uniformity of Cell Size: 1 - 10 8 | #' \item Uniformity of Cell Shape: 1 - 10 9 | #' \item Marginal Adhesion: 1 - 10 10 | #' \item Single Epithelial Cell Size: 1 - 10 11 | #' \item Bare Nuclei: 1 - 10 12 | #' \item Bland Chromatin: 1 - 10 13 | #' \item Normal Nucleoli: 1 - 10 14 | #' \item Mitoses: 1 - 10 15 | #' \item Class: benign, malignant 16 | #' } 17 | #' 18 | #' @name breastcancer 19 | #' @docType data 20 | #' @references The data were obtained from the UCI machine learning repository, see \url{https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)} 21 | #' @keywords data datasets Wisconsin breast cancer 22 | #' @usage data(breastcancer) 23 | #' @format A data frame with 699 instances and 10 attributes. The variables are as follows: 24 | #' @examples 25 | #' data(breastcancer) 26 | #' data <- optbin(breastcancer, method = "infogain") 27 | #' model <- OneR(data, verbose = TRUE) 28 | #' summary(model) 29 | #' plot(model) 30 | #' prediction <- predict(model, data) 31 | #' eval_model(prediction, data) 32 | NULL 33 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # OneR 2 | This R package implements the One Rule (OneR) Machine Learning classification algorithm with enhancements for sophisticated handling of numeric data and missing values together with extensive diagnostic functions. It is useful as a baseline for machine learning models and the rules are often helpful heuristics. 
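
## Quick example

A minimal end-to-end run, taken from the package documentation, so the whole workflow is visible at a glance:

```r
library(OneR)
data <- optbin(iris)                # discretize numeric features with optimal cut points
model <- OneR(data, verbose = TRUE) # build the model on the best predictor
summary(model)                      # show the learned rules and diagnostics
prediction <- predict(model, data)
eval_model(prediction, data)        # confusion matrices, accuracy and p-value
```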
3 | 4 | ## Documentation 5 | 6 | This video gives a step-by-step introduction: [Quick Start Guide for the OneR package](https://www.youtube.com/watch?v=AGC0oRlXxgU) 7 | 8 | You can find the vignette and full documentation in the package and on CRAN: [OneR: One Rule Machine Learning Classification Algorithm with Enhancements](https://cran.r-project.org/package=OneR) 9 | 10 | ## Installation 11 | 12 | Install the latest stable version from [CRAN](https://cran.r-project.org/package=OneR): 13 | 14 | ```r 15 | install.packages("OneR") 16 | ``` 17 | 18 | Install the latest development version from GitHub: 19 | 20 | ```R 21 | install.packages("devtools") 22 | library(devtools) 23 | install_github("vonjd/OneR") 24 | ``` 25 | 26 | ## Contact 27 | 28 | I would love to hear about your experiences with the OneR package. Please drop me a note - you can reach me at my university account: [Holger K. von Jouanne-Diedrich](https://www.h-ab.de/nc/eng/about-aschaffenburg-university-of-applied-sciences/organisation/personal/?tx_fhapersonal_pi1%5BshowUid%5D=jouanne-diedrich) 29 | -------------------------------------------------------------------------------- /man/breastcancer.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/OneR_data.R 3 | \docType{data} 4 | \name{breastcancer} 5 | \alias{breastcancer} 6 | \title{Breast Cancer Wisconsin Original Data Set} 7 | \format{A data frame with 699 instances and 10 attributes. The variables are as follows:} 8 | \usage{ 9 | data(breastcancer) 10 | } 11 | \description{ 12 | Dataset containing the original Wisconsin breast cancer data. 13 | } 14 | \details{ 15 | \enumerate{ 16 | \item Clump Thickness: 1 - 10 17 | \item Uniformity of Cell Size: 1 - 10 18 | \item Uniformity of Cell Shape: 1 - 10 19 | \item Marginal Adhesion: 1 - 10 20 | \item Single Epithelial Cell Size: 1 - 10 21 | \item Bare Nuclei: 1 - 10 22 | \item Bland Chromatin: 1 - 10 23 | \item Normal Nucleoli: 1 - 10 24 | \item Mitoses: 1 - 10 25 | \item Class: benign, malignant 26 | } 27 | } 28 | \examples{ 29 | data(breastcancer) 30 | data <- optbin(breastcancer, method = "infogain") 31 | model <- OneR(data, verbose = TRUE) 32 | summary(model) 33 | plot(model) 34 | prediction <- predict(model, data) 35 | eval_model(prediction, data) 36 | } 37 | \references{ 38 | The data were obtained from the UCI machine learning repository, see \url{https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)} 39 | } 40 | \keyword{Wisconsin} 41 | \keyword{breast} 42 | \keyword{cancer} 43 | \keyword{data} 44 | \keyword{datasets} 45 | -------------------------------------------------------------------------------- /man/maxlevels.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/OneR.R 3 | \name{maxlevels} 4 | \alias{maxlevels} 5 | \title{Remove factors with too many levels} 6 | \usage{ 7 | maxlevels(data, maxlevels = 20, na.omit = TRUE) 8 | } 9 | \arguments{ 10 | \item{data}{data frame which contains the data.} 11 | 12 | \item{maxlevels}{number of maximum factor levels.} 13 | 14 | \item{na.omit}{logical value whether missing values should be treated as a level, defaults to omit missing values before counting.} 15 | } 16 | \value{ 17 | A data frame. 
18 | } 19 | \description{ 20 | Removes all columns of a data frame where a factor (or character string) has more than a maximum number of levels. 21 | } 22 | \details{ 23 | Often categories that have very many levels are not useful in modelling OneR rules because they result in too many rules and tend to overfit. 24 | Examples are IDs or names. 25 | 26 | Character strings are treated as factors although they keep their datatype. Numeric data is left untouched. 27 | If data contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given. 28 | } 29 | \examples{ 30 | df <- data.frame(numeric = c(1:26), alphabet = letters) 31 | str(df) 32 | str(maxlevels(df)) 33 | } 34 | \references{ 35 | \url{https://github.com/vonjd/OneR} 36 | } 37 | \seealso{ 38 | \code{\link{OneR}} 39 | } 40 | \author{ 41 | Holger von Jouanne-Diedrich 42 | } 43 | -------------------------------------------------------------------------------- /man/predict.OneR.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/OneR.R 3 | \name{predict.OneR} 4 | \alias{predict.OneR} 5 | \title{Predict method for OneR models} 6 | \usage{ 7 | \method{predict}{OneR}(object, newdata, type = c("class", "prob"), ...) 8 | } 9 | \arguments{ 10 | \item{object}{object of class \code{"OneR"}.} 11 | 12 | \item{newdata}{data frame in which to look for the feature variable with which to predict.} 13 | 14 | \item{type}{character string denoting the type of predicted value returned. Default \code{"class"} gives a named vector with the predicted classes, \code{"prob"} gives a matrix whose columns are the probability of the first, second, etc. class.} 15 | 16 | \item{...}{further arguments passed to or from other methods.} 17 | } 18 | \value{ 19 | The default is a factor with the predicted classes, if \code{"type = prob"} a matrix is returned whose columns are the probability of the first, second, etc. class. 20 | } 21 | \description{ 22 | Predict cases or probabilities based on OneR model object. 23 | } 24 | \details{ 25 | \code{newdata} can have the same format as used for building the model but must at least have the feature variable that is used in the OneR rules. 26 | If cases appear that were not present when building the model the predicted case is \code{UNSEEN} or \code{NA} when \code{"type = prob"}. 27 | } 28 | \examples{ 29 | model <- OneR(iris) 30 | prediction <- predict(model, iris[1:4]) 31 | eval_model(prediction, iris[5]) 32 | 33 | ## type prob 34 | predict(model, data.frame(Petal.Width = seq(0, 3, 0.5))) 35 | predict(model, data.frame(Petal.Width = seq(0, 3, 0.5)), type = "prob") 36 | } 37 | \references{ 38 | \url{https://github.com/vonjd/OneR} 39 | } 40 | \seealso{ 41 | \code{\link{OneR}} 42 | } 43 | \author{ 44 | Holger von Jouanne-Diedrich 45 | } 46 | -------------------------------------------------------------------------------- /man/eval_model.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/OneR.R 3 | \name{eval_model} 4 | \alias{eval_model} 5 | \title{Classification Evaluation function} 6 | \usage{ 7 | eval_model(prediction, actual, dimnames = c("Prediction", "Actual"), 8 | zero.print = "0") 9 | } 10 | \arguments{ 11 | \item{prediction}{vector which contains the predicted values.} 12 | 13 | \item{actual}{data frame which contains the actual data. 
When there is more than one column the last column is taken. A single vector is allowed too.}
14 |
15 | \item{dimnames}{character vector of printed dimnames for the confusion matrices.}
16 |
17 | \item{zero.print}{character specifying how zeros should be printed; for sparse confusion matrices, using "." can produce more readable results.}
18 | }
19 | \value{
20 | Invisibly returns a list with the number of correctly classified and total instances and a confusion matrix with the absolute numbers.
21 | }
22 | \description{
23 | Function for evaluating a OneR classification model. Prints confusion matrices with prediction vs. actual in absolute and relative numbers. Additionally it gives the accuracy, error rate as well as the error rate reduction versus the base rate accuracy together with a p-value.
24 | }
25 | \details{
26 | Error rate reduction versus the base rate accuracy is calculated by the following formula:\cr\cr
27 | \eqn{(Accuracy(Prediction) - Accuracy(Baserate)) / (1 - Accuracy(Baserate))},\cr\cr
28 | giving a number between 0 (no error reduction) and 1 (no error).\cr\cr
29 | In some borderline cases when the model is performing worse than the base rate negative numbers can result. This shows that something is seriously wrong with the model generating this prediction.\cr\cr
30 | The provided p-value gives the probability of obtaining a distribution of predictions like this (or even more unambiguous) under the assumption that the real accuracy is equal to or lower than the base rate accuracy.
31 | More technically it is derived from a one-sided binomial test with the alternative hypothesis that the prediction's accuracy is greater than the base rate accuracy.
32 | Loosely speaking a low p-value (< 0.05) signifies that the model really is able to give predictions that are better than the base rate.
33 | }
34 | \examples{
35 | data <- iris
36 | model <- OneR(data)
37 | summary(model)
38 | prediction <- predict(model, data)
39 | eval_model(prediction, data)
40 | }
41 | \references{
42 | \url{https://github.com/vonjd/OneR}
43 | }
44 | \author{
45 | Holger von Jouanne-Diedrich
46 | }
47 | \keyword{accuracy}
48 | \keyword{evaluation}
49 |
--------------------------------------------------------------------------------
/man/bin.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/OneR.R
3 | \name{bin}
4 | \alias{bin}
5 | \title{Binning function}
6 | \usage{
7 | bin(data, nbins = 5, labels = NULL, method = c("length", "content",
8 |   "clusters"), na.omit = TRUE)
9 | }
10 | \arguments{
11 | \item{data}{data frame or vector which contains the data.}
12 |
13 | \item{nbins}{number of bins (= levels).}
14 |
15 | \item{labels}{character vector of labels for the resulting category.}
16 |
17 | \item{method}{character string specifying the binning method, see 'Details'; can be abbreviated.}
18 |
19 | \item{na.omit}{logical value whether instances with missing values should be removed.}
20 | }
21 | \value{
22 | A data frame or vector.
23 | }
24 | \description{
25 | Discretizes all numerical data in a data frame into categorical bins of equal length or content or based on automatically determined clusters.
26 | }
27 | \details{
28 | Character strings and logical vectors are coerced into factors. Matrices are coerced into data frames. When called with a single vector only the respective factor (and not a data frame) is returned.
29 | Method \code{"length"} gives intervals of equal length, method \code{"content"} gives intervals of equal content (via quantiles).
30 | Method \code{"clusters"} determines \code{"nbins"} clusters via 1D kmeans with deterministic seeding of the initial cluster centres (Jenks natural breaks optimization).
31 |
32 | When \code{"na.omit = FALSE"} an additional level \code{"NA"} is added to each factor with missing values.
33 | }
34 | \examples{
35 | data <- iris
36 | str(data)
37 | str(bin(data))
38 | str(bin(data, nbins = 3))
39 | str(bin(data, nbins = 3, labels = c("small", "medium", "large")))
40 |
41 | ## Difference between methods "length" and "content"
42 | set.seed(1); table(bin(rnorm(900), nbins = 3))
43 | set.seed(1); table(bin(rnorm(900), nbins = 3, method = "content"))
44 |
45 | ## Method "clusters"
46 | intervals <- paste(levels(bin(faithful$waiting, nbins = 2, method = "cluster")), collapse = " ")
47 | hist(faithful$waiting, main = paste("Intervals:", intervals))
48 | abline(v = c(42.9, 67.5, 96.1), col = "blue")
49 |
50 | ## Missing values
51 | bin(c(1:10, NA), nbins = 2, na.omit = FALSE) # adds new level "NA"
52 | bin(c(1:10, NA), nbins = 2) # omits missing values by default (with warning)
53 | }
54 | \references{
55 | \url{https://github.com/vonjd/OneR}
56 | }
57 | \seealso{
58 | \code{\link{OneR}}, \code{\link{optbin}}
59 | }
60 | \author{
61 | Holger von Jouanne-Diedrich
62 | }
63 | \keyword{Jenks}
64 | \keyword{binning}
65 | \keyword{breaks}
66 | \keyword{clusters}
67 | \keyword{discretization}
68 | \keyword{discretize}
69 |
--------------------------------------------------------------------------------
/man/OneR.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/OneR_main.R
3 | \name{OneR}
4 | \alias{OneR}
5 | \alias{OneR.formula}
6 | \alias{OneR.data.frame}
7 | \title{One Rule function}
8 | \usage{
9 | OneR(x, ...)
10 |
11 | \method{OneR}{formula}(formula, data, ties.method = c("first", "chisq"),
12 |   verbose = FALSE, ...)
13 |
14 | \method{OneR}{data.frame}(x, ties.method = c("first", "chisq"),
15 |   verbose = FALSE, ...)
16 | }
17 | \arguments{
18 | \item{x}{data frame with the last column containing the target variable.}
19 |
20 | \item{...}{arguments passed to or from other methods.}
21 |
22 | \item{formula}{formula, additionally the argument \code{data} is needed.}
23 |
24 | \item{data}{data frame which contains the data, only needed when using the formula interface.}
25 |
26 | \item{ties.method}{character string specifying how ties are treated, see 'Details'; can be abbreviated.}
27 |
28 | \item{verbose}{if \code{TRUE} prints rank, names and predictive accuracy of the attributes in decreasing order (with \code{ties.method = "first"}).}
29 | }
30 | \value{
31 | Returns an object of class "OneR". Internally this is a list consisting of the function call with the specified arguments, the names of the target and feature variables,
32 | a list of the rules, the number of correctly classified and total instances and the contingency table of the best predictor vs. the target variable.
33 | }
34 | \description{
35 | Builds a model according to the One Rule (OneR) machine learning classification algorithm.
36 | }
37 | \details{
38 | All numerical data is automatically converted into five categorical bins of equal length. Instances with missing values are removed.
39 | This is done by internally calling the default version of \code{\link{bin}} before starting the OneR algorithm.
40 | To fine-tune this behaviour data preprocessing with the \code{\link{bin}} or \code{\link{optbin}} functions should be performed.
41 | If data contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given.
42 |
43 | When there is more than one attribute with best performance either the first (from left to right) is chosen (method \code{"first"}) or
44 | the one with the lowest p-value of a chi-squared test (method \code{"chisq"}).
45 | }
46 | \section{Methods (by class)}{
47 | \itemize{
48 | \item \code{formula}: method for formulas.
49 |
50 | \item \code{data.frame}: method for data frames.
51 | }}
52 |
53 | \examples{
54 | data <- optbin(iris)
55 | model <- OneR(data, verbose = TRUE)
56 | summary(model)
57 | plot(model)
58 | prediction <- predict(model, data)
59 | eval_model(prediction, data)
60 |
61 | ## The same with the formula interface:
62 | data <- optbin(iris)
63 | model <- OneR(Species ~., data = data, verbose = TRUE)
64 | summary(model)
65 | plot(model)
66 | prediction <- predict(model, data)
67 | eval_model(prediction, data)
68 | }
69 | \references{
70 | \url{https://github.com/vonjd/OneR}
71 | }
72 | \seealso{
73 | \code{\link{bin}}, \code{\link{optbin}}, \code{\link{eval_model}}, \code{\link{maxlevels}}
74 | }
75 | \author{
76 | Holger von Jouanne-Diedrich
77 | }
78 | \keyword{1R}
79 | \keyword{One}
80 | \keyword{OneR}
81 | \keyword{Rule}
82 |
--------------------------------------------------------------------------------
/vignettes/OneR.R:
--------------------------------------------------------------------------------
1 | ## ------------------------------------------------------------------------
2 | library(OneR)
3 |
4 | ## ------------------------------------------------------------------------
5 | data <- optbin(iris)
6 |
7 | ## ------------------------------------------------------------------------
8 | model <- OneR(data, verbose = TRUE)
9 |
10 | ## ------------------------------------------------------------------------
11 | summary(model)
12 |
13 | ## ---- fig.width=7.15, fig.height=5---------------------------------------
14 | plot(model)
15 |
16 | ## ------------------------------------------------------------------------
17 | prediction <- predict(model, data)
18 |
19 | ## ------------------------------------------------------------------------
20 | eval_model(prediction, data)
21 |
22 | ## ------------------------------------------------------------------------
23 | data(breastcancer)
24 | data <- breastcancer
25 |
26 | ## ------------------------------------------------------------------------
27 | set.seed(12) # for reproducibility
28 | random <- sample(1:nrow(data), 0.8 * nrow(data))
29 | data_train <- optbin(data[random, ], method = "infogain")
30 | data_test <- data[-random, ]
31 |
32 | ## ------------------------------------------------------------------------
33 | model_train <- OneR(data_train, verbose = TRUE)
34 |
35 | ## ------------------------------------------------------------------------
36 | summary(model_train)
37 |
38 | ## ---- fig.width=7.15, fig.height=5---------------------------------------
39 | plot(model_train)
40 |
41 | ## ------------------------------------------------------------------------
42 | prediction <- predict(model_train, data_test)
43 |
44 | ## ------------------------------------------------------------------------
45 | eval_model(prediction, data_test)
46 |
47 | ##
------------------------------------------------------------------------ 48 | data <- iris 49 | str(data) 50 | str(bin(data)) 51 | str(bin(data, nbins = 3)) 52 | str(bin(data, nbins = 3, labels = c("small", "medium", "large"))) 53 | 54 | ## ------------------------------------------------------------------------ 55 | set.seed(1); table(bin(rnorm(900), nbins = 3)) 56 | set.seed(1); table(bin(rnorm(900), nbins = 3, method = "content")) 57 | 58 | ## ---- fig.width=7.15, fig.height=5--------------------------------------- 59 | intervals <- paste(levels(bin(faithful$waiting, nbins = 2, method = "cluster")), collapse = " ") 60 | hist(faithful$waiting, main = paste("Intervals:", intervals)) 61 | abline(v = c(42.9, 67.5, 96.1), col = "blue") 62 | 63 | ## ------------------------------------------------------------------------ 64 | bin(c(1:10, NA), nbins = 2, na.omit = FALSE) # adds new level "NA" 65 | bin(c(1:10, NA), nbins = 2) 66 | 67 | ## ------------------------------------------------------------------------ 68 | df <- data.frame(numeric = c(1:26), alphabet = letters) 69 | str(df) 70 | str(maxlevels(df)) 71 | 72 | ## ------------------------------------------------------------------------ 73 | model <- OneR(iris) 74 | predict(model, data.frame(Petal.Width = seq(0, 3, 0.5))) 75 | 76 | ## ------------------------------------------------------------------------ 77 | predict(model, data.frame(Petal.Width = seq(0, 3, 0.5)), type = "prob") 78 | 79 | ## ---- eval=FALSE--------------------------------------------------------- 80 | # help(package = OneR) 81 | 82 | -------------------------------------------------------------------------------- /man/optbin.Rd: -------------------------------------------------------------------------------- 1 | % Generated by roxygen2: do not edit by hand 2 | % Please edit documentation in R/OneR.R 3 | \name{optbin} 4 | \alias{optbin} 5 | \alias{optbin.formula} 6 | \alias{optbin.data.frame} 7 | \title{Optimal Binning function} 8 | \usage{ 9 | optbin(x, ...) 10 | 11 | \method{optbin}{formula}(formula, data, method = c("logreg", "infogain", 12 | "naive"), na.omit = TRUE, ...) 13 | 14 | \method{optbin}{data.frame}(x, method = c("logreg", "infogain", "naive"), 15 | na.omit = TRUE, ...) 16 | } 17 | \arguments{ 18 | \item{x}{data frame with the last column containing the target variable.} 19 | 20 | \item{...}{arguments passed to or from other methods.} 21 | 22 | \item{formula}{formula, additionally the argument \code{data} is needed.} 23 | 24 | \item{data}{data frame which contains the data, only needed when using the formula interface.} 25 | 26 | \item{method}{character string specifying the method for optimal binning, see 'Details'; can be abbreviated.} 27 | 28 | \item{na.omit}{logical value whether instances with missing values should be removed.} 29 | } 30 | \value{ 31 | A data frame with the target variable being in the last column. 32 | } 33 | \description{ 34 | Discretizes all numerical data in a data frame into categorical bins where the cut points are optimally aligned with the target categories, thereby a factor is returned. 35 | When building a OneR model this could result in fewer rules with enhanced accuracy. 36 | } 37 | \details{ 38 | The cutpoints are calculated by pairwise logistic regressions (method \code{"logreg"}), information gain (method \code{"infogain"}) or as the means of the expected values of the respective classes (\code{"naive"}). 
39 | The function is likely to give unsatisfactory results when the distributions of the respective classes are not (linearly) separable. Method \code{"naive"} should only be used when distributions are (approximately) normal,
40 | although in this case \code{"logreg"} should give comparable results, so it is the preferable (and therefore default) method.
41 |
42 | Method \code{"infogain"} is an entropy based method which calculates cut points based on information gain. The idea is that uncertainty is minimized by making the resulting bins as pure as possible. This method is the standard method of many decision tree algorithms.
43 |
44 | Character strings and logical vectors are coerced into factors. Matrices are coerced into data frames. If the target is numeric it is turned into a factor with the number of levels equal to the number of values. Additionally a warning is given.
45 |
46 | When \code{"na.omit = FALSE"} an additional level \code{"NA"} is added to each factor with missing values.
47 | If the target contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given.
48 | }
49 | \section{Methods (by class)}{
50 | \itemize{
51 | \item \code{formula}: method for formulas.
52 |
53 | \item \code{data.frame}: method for data frames.
54 | }}
55 |
56 | \examples{
57 | data <- iris # without optimal binning
58 | model <- OneR(data, verbose = TRUE)
59 | summary(model)
60 |
61 | data_opt <- optbin(iris) # with optimal binning
62 | model_opt <- OneR(data_opt, verbose = TRUE)
63 | summary(model_opt)
64 |
65 | ## The same with the formula interface:
66 | data_opt <- optbin(Species ~., data = iris)
67 | model_opt <- OneR(data_opt, verbose = TRUE)
68 | summary(model_opt)
69 |
70 | }
71 | \references{
72 | \url{https://github.com/vonjd/OneR}
73 | }
74 | \seealso{
75 | \code{\link{OneR}}, \code{\link{bin}}
76 | }
77 | \author{
78 | Holger von Jouanne-Diedrich
79 | }
80 | \keyword{binning}
81 | \keyword{discretization}
82 | \keyword{discretize}
83 |
--------------------------------------------------------------------------------
/NEWS:
--------------------------------------------------------------------------------
1 | OneR 2.2 (2017-05-05)
2 | =====================
3 |
4 | MAJOR IMPROVEMENTS
5 | - OneR: massive speedup of main OneR function (> 30 times faster).
6 | - OneR & optbin: standard S3 method interface added for formulas and data frames.
7 |
8 | MINOR IMPROVEMENTS
9 | - optbin: speedup of method "infogain".
10 | - Some minor corrections in documentation.
11 |
12 |
13 | OneR 2.1 (2016-10-24)
14 | =====================
15 |
16 | NEW FEATURES
17 | - eval_model: all instances of "prediction" and "actual" are now being printed in the confusion matrices. Two new arguments were added: "dimnames" for the printed dimnames of the confusion matrices and "zero.print" specifying how zeros should be printed; for sparse confusion matrices, using "." can produce more readable results. A new performance measure "error rate reduction versus the base rate accuracy" was added together with a p-value.
18 |
19 | MINOR IMPROVEMENTS
20 | - Some minor corrections in documentation.
21 | - Some streamlining and consolidation of code for better maintenance.
22 |
23 |
24 | OneR 2.0 (2016-08-12)
25 | =====================
26 |
27 | NEW FEATURES
28 | - Added a vignette.
29 | - breastcancer: Breast Cancer Wisconsin Original Data Set now included in the package.
30 | - predict: new type "prob" which gives a matrix whose columns are the probability of the first, second, etc. class.
31 | - optbin: new method "infogain" (information gain) which is an entropy based method to determine the cutpoints which make the resulting bins as pure as possible. 32 | - OneR, optbin, maxlevels: consistent handling of unused factor levels (e.g. due to subsetting) was added. These are dropped for analysis and a warning is given. 33 | 34 | MINOR IMPROVEMENTS 35 | - bin & optbin: in case of removing instances due to missing values the resulting warning gives the number of removed instances. 36 | - maxlevels: with data containing missing values an unhelpful warning was given. 37 | - predict: numerical values that are smaller or bigger than model limits are now transformed into (-Inf, min] or (max, Inf] respectively. 38 | - predict: output of type "class" is a factor now. 39 | - Some streamlining and consolidation of code for better maintenance. 40 | 41 | BUGFIXES 42 | - bin & optbin: in some borderline cases when the function addNA was used in preprocessing print.OneR stopped with an error. 43 | 44 | 45 | OneR 1.3 (2016-07-22) 46 | ===================== 47 | 48 | NEW FEATURES 49 | - bin: new method "clusters", which determines the bins according to automatically determined clusters in the data. 50 | - OneR: a new element "call" with the specified arguments of the actual function call was added to the internal class structure of OneR objects. 51 | - print & summary: the function call with the specified arguments which was used to build the model is printed first. 52 | 53 | MINOR IMPROVEMENTS 54 | - bin & optbin: in cases where there were missing values and already a factor level "NA" the functions gave an unhelpful warning. 55 | - eval_model: added warning when actual contains missing values. 56 | - eval_model: added "Confusion matrix" to printout for clarity. 57 | - Extension of and minor corrections in documentation 58 | - Some minor streamlining of code. 59 | 60 | BUGFIXES 61 | - predict: the combination of intervals and "NA"s caused an error. 62 | - bin: the method "content" stopped with an error in case of missing values. 63 | - optbin: the method "logreg" stopped in some borderline cases with missing values with an error. 64 | - optbin: some borderline cases could result in a "breaks are not unique" error. 65 | - OneR: in some borderline cases with very large datasets the numbering of printed ranks (verbose = TRUE) could be wrong due to rounding errors. 66 | 67 | 68 | OneR 1.2 (2016-06-20) 69 | ===================== 70 | 71 | Initial release on CRAN 72 | -------------------------------------------------------------------------------- /R/OneR_internal.R: -------------------------------------------------------------------------------- 1 | # Internal OneR functions 2 | 3 | # modified cut function for ensuring consistency of cut points and chosen cut points 4 | # http://stackoverflow.com/questions/37899503/inconsistent-behaviour-of-cut-different-intervals-with-same-number-and-same-d 5 | CUT <- function(x, breaks, ...) { 6 | if (length(breaks) == 1L) { 7 | nb <- as.integer(breaks + 1) 8 | dx <- diff(rx <- range(x, na.rm = TRUE)) 9 | if (dx == 0) { 10 | dx <- abs(rx[1L]) 11 | breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000, length.out = nb) 12 | } else { 13 | breaks <- seq.int(rx[1L], rx[2L], length.out = nb) 14 | breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000) 15 | } 16 | } 17 | breaks.f <- c(breaks[1], as.numeric(formatC(0 + breaks[2:(length(breaks)-1)], digits = 3, width = 1L)), breaks[length(breaks)]) 18 | cut(x, breaks = unique(breaks.f), ...) 
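  # Note: the interior breaks were rounded to 3 significant digits (via formatC above)
  # so that the interval labels printed by cut() stay consistent with the actual cut
  # points; unique() guards against duplicate breaks introduced by this rounding.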
19 | } 20 | 21 | nerrors <- function(x) { 22 | sum(rowSums(x) - apply(x, 1, max)) 23 | } 24 | 25 | mode <- function(x) { 26 | names(sort(-table(x[ , ncol(x)])))[1] 27 | } 28 | 29 | ADDNA <- function(x) { 30 | if (is.factor(x) & !("NA" %in% levels(x))) x <- factor(x, levels = c(levels(x), "NA")) 31 | x[is.na(x)] <- "NA" 32 | x 33 | } 34 | 35 | add_range <- function(x, midpoints) { 36 | c(min(x, na.rm = TRUE) - 1/1000 * diff(range(x, na.rm = TRUE)), midpoints, max(x, na.rm = TRUE) + 1/1000 * diff(range(x, na.rm = TRUE))) 37 | } 38 | 39 | get_breaks <- function(x) { 40 | x <- x[x != "NA"] 41 | lower = as.numeric(sub("\\((.+),.*", "\\1", x)) 42 | upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", x)) 43 | breaks <- unique(na.omit(c(lower, upper))) 44 | breaks 45 | } 46 | 47 | #' @importFrom stats coef 48 | #' @importFrom stats glm 49 | #' @importFrom stats binomial 50 | logreg_midpoint <- function(data) { 51 | df <- data.frame(x = unlist(data), target = factor(rep(names(data), sapply(data, length)))) 52 | coefs <- suppressWarnings(coef(glm(target ~ x, data = df, family = binomial))) 53 | midpoint <- - coefs[1] / coefs[2] 54 | # test limits 55 | range <- sort(sapply(data, mean, na.rm = TRUE)) 56 | if (length(range) == 1) range <- c(range, range) 57 | if (is.na(midpoint)) return(mean(range, na.rm = TRUE)) 58 | if (midpoint < range[1]) return(range[1]) 59 | if (midpoint > range[2]) return(range[2]) 60 | # --- 61 | midpoint 62 | } 63 | 64 | entropy <- function(x) { 65 | freqs <- table(x) / length(x) 66 | - sum(freqs * log2(freqs)) 67 | } 68 | 69 | #' @importFrom stats na.omit 70 | infogain_midpoint <- function(data) { 71 | df <- data.frame(numvar = unlist(data), target = factor(rep(names(data), sapply(data, length)))) 72 | data <- na.omit(df[order(df[ , 1]), ]) 73 | numvar <- data$numvar; target <- data$target 74 | # determine midpoint candidates 75 | left_thresholds <- which(as.logical(diff(as.numeric(target)))) 76 | midpoints <- (numvar[left_thresholds] + numvar[(left_thresholds + 1)]) / 2 77 | # calculate average entropies for all midpoint candidates 78 | belows <- lapply(midpoints, function(x) as.character(data[numvar <= x, 2])) 79 | aboves <- lapply(midpoints, function(x) as.character(data[numvar > x, 2])) 80 | below_entropies <- sapply(belows, function(x) length(x)/length(target) * entropy(x)) 81 | above_entropies <- sapply(aboves, function(x) length(x)/length(target) * entropy(x)) 82 | # calculate entropies after split and choose lowest 83 | after_entropies <- below_entropies + above_entropies 84 | midpoints[which.min(after_entropies)] 85 | } 86 | 87 | #' @importFrom stats na.omit 88 | #' @importFrom stats filter 89 | optcut <- function(x, target, method) { 90 | orig <- x 91 | tmp <- na.omit(cbind(x, target)) 92 | x <- tmp[ , 1]; target <- tmp[ , 2] 93 | xs <- split(x, target) 94 | if (method == "naive") { 95 | midpoints <- sort(sapply(xs, mean, na.rm = TRUE)) 96 | # Cutpoints are the means of the expected values of the respective target levels. 
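    # filter(midpoints, c(1/2, 1/2)) yields the running mean of each pair of adjacent
    # (sorted) class means, i.e. the halfway points between consecutive classes;
    # na.omit() drops the boundary NA that filter() produces at the end of the series.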
97 |     breaks <- add_range(x, na.omit(filter(midpoints, c(1/2, 1/2))))
98 |   } else {
99 |     midpoints <- sapply(xs, mean, na.rm = TRUE)
100 |     nl <- xs[order(midpoints)]
101 |     pairs <- matrix(c(1:(length(nl) - 1), 2:length(nl)), ncol = 2, byrow = TRUE)
102 |     if (method == "logreg") {
103 |       midpoints <- apply(pairs, 1, function(x) logreg_midpoint(c(nl[x[1]], nl[x[2]])))
104 |     }
105 |     if (method == "infogain") {
106 |       midpoints <- apply(pairs, 1, function(x) infogain_midpoint(c(nl[x[1]], nl[x[2]])))
107 |     }
108 |     breaks <- add_range(x, na.omit(midpoints))
109 |   }
110 |   CUT(orig, breaks = unique(breaks))
111 | }
112 |
--------------------------------------------------------------------------------
/R/OneR_main.R:
--------------------------------------------------------------------------------
1 | # OneR main function
2 |
3 | #' One Rule function
4 | #'
5 | #' Builds a model according to the One Rule (OneR) machine learning classification algorithm.
6 | #' @param x data frame with the last column containing the target variable.
7 | #' @param formula formula, additionally the argument \code{data} is needed.
8 | #' @param data data frame which contains the data, only needed when using the formula interface.
9 | #' @param ties.method character string specifying how ties are treated, see 'Details'; can be abbreviated.
10 | #' @param verbose if \code{TRUE} prints rank, names and predictive accuracy of the attributes in decreasing order (with \code{ties.method = "first"}).
11 | #' @param ... arguments passed to or from other methods.
12 | #' @return Returns an object of class "OneR". Internally this is a list consisting of the function call with the specified arguments, the names of the target and feature variables,
13 | #' a list of the rules, the number of correctly classified and total instances and the contingency table of the best predictor vs. the target variable.
14 | #' @keywords 1R OneR One Rule
15 | #' @details All numerical data is automatically converted into five categorical bins of equal length. Instances with missing values are removed.
16 | #' This is done by internally calling the default version of \code{\link{bin}} before starting the OneR algorithm.
17 | #' To fine-tune this behaviour data preprocessing with the \code{\link{bin}} or \code{\link{optbin}} functions should be performed.
18 | #' If data contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given.
19 | #'
20 | #' When there is more than one attribute with best performance either the first (from left to right) is chosen (method \code{"first"}) or
21 | #' the one with the lowest p-value of a chi-squared test (method \code{"chisq"}).
22 | #' @author Holger von Jouanne-Diedrich
23 | #' @references \url{https://github.com/vonjd/OneR}
24 | #' @seealso \code{\link{bin}}, \code{\link{optbin}}, \code{\link{eval_model}}, \code{\link{maxlevels}}
25 | #' @examples
26 | #' data <- optbin(iris)
27 | #' model <- OneR(data, verbose = TRUE)
28 | #' summary(model)
29 | #' plot(model)
30 | #' prediction <- predict(model, data)
31 | #' eval_model(prediction, data)
32 | #'
33 | #' ## The same with the formula interface:
34 | #' data <- optbin(iris)
35 | #' model <- OneR(Species ~., data = data, verbose = TRUE)
36 | #' summary(model)
37 | #' plot(model)
38 | #' prediction <- predict(model, data)
39 | #' eval_model(prediction, data)
40 | #' @importFrom stats model.frame
41 | #' @importFrom stats chisq.test
42 | #' @export
43 | OneR <- function(x, ...)
UseMethod("OneR") 44 | 45 | #' @export 46 | OneR.default <- function(x, ...) { 47 | stop("data type not supported") 48 | } 49 | 50 | #' @export 51 | #' @describeIn OneR method for formulas. 52 | OneR.formula <- function(formula, data, ties.method = c("first", "chisq"), verbose = FALSE, ...) { 53 | call <- match.call() 54 | method <- match.arg(ties.method) 55 | mf <- model.frame(formula = formula, data = data, na.action = NULL) 56 | data <- mf[c(2:ncol(mf), 1)] 57 | OneR.data.frame(x = data, ties.method = ties.method, verbose = verbose, fcall = call) 58 | } 59 | 60 | #' @export 61 | #' @describeIn OneR method for data frames. 62 | OneR.data.frame <- function(x, ties.method = c("first", "chisq"), verbose = FALSE, ...) { 63 | if (!is.null(list(...)$fcall)) call <- list(...)$fcall 64 | else call <- match.call() 65 | method <- match.arg(ties.method) 66 | data <- x 67 | if (dim(data.frame(data))[2] < 2) stop("data must have at least two columns") 68 | data <- bin(data) 69 | if (nrow(data) == 0) stop("no data to analyse") 70 | # test if unused factor levels and drop them for analysis 71 | nlevels_orig <- sum(sapply(data, nlevels)) 72 | data <- droplevels(data) 73 | nlevels_new <- sum(sapply(data, nlevels)) 74 | if (nlevels_new < nlevels_orig) warning("data contains unused factor levels") 75 | # main routine for finding the best predictor(s) 76 | tables <- lapply(data[ , 1:(ncol(data)-1), drop = FALSE], table, data[ , ncol(data)]) 77 | errors <- sapply(tables, nerrors) 78 | perf <- nrow(data) - errors 79 | target <- names(data[ , ncol(data), drop = FALSE]) 80 | best <- which(perf == max(perf)) 81 | # method "chisq 82 | if (length(best) > 1) { 83 | if (method == "chisq") { 84 | features <- names(data[ , best, drop = FALSE]) 85 | p.values <- sapply(features, function(x) suppressWarnings(chisq.test(table(c(data[target], data[x])))$p.value)) 86 | p.values[is.na(p.values)] <- Inf 87 | if (all(p.values == Inf)) warning("chi-squared tests failed, first best attribute is chosen instead") 88 | best <- best[which.min(p.values)] 89 | } else best <- best[1] 90 | } 91 | # preparation and output of results 92 | groups <- split(data[ , ncol(data), drop = FALSE], data[ , best]) 93 | majority <- lapply(groups, mode) 94 | feature <- names(data[ , best, drop = FALSE]) 95 | cont_table <- table(c(data[target], data[feature])) 96 | output <- c(call = call, 97 | target = target, 98 | feature = feature, 99 | rules = list(majority), 100 | correct_instances = max(perf), 101 | total_instances = nrow(data), 102 | cont_table = list(cont_table)) 103 | class(output) <- "OneR" 104 | # print additional diagnostic information if wanted 105 | if (verbose == TRUE) { 106 | newbest <- which(which(perf == max(perf)) == best) 107 | accs <- round(100 * sort(perf, decreasing = TRUE) / nrow(data), 2) 108 | attr <- colnames(data[order(perf, decreasing = TRUE)]) 109 | M <- matrix(c(as.character(attr), paste0(accs, "%")), ncol = 2) 110 | rownames(M) <- rank((100 - sort(perf, decreasing = TRUE)), ties.method = "min") 111 | rownames(M)[newbest] <- paste0(rownames(M)[newbest], " *") 112 | colnames(M) <- c("Attribute", "Accuracy") 113 | cat("\n") 114 | print(M, quote = FALSE) 115 | cat("---\nChosen attribute due to accuracy\nand ties method (if applicable): '*'\n\n") 116 | } 117 | output 118 | } 119 | -------------------------------------------------------------------------------- /vignettes/OneR.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "OneR - Establishing a New Baseline for Machine 
Learning Classification Models"
3 | author: "An R package by Holger K. von Jouanne-Diedrich"
4 | date: "`r Sys.Date()`"
5 | output: rmarkdown::html_vignette
6 | vignette: >
7 |   %\VignetteIndexEntry{OneR - Establishing a New Baseline for Machine Learning Classification Models}
8 |   %\VignetteEngine{knitr::rmarkdown}
9 |   %\VignetteEncoding{UTF-8}
10 | ---
11 |
12 | **Note:** You can find a step-by-step introduction on YouTube: [Quick Start Guide for the OneR package](https://www.youtube.com/watch?v=AGC0oRlXxgU)
13 |
14 | ## Introduction
15 |
16 | The following story is one of the most often told in the Data Science community: some time ago the military built a system whose aim was to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and trucks on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics were taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had learned the difference between light and dark!
17 |
18 | Although this might be an urban legend the fact that it is so often told tells us something:
19 |
20 | 1. Many of our Machine Learning models are so complex that we cannot understand them ourselves.
21 | 2. Because of 1. we cannot differentiate between the simpler aspects of a problem which can be tackled by simple models and the more sophisticated ones which need specialized treatment.
22 |
23 | The above is not only true for neural networks (and especially deep neural networks) but for most of the methods used today, especially Support Vector Machines and Random Forests and in general all kinds of ensemble based methods.
24 |
25 | In a word: we need a good baseline which builds “the best simple model”, one that strikes a balance between the best possible accuracy and a model that is still simple enough to understand: I have developed the OneR package for finding this sweet spot and thereby establishing a new baseline for classification models in Machine Learning (ML).
26 |
27 | This package fills a longstanding gap because so far only a JAVA based implementation was available ([RWeka package](https://cran.r-project.org/package=RWeka) as an interface for the [OneR JAVA class](http://weka.sourceforge.net/doc.dev/weka/classifiers/rules/OneR.html)). Additionally several enhancements have been made (see below).
28 |
29 | ## Design principles for the OneR package
30 |
31 | The following design principles were followed for programming the package:
32 |
33 | - Easy: the learning curve for new users should be minimal. Results should be obtained with ease and only minimal preprocessing and modeling steps should be necessary.
34 | - Versatile: all types of data, i.e. categorical and numeric, should be computable - as input variables as well as targets.
35 | - Fast: the running times of model trainings should be short.
36 | - Accurate: the accuracy of trained models should be good overall.
37 | - Robust: models should not be prone to overfitting; the reached accuracy on training data should be comparable to the accuracy of predictions from new, unseen cases.
38 | - Comprehensible: it should be easy to understand which rules the model has learned. Not only should the rules be easily comprehensible but they should serve as heuristics that are usable even without a computer.
39 | - Reproducible: because the used algorithms are strictly deterministic one will always get the same models on the same data. Many ML algorithms have stochastic components so that the data scientist will get a different model every time.
40 | - Intuitive: model diagnostics should be presented in the form of simple tables and plots.
41 | - Native R: the whole package is written in native R code. Thereby the source code can be easily checked and the whole package is very lean. Additionally the package has no dependencies at all other than base R itself.
42 |
43 | The package is based on the – as the name might reveal – one rule classification algorithm [Holte93]. Although the underlying method is simple enough (basically 1-level decision trees, you can find out more here: [OneR](http://www.saedsayad.com/oner.htm)) several enhancements have been made:
44 |
45 | - Discretization of numeric data: the OneR algorithm can only handle categorical data, so numeric data has to be discretized. The original OneR algorithm separates the respective values into ever smaller buckets until the best possible accuracy is reached. It can be argued that this is the definition of overfitting and contradicts the original spirit of OneR because tons of rules (one for every bucket) will result. One can of course introduce a new parameter “maximum bucket size” but finding the right value for this one doesn't come naturally either. Therefore I take a radically different approach: there are several methods for handling numeric data in the package (in the bin and the optbin functions); the most promising one is the (default) “logreg” method in the optbin function which gives only as many bins as there are target categories and which optimizes the cut points according to pairwise logistic regressions.
46 | - Missing values: in the original algorithm missing values were always handled as a separate level in the respective attribute. While missing values can sometimes reveal interesting patterns in other cases they are, well, just values that are missing. In the OneR package missing values can be handled as separate levels (level “NA”) or they can be omitted (the default).
47 | - Tie breaking: sometimes the OneR algorithm will find several attributes that provide rules which all give the same best accuracy. The original algorithm just took the first attribute. While this is implemented in the OneR function as the default too, a different method for tie breaking can be chosen: the contingency tables of all “best” rules are tested against each other with a Pearson's chi-squared test and the one with the smallest p-value is chosen (see the sketch below). The rationale is that this finds the attribute with the best signal-to-noise ratio.
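
The following minimal sketch illustrates the chi-squared tie-breaking idea outside of the package internals; the two "tied" attributes here are picked arbitrarily for demonstration:

```{r, eval=FALSE}
library(OneR)
data <- bin(iris)                         # OneR's default preprocessing
tied <- c("Sepal.Length", "Petal.Width")  # hypothetical attributes with equal accuracy
p.values <- sapply(tied, function(attribute)
  suppressWarnings(chisq.test(table(data[[attribute]], data$Species))$p.value))
tied[which.min(p.values)]                 # the attribute with the clearest signal wins
```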
48 |
49 | ## Getting started with a simple example
50 |
51 | You can also watch this video which goes through the following example step-by-step:
52 |
53 | [Quick Start Guide for the OneR package (Video)](https://www.youtube.com/watch?v=AGC0oRlXxgU)
54 |
55 | After installing from CRAN load package
56 |
57 | ```{r}
58 | library(OneR)
59 | ```
60 |
61 | Use the famous Iris dataset and determine optimal bins for numeric data
62 |
63 | ```{r}
64 | data <- optbin(iris)
65 | ```
66 |
67 | Build model with best predictor
68 |
69 | ```{r}
70 | model <- OneR(data, verbose = TRUE)
71 | ```
72 |
73 | Show learned rules and model diagnostics
74 |
75 | ```{r}
76 | summary(model)
77 | ```
78 |
79 | Plot model diagnostics
80 |
81 | ```{r, fig.width=7.15, fig.height=5}
82 | plot(model)
83 | ```
84 |
85 | Use model to predict data
86 |
87 | ```{r}
88 | prediction <- predict(model, data)
89 | ```
90 |
91 | Evaluate prediction statistics
92 |
93 | ```{r}
94 | eval_model(prediction, data)
95 | ```
96 |
97 | Please note that the very good accuracy of 96% is reached effortlessly.
98 |
99 | "Petal.Width" is identified as the attribute with the highest predictive value. The cut points of the intervals are found automatically (via the included optbin function). The results are three very simple, yet accurate, rules to predict the respective species.
100 |
101 | The nearly perfect separation of the areas in the diagnostic plot gives a good indication of the model’s ability to separate the different species.
102 |
103 | ## A more sophisticated real-world example
104 |
105 | The next example tries to find a model for the identification of breast cancer. The data were obtained from the UCI machine learning repository (see also the package documentation). According to this source the best out-of-sample performance was 95.9%, so let's see what we can achieve with the OneR package...
106 |
107 |
108 | ```{r}
109 | data(breastcancer)
110 | data <- breastcancer
111 | ```
112 |
113 | Divide training (80%) and test set (20%)
114 |
115 | ```{r}
116 | set.seed(12) # for reproducibility
117 | random <- sample(1:nrow(data), 0.8 * nrow(data))
118 | data_train <- optbin(data[random, ], method = "infogain")
119 | data_test <- data[-random, ]
120 | ```
121 |
122 | Train OneR model on training set
123 |
124 | ```{r}
125 | model_train <- OneR(data_train, verbose = TRUE)
126 | ```
127 |
128 | Show model and diagnostics
129 |
130 | ```{r}
131 | summary(model_train)
132 | ```
133 |
134 | Plot model diagnostics
135 |
136 | ```{r, fig.width=7.15, fig.height=5}
137 | plot(model_train)
138 | ```
139 |
140 | Use trained model to predict test set
141 |
142 | ```{r}
143 | prediction <- predict(model_train, data_test)
144 | ```
145 |
146 | Evaluate model performance on test set
147 |
148 | ```{r}
149 | eval_model(prediction, data_test)
150 | ```
151 |
152 | The best reported out-of-sample accuracy on this dataset was 95.9% and it was reached with considerable effort. The accuracy reached for the test set here is 94.3%! This is achieved with just one simple rule: when "Uniformity of Cell Size" is greater than 2 the examined tissue is malignant. The cut points of the intervals are again found automatically (via the included optbin function). The very good separation of the areas in the diagnostic plot gives a good indication of the model’s ability to differentiate between benign and malignant tissue.
## Included functions

### OneR

OneR is the main function of the package. It builds a model according to the One Rule machine learning algorithm for categorical data. All numerical data is automatically converted into five categorical bins of equal length. When `verbose = TRUE` it prints the predictive accuracy of all attributes in decreasing order.

### bin

bin discretizes all numerical data in a data frame into categorical bins of equal length or equal content, or based on automatically determined clusters.

Examples
```{r}
data <- iris
str(data)
str(bin(data))
str(bin(data, nbins = 3))
str(bin(data, nbins = 3, labels = c("small", "medium", "large")))
```

Difference between methods "length" and "content"

```{r}
set.seed(1); table(bin(rnorm(900), nbins = 3))
set.seed(1); table(bin(rnorm(900), nbins = 3, method = "content"))
```

Method "clusters"
```{r, fig.width=7.15, fig.height=5}
intervals <- paste(levels(bin(faithful$waiting, nbins = 2, method = "cluster")), collapse = " ")
hist(faithful$waiting, main = paste("Intervals:", intervals))
abline(v = c(42.9, 67.5, 96.1), col = "blue")
```

Handling of missing values

```{r}
bin(c(1:10, NA), nbins = 2, na.omit = FALSE) # adds new level "NA"
bin(c(1:10, NA), nbins = 2)
```

### optbin

optbin discretizes all numerical data in a data frame into categorical bins where the cut points are optimally aligned with the target categories; the binned columns are returned as factors. When building a OneR model this could result in fewer rules with enhanced accuracy. The cut points are calculated by pairwise logistic regressions (method "logreg") or as the means of the expected values of the respective classes ("naive"). The function is likely to give unsatisfactory results when the distributions of the respective classes are not (linearly) separable. Method "naive" should only be used when distributions are (approximately) normal, although in this case "logreg" should give comparable results, which makes "logreg" the preferable (and therefore default) method.

Method "infogain" is an entropy based method which calculates cut points based on information gain. The idea is that uncertainty is minimized by making the resulting bins as pure as possible. This method is the standard method of many decision tree algorithms.

### maxlevels

maxlevels removes all columns of a data frame where a factor (or character string) has more than a maximum number of levels. Often categories that have very many levels are not useful in modelling OneR rules because they result in too many rules and tend to overfit. Examples are IDs or names.

```{r}
df <- data.frame(numeric = c(1:26), alphabet = letters)
str(df)
str(maxlevels(df))
```

### predict

predict is an S3 method for predicting cases or probabilities based on OneR model objects. The second argument "newdata" can have the same format as used for building the model but must at least contain the feature variable that is used in the OneR rules. The default output is a factor with the predicted classes; cases that were not present when building the model are predicted as "UNSEEN".

```{r}
model <- OneR(iris)
predict(model, data.frame(Petal.Width = seq(0, 3, 0.5)))
```

If `type = "prob"` a matrix is returned whose columns are the probabilities of the first, second, etc. class.

```{r}
predict(model, data.frame(Petal.Width = seq(0, 3, 0.5)), type = "prob")
```

### eval_model

eval_model is a simple function for evaluating a OneR classification model. It prints confusion matrices with prediction vs. actual in absolute and relative numbers. Additionally it gives the accuracy and error rate, as well as the error rate reduction versus the base rate accuracy, together with a p-value. The second argument "actual" is a data frame which contains the actual data in the last column. A single vector is allowed too.

For the details please consult the available help entries.

## Help overview

From within R:

```{r, eval=FALSE}
help(package = OneR)
```

...or as a pdf here: [OneR.pdf](https://cran.r-project.org/package=OneR/OneR.pdf)

Issues can be posted here: https://github.com/vonjd/OneR/issues

The latest version of the package (and the full source code) can be found here: https://github.com/vonjd/OneR

## Sources

[Holte93] R. Holte: Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, 1993. Available online here: https://link.springer.com/article/10.1023/A:1022631118932

## Contact

I would love to hear about your experiences with the OneR package. Please drop me a note - you can reach me at my university account: [Holger K. von Jouanne-Diedrich](https://www.h-ab.de/nc/eng/about-aschaffenburg-university-of-applied-sciences/organisation/personal/?tx_fhapersonal_pi1%5BshowUid%5D=jouanne-diedrich)

## License

This package is under [MIT License](https://cran.r-project.org/package=OneR/LICENSE).
--------------------------------------------------------------------------------
/R/OneR.R:
--------------------------------------------------------------------------------
# OneR helper functions

#' Binning function
#'
#' Discretizes all numerical data in a data frame into categorical bins of equal length or content or based on automatically determined clusters.
#' @param data data frame or vector which contains the data.
#' @param nbins number of bins (= levels).
#' @param labels character vector of labels for the resulting category.
#' @param method character string specifying the binning method, see 'Details'; can be abbreviated.
#' @param na.omit logical value whether instances with missing values should be removed.
#' @return A data frame or vector.
#' @keywords binning discretization discretize clusters Jenks breaks
#' @details Character strings and logical vectors are coerced into factors. Matrices are coerced into data frames. When called with a single vector only the respective factor (and not a data frame) is returned.
#' Method \code{"length"} gives intervals of equal length, method \code{"content"} gives intervals of equal content (via quantiles).
#' Method \code{"clusters"} determines \code{"nbins"} clusters via 1D kmeans with deterministic seeding of the initial cluster centres (Jenks natural breaks optimization).
#'
#' When \code{"na.omit = FALSE"} an additional level \code{"NA"} is added to each factor with missing values.
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @seealso \code{\link{OneR}}, \code{\link{optbin}}
#' @examples
#' data <- iris
#' str(data)
#' str(bin(data))
#' str(bin(data, nbins = 3))
#' str(bin(data, nbins = 3, labels = c("small", "medium", "large")))
#'
#' ## Difference between methods "length" and "content"
#' set.seed(1); table(bin(rnorm(900), nbins = 3))
#' set.seed(1); table(bin(rnorm(900), nbins = 3, method = "content"))
#'
#' ## Method "clusters"
#' intervals <- paste(levels(bin(faithful$waiting, nbins = 2, method = "cluster")), collapse = " ")
#' hist(faithful$waiting, main = paste("Intervals:", intervals))
#' abline(v = c(42.9, 67.5, 96.1), col = "blue")
#'
#' ## Missing values
#' bin(c(1:10, NA), nbins = 2, na.omit = FALSE) # adds new level "NA"
#' bin(c(1:10, NA), nbins = 2) # omits missing values by default (with warning)
#' @importFrom stats quantile
#' @importFrom stats kmeans
#' @export
bin <- function(data, nbins = 5, labels = NULL, method = c("length", "content", "clusters"), na.omit = TRUE) {
  method <- match.arg(method)
  vec <- FALSE
  if (is.atomic(data) == TRUE & is.null(dim(data)) == TRUE) { vec <- TRUE; data <- data.frame(data) }
  # could be a matrix -> data frame (even with only one column)
  if (is.list(data) == FALSE) data <- data.frame(data)
  if (na.omit == TRUE) {
    len_rows_orig <- nrow(data)
    data <- na.omit(data)
    len_rows_new <- nrow(data)
    no_removed <- len_rows_orig - len_rows_new
    if (no_removed > 0) warning(paste(no_removed, "instance(s) removed due to missing values"))
  }
  if (!is.null(labels)) if (nbins != length(labels)) stop("number of 'nbins' and 'labels' differ")
  if (nbins <= 1) stop("nbins must be bigger than 1")
  data[] <- lapply(data, function(x) if (is.numeric(x)) {
    if (length(unique(x)) <= nbins) as.factor(x)
    else {
      if (method == "content") nbins <- add_range(x, na.omit(quantile(x, (1:(nbins-1)/nbins), na.rm = TRUE)))
      if (method == "clusters") {
        midpoints <- sort(kmeans(na.omit(x), centers = seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length = nbins))$centers)
        nbins <- add_range(x, na.omit(filter(midpoints, c(1/2, 1/2))))
      }
      CUT(x, breaks = unique(nbins), labels = labels)
    }
  } else as.factor(x))
  data[] <- lapply(data, function(x) if(any(is.na(as.character(x)))) ADDNA(x) else x)
  if (vec) { data <- unlist(data); names(data) <- NULL }
  data
}

#' Optimal Binning function
#'
#' Discretizes all numerical data in a data frame into categorical bins where the cut points are optimally aligned with the target categories; the binned columns are returned as factors.
#' When building a OneR model this could result in fewer rules with enhanced accuracy.
#' @param x data frame with the last column containing the target variable.
#' @param formula formula, additionally the argument \code{data} is needed.
#' @param data data frame which contains the data, only needed when using the formula interface.
#' @param method character string specifying the method for optimal binning, see 'Details'; can be abbreviated.
#' @param na.omit logical value whether instances with missing values should be removed.
#' @param ... arguments passed to or from other methods.
#' @return A data frame with the target variable being in the last column.
#' @keywords binning discretization discretize
#' @details The cut points are calculated by pairwise logistic regressions (method \code{"logreg"}), information gain (method \code{"infogain"}) or as the means of the expected values of the respective classes (\code{"naive"}).
#' The function is likely to give unsatisfactory results when the distributions of the respective classes are not (linearly) separable. Method \code{"naive"} should only be used when distributions are (approximately) normal,
#' although in this case \code{"logreg"} should give comparable results, which makes \code{"logreg"} the preferable (and therefore default) method.
#'
#' Method \code{"infogain"} is an entropy based method which calculates cut points based on information gain. The idea is that uncertainty is minimized by making the resulting bins as pure as possible. This method is the standard method of many decision tree algorithms.
#'
#' Character strings and logical vectors are coerced into factors. Matrices are coerced into data frames. If the target is numeric it is turned into a factor with the number of levels equal to the number of unique values. Additionally a warning is given.
#'
#' When \code{"na.omit = FALSE"} an additional level \code{"NA"} is added to each factor with missing values.
#' If the target contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given.
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @seealso \code{\link{OneR}}, \code{\link{bin}}
#' @examples
#' data <- iris # without optimal binning
#' model <- OneR(data, verbose = TRUE)
#' summary(model)
#'
#' data_opt <- optbin(iris) # with optimal binning
#' model_opt <- OneR(data_opt, verbose = TRUE)
#' summary(model_opt)
#'
#' ## The same with the formula interface:
#' data_opt <- optbin(Species ~., data = iris)
#' model_opt <- OneR(data_opt, verbose = TRUE)
#' summary(model_opt)
#'
#' @export
optbin <- function(x, ...) UseMethod("optbin")

#' @export
optbin.default <- function(x, ...) {
  stop("data type not supported")
}

#' @export
#' @describeIn optbin method for formulas.
optbin.formula <- function(formula, data, method = c("logreg", "infogain", "naive"), na.omit = TRUE, ...) {
  method <- match.arg(method)
  mf <- model.frame(formula = formula, data = data, na.action = NULL)
  data <- mf[c(2:ncol(mf), 1)]
  optbin.data.frame(x = data, method = method, na.omit = na.omit)
}

#' @export
#' @describeIn optbin method for data frames.
optbin.data.frame <- function(x, method = c("logreg", "infogain", "naive"), na.omit = TRUE, ...)
{
  method <- match.arg(method)
  data <- x
  if (dim(data)[2] < 2) stop("data must have at least two columns")
  if (is.numeric(data[ , ncol(data)]) == TRUE) warning("target is numeric")
  data[ncol(data)] <- as.factor(data[ , ncol(data)])
  if (na.omit == TRUE) {
    len_rows_orig <- nrow(data)
    data <- na.omit(data)
    len_rows_new <- nrow(data)
    no_removed <- len_rows_orig - len_rows_new
    if (no_removed > 0) warning(paste(no_removed, "instance(s) removed due to missing values"))
  } else {
    # only add NA to target
    if(any(is.na(as.character(data[ , ncol(data)])))) data[ncol(data)] <- ADDNA(data[ , ncol(data)])
  }
  target <- data[ , ncol(data)]
  # Test if unused factor levels and drop them for analysis
  nlevels_orig <- nlevels(target)
  target <- droplevels(target)
  nbins <- nlevels(target)
  if (nbins < nlevels_orig) warning("target contains unused factor levels")
  if (nbins <= 1) stop("number of target levels must be bigger than 1")
  data[] <- lapply(data, function(x) if (is.numeric(x)) {
    if (length(unique(x)) <= nbins) as.factor(x) else optcut(x, target, method)
  } else as.factor(x))
  data[] <- lapply(data, function(x) if(any(is.na(as.character(x)))) ADDNA(x) else x)
  data
}

#' Remove factors with too many levels
#'
#' Removes all columns of a data frame where a factor (or character string) has more than a maximum number of levels.
#' @param data data frame which contains the data.
#' @param maxlevels number of maximum factor levels.
#' @param na.omit logical value whether missing values should be treated as a level, defaults to omit missing values before counting.
#' @return A data frame.
#' @details Often categories that have very many levels are not useful in modelling OneR rules because they result in too many rules and tend to overfit.
#' Examples are IDs or names.
#'
#' Character strings are treated as factors although they keep their datatype. Numeric data is left untouched.
#' If data contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given.
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @seealso \code{\link{OneR}}
#' @examples
#' df <- data.frame(numeric = c(1:26), alphabet = letters)
#' str(df)
#' str(maxlevels(df))
#' @export
maxlevels <- function(data, maxlevels = 20, na.omit = TRUE) {
  if (is.list(data) == FALSE) stop("data must be a data frame")
  if (maxlevels <= 2) stop("maxlevels must be bigger than 2")
  tmp <- suppressWarnings(bin(data, nbins = 2, na.omit = na.omit))
  # Test if unused factor levels and drop them for analysis
  nlevels_orig <- sapply(tmp, nlevels)
  tmp <- droplevels(tmp)
  nlevels_new <- sapply(tmp, nlevels)
  if (sum(nlevels_new) < sum(nlevels_orig)) warning("data contains unused factor levels")
  cols <- nlevels_new <= maxlevels
  data[cols]
}

#' Predict method for OneR models
#'
#' Predict cases or probabilities based on OneR model object.
#' @param object object of class \code{"OneR"}.
#' @param newdata data frame in which to look for the feature variable with which to predict.
#' @param type character string denoting the type of predicted value returned. Default \code{"class"} gives a named vector with the predicted classes, \code{"prob"} gives a matrix whose columns are the probability of the first, second, etc. class.
#' @param ... further arguments passed to or from other methods.
#' @return The default is a factor with the predicted classes, if \code{"type = prob"} a matrix is returned whose columns are the probability of the first, second, etc. class.
#' @details \code{newdata} can have the same format as used for building the model but must at least have the feature variable that is used in the OneR rules.
#' If cases appear that were not present when building the model the predicted case is \code{UNSEEN} or \code{NA} when \code{"type = prob"}.
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @seealso \code{\link{OneR}}
#' @examples
#' model <- OneR(iris)
#' prediction <- predict(model, iris[1:4])
#' eval_model(prediction, iris[5])
#'
#' ## type prob
#' predict(model, data.frame(Petal.Width = seq(0, 3, 0.5)))
#' predict(model, data.frame(Petal.Width = seq(0, 3, 0.5)), type = "prob")
#' @export
predict.OneR <- function(object, newdata, type = c("class", "prob"), ...) {
  type <- match.arg(type)
  if (is.list(newdata) == FALSE) stop("newdata must be a data frame")
  if (all(names(newdata) != object$feature)) stop("cannot find feature column in newdata")
  model <- object
  data <- newdata
  index <- which(names(data) == model$feature)[1]
  if (is.numeric(data[ , index])) {
    levels <- names(model$rules)
    if (substring(levels[1], 1, 1) == "(" & grepl(",", levels[1]) == TRUE & substring(levels[1], nchar(levels[1]), nchar(levels[1])) == "]") {
      features <- as.character(cut(data[ , index], breaks = c(-Inf, get_breaks(levels), Inf)))
    } else features <- as.character(data[ , index])
  } else features <- as.character(data[ , index])
  features[is.na(features)] <- "NA"
  if (type == "prob") {
    probs <- prop.table(model$cont_table, margin = 2)
    probrules <- lapply(names(model$rules), function(x) probs[ , x])
    names(probrules) <- names(model$rules)
    M <- t(sapply(features, function(x) if (is.null(probrules[[x]]) == TRUE) rep(NA, dim(model$cont_table)[1]) else probrules[[x]]))
    colnames(M) <- rownames(model$cont_table)
    return(M)
  }
  factor(sapply(features, function(x) if (is.null(model$rules[[x]]) == TRUE) "UNSEEN" else model$rules[[x]]))
}

#' Summarize OneR models
#'
#' \code{summary} method for class \code{OneR}.
#' @param object object of class \code{"OneR"}.
#' @param ... further arguments passed to or from other methods.
#' @details Prints the rules of the OneR model, the accuracy, a contingency table of the feature attribute and the target and performs a chi-squared test on this table.
#'
#' In the contingency table the maximum values in each column are highlighted by adding a '*', thereby representing the rules of the OneR model.
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @seealso \code{\link{OneR}}
#' @keywords diagnostics
#' @examples
#' model <- OneR(iris)
#' summary(model)
#' @importFrom stats addmargins
#' @importFrom stats chisq.test
#' @export
summary.OneR <- function(object, ...)
{
  model <- object
  print(model)
  tbl <- model$cont_table
  pos <- cbind(apply(tbl, 2, which.max), 1:dim(tbl)[2])
  tbl <- addmargins(tbl)
  tbl[pos] <- paste("*", tbl[pos])
  cat("Contingency table:\n")
  print(tbl, quote = FALSE, right = TRUE)
  cat("---\nMaximum in each column: '*'\n")
  # chi-squared test
  digits <- getOption("digits")
  x <- suppressWarnings(chisq.test(model$cont_table))
  cat("\nPearson's Chi-squared test:\n")
  out <- character()
  if (!is.null(x$statistic))
    out <- c(out, paste(names(x$statistic), "=", format(signif(x$statistic, max(1L, digits - 2L)))))
  if (!is.null(x$parameter))
    out <- c(out, paste(names(x$parameter), "=", format(signif(x$parameter, max(1L, digits - 2L)))))
  if (!is.null(x$p.value)) {
    fp <- format.pval(x$p.value, digits = max(1L, digits - 3L))
    out <- c(out, paste("p-value", if (substr(fp, 1L, 1L) == "<") fp else paste("=", fp)))
  }
  cat(strwrap(paste(out, collapse = ", ")), sep = "\n")
  cat("\n")
}

#' Print OneR models
#'
#' \code{print} method for class \code{OneR}.
#' @param x object of class \code{"OneR"}.
#' @param ... further arguments passed to or from other methods.
#' @details Prints the rules and the accuracy of an OneR model.
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @seealso \code{\link{OneR}}
#' @examples
#' model <- OneR(iris)
#' print(model)
#' @export
print.OneR <- function(x, ...) {
  model <- x
  cat("\nCall:\n")
  print(model$call)
  cat("\nRules:\n")
  longest <- max(nchar(names(model$rules)))
  for (iter in 1:length(model$rules)) {
    len <- longest - nchar(names(model$rules[iter]))
    cat("If ", model$feature, " = ", names(model$rules[iter]), rep(" ", len), " then ", model$target, " = ", model$rules[[iter]], "\n", sep = "")
  }
  cat("\nAccuracy:\n")
  cat(model$correct_instances, " of ", model$total_instances, " instances classified correctly (", round(100 * model$correct_instances / model$total_instances, 2), "%)\n\n", sep = "")
}

#' Plot Diagnostics for an OneR object
#'
#' Plots a mosaic plot for the feature attribute and the target of the OneR model.
#' @param x object of class \code{"OneR"}.
#' @param ... further arguments passed to or from other methods.
#' @details If more than 20 levels are present for either the feature attribute or the target the function stops with an error.
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @seealso \code{\link{OneR}}
#' @keywords diagnostics
#' @examples
#' model <- OneR(iris)
#' plot(model)
#' @importFrom graphics mosaicplot
#' @export
plot.OneR <- function(x, ...) {
  model <- x
  if (any(dim(model$cont_table) > 20)) stop("cannot plot more than 20 levels")
  mosaicplot(t(model$cont_table), color = TRUE, main = "OneR model diagnostic plot")
}

#' Test OneR model objects
#'
#' Test if object is a OneR model.
#' @param x object to be tested.
#' @return a logical whether object is of class "OneR".
#' @keywords OneR model
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @examples
#' model <- OneR(iris)
#' is.OneR(model) # evaluates to TRUE
#' @export
is.OneR <- function(x) inherits(x, "OneR")

#' Classification Evaluation function
#'
#' Function for evaluating a OneR classification model. Prints confusion matrices with prediction vs. actual in absolute and relative numbers. Additionally it gives the accuracy, error rate as well as the error rate reduction versus the base rate accuracy together with a p-value.
#' @param prediction vector which contains the predicted values.
#' @param actual data frame which contains the actual data. When there is more than one column the last column is taken. A single vector is allowed too.
#' @param dimnames character vector of printed dimnames for the confusion matrices.
#' @param zero.print character specifying how zeros should be printed; for sparse confusion matrices, using "." can produce more readable results.
#' @details Error rate reduction versus the base rate accuracy is calculated by the following formula:\cr\cr
#' \eqn{(Accuracy(Prediction) - Accuracy(Baserate)) / (1 - Accuracy(Baserate))},\cr\cr
#' giving a number between 0 (no error reduction) and 1 (no error).\cr\cr
#' In some borderline cases, when the model is performing worse than the base rate, negative numbers can result. This shows that something is seriously wrong with the model generating this prediction.\cr\cr
#' The provided p-value gives the probability of obtaining a distribution of predictions like this (or an even more unambiguous one) under the assumption that the real accuracy is equal to or lower than the base rate accuracy.
#' More technically it is derived from a one-sided binomial test with the alternative hypothesis that the prediction's accuracy is bigger than the base rate accuracy.
#' Loosely speaking, a low p-value (< 0.05) signifies that the model really is able to give predictions that are better than the base rate.
#' @return Invisibly returns a list with the number of correctly classified and total instances and a confusion matrix with the absolute numbers.
#' @author Holger von Jouanne-Diedrich
#' @references \url{https://github.com/vonjd/OneR}
#' @keywords evaluation accuracy
#' @examples
#' data <- iris
#' model <- OneR(data)
#' summary(model)
#' prediction <- predict(model, data)
#' eval_model(prediction, data)
#' @importFrom stats addmargins
#' @importFrom stats binom.test
#' @export
eval_model <- function (prediction, actual, dimnames = c("Prediction", "Actual"), zero.print = "0") {
  if (any(is.na(prediction))) stop("prediction contains missing values")
  prediction <- factor(prediction)
  if (!is.list(actual)) actual <- data.frame(actual)
  actual <- actual[ , ncol(actual)]
  actual <- factor(actual)
  if (any(is.na(actual))) actual <- ADDNA(actual)
  # make sure that all levels are included in the same format and order in each set
  all_levels <- sort(unique(c(levels(prediction), levels(actual))))
  prediction <- factor(prediction, levels = all_levels, labels = all_levels)
  actual <- factor(actual, levels = all_levels, labels = all_levels)
  if (length(prediction) != length(actual)) stop("prediction and actual must have the same length")
  # create and print confusion matrices
  conf <- table(prediction, actual, dnn = dimnames)
  conf.m <- addmargins(conf)
  cat("\nConfusion matrix (absolute):\n")
  print(conf.m, zero.print = zero.print)
  conf.p <- prop.table(conf)
  conf.pm <- addmargins(conf.p)
  cat("\nConfusion matrix (relative):\n")
  print(round(conf.pm, 2), zero.print = zero.print)
  # calculate and print performance measures
  N <- sum(conf)
  correct_class <- sum(diag(conf))
  acc <- correct_class / N
  cat("\nAccuracy:\n", round(acc, 4), " (", correct_class, "/", N, ")\n", sep = "")
  error.rt <- 1 - acc
  cat("\nError rate:\n", round(error.rt, 4), " (", N - correct_class, "/", N, ")\n", sep = "")
  base.rt <- max(conf.pm[nrow(conf.pm), 1:(ncol(conf.pm) - 1)])
  errordown.p <- (acc - base.rt) / (1 - base.rt)
  # binomial test
  digits <- getOption("digits")
  out <- character()
  x <- binom.test(correct_class, N, p = base.rt, alternative = "greater")
  if (!is.null(x$p.value)) {
    fp <- format.pval(x$p.value, digits = max(1L, digits - 3L))
    out <- c(out, paste("p-value", if (substr(fp, 1L, 1L) == "<") fp else paste("=", fp)))
  }
  cat("\nError rate reduction (vs. base rate):\n", round(errordown.p, 4), " (", out, ")\n\n", sep = "")
  # return list invisibly
  invisible(list(correct_instances = correct_class, total_instances = N, conf_matrix = conf))
}
--------------------------------------------------------------------------------
/vignettes/OneR.html:
--------------------------------------------------------------------------------

# OneR - Establishing a New Baseline for Machine Learning Classification Models

*An R package by Holger K. von Jouanne-Diedrich*

2017-05-05

Note: You can find a step-by-step introduction on YouTube: [Quick Start Guide for the OneR package](https://www.youtube.com/watch?v=AGC0oRlXxgU)

## Introduction

The following story is one of the most often told in the Data Science community: some time ago the military built a system whose aim it was to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand and normal cars, pickups and trucks on the other. After having reached a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they re-engineered the black box (no small feat in itself) and found that most of the military pics were taken at dusk or dawn and most civilian pics under brighter weather conditions. The neural net had learned the difference between light and dark!

Although this might be an urban legend, the fact that it is so often told tells us something:

1. Many of our Machine Learning models are so complex that we cannot understand them ourselves.
2. Because of 1. we cannot differentiate between the simpler aspects of a problem, which can be tackled by simple models, and the more sophisticated ones, which need specialized treatment.

The above is not only true for neural networks (and especially deep neural networks) but for most of the methods used today, especially Support Vector Machines, Random Forests and in general all kinds of ensemble based methods.

In a word: we need a good baseline that builds "the best simple model", i.e. one that strikes a balance between the best accuracy possible and a model that is still simple enough to understand. I have developed the OneR package for finding this sweet spot and thereby establishing a new baseline for classification models in Machine Learning (ML).

This package fills a longstanding gap because so far only a Java-based implementation was available (the RWeka package as an interface for the OneR Java class). Additionally several enhancements have been made (see below).

## Design principles for the OneR package

The following design principles were followed in programming the package:

- Easy: the learning curve for new users should be minimal. Results should be obtained with ease, and only minimal preprocessing and modeling steps should be necessary.
- Versatile: all types of data, i.e. categorical and numeric, should be computable, both as input variables and as the target.
- Fast: the running times of model trainings should be short.
- Accurate: the accuracy of trained models should be good overall.
- Robust: models should not be prone to overfitting; the accuracy reached on training data should be comparable to the accuracy of predictions on new, unseen cases.
- Comprehensible: it should be easy to understand which rules the model has learned. The rules should not only be easy to grasp but also serve as heuristics that are usable even without a computer.
- Reproducible: because the algorithms used are strictly deterministic, one will always get the same model on the same data. Many ML algorithms have stochastic components, so the data scientist gets a different model every time.
- Intuitive: model diagnostics should be presented in the form of simple tables and plots.
- Native R: the whole package is written in native R code. This makes the source code easy to check and keeps the package very lean. Additionally, the package has no dependencies other than base R itself.

As the name might reveal, the package is based on the One Rule classification algorithm [Holte93]. Although the underlying method is simple enough (basically 1-level decision trees; you can find out more here: [OneR](http://www.saedsayad.com/oner.htm)), several enhancements have been made:

- Discretization of numeric data: the OneR algorithm can only handle categorical data, so numeric data has to be discretized. The original OneR algorithm separates the respective values into ever smaller buckets until the best possible accuracy is reached. It can be argued that this is the very definition of overfitting and contradicts the original spirit of OneR, because it produces tons of rules (one for every bucket). One could of course introduce a new parameter "maximum bucket size", but finding the right value for it doesn't come naturally either. Therefore I take a radically different approach: the package offers several methods for handling numeric data (in the bin and the optbin functions); the most promising one is the (default) "logreg" method of the optbin function, which gives only as many bins as there are target categories and optimizes the cut points via pairwise logistic regressions (see the short check after this list).
- Missing values: in the original algorithm missing values were always handled as a separate level of the respective attribute. While missing values can sometimes reveal interesting patterns, in other cases they are, well, just values that are missing. In the OneR package missing values can be handled as separate levels (level "NA") or they can be omitted (the default).
- Tie breaking: sometimes the OneR algorithm finds several attributes whose rules all give the same best accuracy. The original algorithm just took the first attribute. While this is also the default in the OneR function, a different tie-breaking method can be chosen: the contingency tables of all "best" rules are tested against each other with a Pearson's chi-squared test and the one with the smallest p-value is chosen. The rationale behind this is that the attribute with the best signal-to-noise ratio is thereby found.
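
As a quick check of the "logreg" behaviour described above, the binned feature indeed ends up with exactly one level per target class; a small sketch:

```r
data_opt <- optbin(iris)         # default method "logreg"
nlevels(data_opt$Petal.Width)    # 3 bins ...
## [1] 3
nlevels(iris$Species)            # ... one per target class
## [1] 3
```
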
## Getting started with a simple example

You can also watch this video, which goes through the following example step by step:

[Quick Start Guide for the OneR package (Video)](https://www.youtube.com/watch?v=AGC0oRlXxgU)

After installing from CRAN, load the package

```r
library(OneR)
```

Use the famous Iris dataset and determine optimal bins for numeric data

```r
data <- optbin(iris)
```

Build model with best predictor

```r
model <- OneR(data, verbose = TRUE)
## 
##     Attribute    Accuracy
## 1 * Petal.Width  96%
## 2   Petal.Length 95.33%
## 3   Sepal.Length 74.67%
## 4   Sepal.Width  55.33%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
```

Show learned rules and model diagnostics

```r
summary(model)
## 
## Call:
## OneR.data.frame(x = data, verbose = TRUE)
## 
## Rules:
## If Petal.Width = (0.0976,0.791] then Species = setosa
## If Petal.Width = (0.791,1.63]   then Species = versicolor
## If Petal.Width = (1.63,2.5]     then Species = virginica
## 
## Accuracy:
## 144 of 150 instances classified correctly (96%)
## 
## Contingency table:
##             Petal.Width
## Species      (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
##   setosa               * 50            0          0  50
##   versicolor              0         * 48          2  50
##   virginica               0            4       * 46  50
##   Sum                    50           52         48 150
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 266.35, df = 4, p-value < 2.2e-16
```

Plot model diagnostics

```r
plot(model)
```

[mosaic plot: "OneR model diagnostic plot", Petal.Width vs. Species]

Use model to predict data

```r
prediction <- predict(model, data)
```

Evaluate prediction statistics

```r
eval_model(prediction, data)
## 
## Confusion matrix (absolute):
##             Actual
## Prediction   setosa versicolor virginica Sum
##   setosa         50          0         0  50
##   versicolor      0         48         4  52
##   virginica       0          2        46  48
##   Sum            50         50        50 150
## 
## Confusion matrix (relative):
##             Actual
## Prediction   setosa versicolor virginica  Sum
##   setosa       0.33       0.00      0.00 0.33
##   versicolor   0.00       0.32      0.03 0.35
##   virginica    0.00       0.01      0.31 0.32
##   Sum          0.33       0.33      0.33 1.00
## 
## Accuracy:
## 0.96 (144/150)
## 
## Error rate:
## 0.04 (6/150)
## 
## Error rate reduction (vs. base rate):
## 0.94 (p-value < 2.2e-16)
```

Please note that the very good accuracy of 96% is reached effortlessly.

"Petal.Width" is identified as the attribute with the highest predictive value. The cut points of the intervals are found automatically (via the included optbin function). The result is three very simple, yet accurate, rules to predict the respective species.

The nearly perfect separation of the areas in the diagnostic plot gives a good indication of the model's ability to separate the different species.
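
One more detail worth knowing: because the learned rules store the interval boundaries, predict can also be applied to the raw, un-binned data; the numeric feature is then cut at those boundaries internally (see predict.OneR in the package source). A small sketch:

```r
prediction_raw <- predict(model, iris)   # raw iris instead of the binned data
eval_model(prediction_raw, iris)         # evaluates exactly as above
```
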

## A more sophisticated real-world example

The next example tries to find a model for the identification of breast cancer. The data were obtained from the UCI machine learning repository (see also the package documentation). According to this source the best out-of-sample performance was 95.9%, so let's see what we can achieve with the OneR package...

```r
data(breastcancer)
data <- breastcancer
```

Divide into training (80%) and test set (20%)

```r
set.seed(12) # for reproducibility
random <- sample(1:nrow(data), 0.8 * nrow(data))
data_train <- optbin(data[random, ], method = "infogain")
## Warning in optbin.data.frame(data[random, ], method = "infogain"): 12
## instance(s) removed due to missing values
data_test <- data[-random, ]
```

Train OneR model on training set

```r
model_train <- OneR(data_train, verbose = TRUE)
## 
##     Attribute                   Accuracy
## 1 * Uniformity of Cell Size     92.32%
## 2   Uniformity of Cell Shape    91.59%
## 3   Bare Nuclei                 90.68%
## 4   Bland Chromatin             90.31%
## 5   Normal Nucleoli             90.13%
## 6   Single Epithelial Cell Size 89.4%
## 7   Marginal Adhesion           85.92%
## 8   Clump Thickness             84.28%
## 9   Mitoses                     78.24%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
```

Show model and diagnostics

```r
summary(model_train)
## 
## Call:
## OneR.data.frame(x = data_train, verbose = TRUE)
## 
## Rules:
## If Uniformity of Cell Size = (0.991,2] then Class = benign
## If Uniformity of Cell Size = (2,10]    then Class = malignant
## 
## Accuracy:
## 505 of 547 instances classified correctly (92.32%)
## 
## Contingency table:
##            Uniformity of Cell Size
## Class       (0.991,2] (2,10] Sum
##   benign        * 318     30 348
##   malignant        12  * 187 199
##   Sum             330    217 547
## ---
## Maximum in each column: '*'
## 
## Pearson's Chi-squared test:
## X-squared = 381.78, df = 1, p-value < 2.2e-16
```

Plot model diagnostics

```r
plot(model_train)
```

[mosaic plot: "OneR model diagnostic plot", Uniformity of Cell Size vs. Class]

Use trained model to predict test set

```r
prediction <- predict(model_train, data_test)
```

Evaluate model performance on test set

```r
eval_model(prediction, data_test)
## 
## Confusion matrix (absolute):
##            Actual
## Prediction  benign malignant Sum
##   benign        92         0  92
##   malignant      8        40  48
##   Sum          100        40 140
## 
## Confusion matrix (relative):
##            Actual
## Prediction  benign malignant  Sum
##   benign      0.66      0.00 0.66
##   malignant   0.06      0.29 0.34
##   Sum         0.71      0.29 1.00
## 
## Accuracy:
## 0.9429 (132/140)
## 
## Error rate:
## 0.0571 (8/140)
## 
## Error rate reduction (vs. base rate):
## 0.8 (p-value = 7.993e-12)
```

The best reported out-of-sample accuracy on this dataset was 95.9%, and it was reached with considerable effort. The accuracy reached on the test set here is 94.3%! This is achieved with just one simple rule: when "Uniformity of Cell Size" is bigger than 2, the examined tissue is malignant. The cut points of the intervals are again found automatically (via the included optbin function). The very good separation of the areas in the diagnostic plot gives a good indication of the model's ability to differentiate between benign and malignant tissue. Additionally, when you look at the distribution of misclassifications, not a single malignant instance is missed, which is obviously very desirable in a clinical context.

## Included functions

### OneR

OneR is the main function of the package. It builds a model according to the One Rule machine learning algorithm for categorical data. All numerical data is automatically converted into five categorical bins of equal length. When `verbose = TRUE` it prints the predictive accuracy of all attributes in decreasing order.
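
Besides the data-frame interface, where the target must be in the last column, OneR can also be called with a formula, analogous to the formula examples for optbin in the package documentation; a brief sketch:

```r
model <- OneR(Species ~ ., data = optbin(iris))                # all predictors
model_pw <- OneR(Species ~ Petal.Width, data = optbin(iris))   # just one predictor
```
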

### bin

bin discretizes all numerical data in a data frame into categorical bins of equal length or equal content, or based on automatically determined clusters.

Examples

```r
data <- iris
str(data)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(bin(data))
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: Factor w/ 5 levels "(4.3,5.02]","(5.02,5.74]",..: 2 1 1 1 1 2 1 1 1 1 ...
##  $ Sepal.Width : Factor w/ 5 levels "(2,2.48]","(2.48,2.96]",..: 4 3 3 3 4 4 3 3 2 3 ...
##  $ Petal.Length: Factor w/ 5 levels "(0.994,2.18]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Petal.Width : Factor w/ 5 levels "(0.0976,0.58]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(bin(data, nbins = 3))
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: Factor w/ 3 levels "(4.3,5.5]","(5.5,6.7]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sepal.Width : Factor w/ 3 levels "(2,2.8]","(2.8,3.6]",..: 2 2 2 2 2 3 2 2 2 2 ...
##  $ Petal.Length: Factor w/ 3 levels "(0.994,2.97]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Petal.Width : Factor w/ 3 levels "(0.0976,0.9]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(bin(data, nbins = 3, labels = c("small", "medium", "large")))
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sepal.Width : Factor w/ 3 levels "small","medium",..: 2 2 2 2 2 3 2 2 2 2 ...
##  $ Petal.Length: Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Petal.Width : Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

Difference between methods "length" and "content"

```r
set.seed(1); table(bin(rnorm(900), nbins = 3))
## 
## (-3.01,-0.735]  (-0.735,1.54]    (1.54,3.82] 
##            212            623             65
set.seed(1); table(bin(rnorm(900), nbins = 3, method = "content"))
## 
## (-3.01,-0.423] (-0.423,0.444]   (0.444,3.82] 
##            300            300            300
```

Method "clusters"

```r
intervals <- paste(levels(bin(faithful$waiting, nbins = 2, method = "cluster")), collapse = " ")
hist(faithful$waiting, main = paste("Intervals:", intervals))
abline(v = c(42.9, 67.5, 96.1), col = "blue")
```

[histogram of faithful$waiting with the two cluster intervals marked in blue]

Handling of missing values

```r
bin(c(1:10, NA), nbins = 2, na.omit = FALSE) # adds new level "NA"
##  [1] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5]
##  [6] (5.5,10]    (5.5,10]    (5.5,10]    (5.5,10]    (5.5,10]   
## [11] NA         
## Levels: (0.991,5.5] (5.5,10] NA
bin(c(1:10, NA), nbins = 2)
## Warning in bin(c(1:10, NA), nbins = 2): 1 instance(s) removed due to
## missing values
##  [1] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5]
##  [6] (5.5,10]    (5.5,10]    (5.5,10]    (5.5,10]    (5.5,10]   
## Levels: (0.991,5.5] (5.5,10]
```

### optbin

optbin discretizes all numerical data in a data frame into categorical bins where the cut points are optimally aligned with the target categories; the binned columns are returned as factors. When building a OneR model this could result in fewer rules with enhanced accuracy. The cut points are calculated by pairwise logistic regressions (method "logreg") or as the means of the expected values of the respective classes ("naive"). The function is likely to give unsatisfactory results when the distributions of the respective classes are not (linearly) separable. Method "naive" should only be used when distributions are (approximately) normal, although in this case "logreg" should give comparable results, which makes "logreg" the preferable (and therefore default) method.

Method "infogain" is an entropy based method which calculates cut points based on information gain. The idea is that uncertainty is minimized by making the resulting bins as pure as possible. This method is the standard method of many decision tree algorithms.
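
All three methods are selected via the method argument; the method names below follow the package source and can be abbreviated:

```r
str(optbin(iris, method = "logreg"))     # the default
str(optbin(iris, method = "naive"))
str(optbin(iris, method = "infogain"))
```
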
### maxlevels

maxlevels removes all columns of a data frame where a factor (or character string) has more than a maximum number of levels. Often categories that have very many levels are not useful in modelling OneR rules because they result in too many rules and tend to overfit. Examples are IDs or names.

```r
df <- data.frame(numeric = c(1:26), alphabet = letters)
str(df)
## 'data.frame':    26 obs. of  2 variables:
##  $ numeric : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ alphabet: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
str(maxlevels(df))
## 'data.frame':    26 obs. of  1 variable:
##  $ numeric: int  1 2 3 4 5 6 7 8 9 10 ...
```
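
The threshold defaults to 20 levels and can be raised or lowered via the maxlevels argument (as in the function signature in the package source):

```r
str(maxlevels(df, maxlevels = 26))   # with 26 allowed levels both columns are kept
```
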
### predict

predict is an S3 method for predicting cases or probabilities based on OneR model objects. The second argument "newdata" can have the same format as used for building the model but must at least contain the feature variable that is used in the OneR rules. The default output is a factor with the predicted classes; cases that were not present when building the model are predicted as "UNSEEN".

```r
model <- OneR(iris)
predict(model, data.frame(Petal.Width = seq(0, 3, 0.5)))
## (-Inf,0.0976] (0.0976,0.58]   (0.58,1.06]   (1.06,1.54]   (1.54,2.02] 
##        UNSEEN        setosa    versicolor    versicolor     virginica 
##    (2.02,2.5]    (2.5, Inf] 
##     virginica        UNSEEN 
## Levels: UNSEEN setosa versicolor virginica
```

If `type = "prob"` a matrix is returned whose columns are the probabilities of the first, second, etc. class.

```r
predict(model, data.frame(Petal.Width = seq(0, 3, 0.5)), type = "prob")
##               setosa versicolor  virginica
## (-Inf,0.0976]     NA         NA         NA
## (0.0976,0.58]  1.000  0.0000000 0.00000000
## (0.58,1.06]    0.125  0.8750000 0.00000000
## (1.06,1.54]    0.000  0.9268293 0.07317073
## (1.54,2.02]    0.000  0.1724138 0.82758621
## (2.02,2.5]     0.000  0.0000000 1.00000000
## (2.5, Inf]        NA         NA         NA
```

### eval_model

eval_model is a simple function for evaluating a OneR classification model. It prints confusion matrices with prediction vs. actual in absolute and relative numbers. Additionally it gives the accuracy and error rate, as well as the error rate reduction versus the base rate accuracy, together with a p-value. The second argument "actual" is a data frame which contains the actual data in the last column. A single vector is allowed too.
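
eval_model also invisibly returns its results, so the printed numbers can be reused programmatically. A small sketch, using the breast cancer example from above; the element names follow the eval_model source in R/OneR.R:

```r
prediction <- predict(model_train, data_test)    # breast cancer example from above
res <- eval_model(prediction, data_test)
res$correct_instances / res$total_instances      # accuracy: 132/140 = 0.9429
# error rate reduction vs. the base rate of 100/140 = 0.7143 (all benign):
# (0.9429 - 0.7143) / (1 - 0.7143) = 0.8
```
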

For the details please consult the available help entries.

## Help overview

From within R:

```r
help(package = OneR)
```

...or as a pdf here: [OneR.pdf](https://cran.r-project.org/package=OneR/OneR.pdf)

Issues can be posted here: https://github.com/vonjd/OneR/issues

The latest version of the package (and the full source code) can be found here: https://github.com/vonjd/OneR

## Sources

[Holte93] R. Holte: Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, 1993. Available online here: https://link.springer.com/article/10.1023/A:1022631118932

## Contact

I would love to hear about your experiences with the OneR package. Please drop me a note - you can reach me at my university account: [Holger K. von Jouanne-Diedrich](https://www.h-ab.de/nc/eng/about-aschaffenburg-university-of-applied-sciences/organisation/personal/?tx_fhapersonal_pi1%5BshowUid%5D=jouanne-diedrich)

## License

This package is under [MIT License](https://cran.r-project.org/package=OneR/LICENSE).
--------------------------------------------------------------------------------