├── .gitignore ├── LICENSE ├── README.md ├── preprocess ├── ami_preprocess_and_split.R ├── ed_preprocess_and_split.R ├── ip_preprocess_and_split.R ├── preprocess_and_split.R ├── reduce_columns.R └── roc.R ├── rf ├── rf2.py ├── rf3.py ├── tensor_forest.py └── tensor_forest_test.py └── tf ├── __init__.py ├── mnist_sda.py ├── sdautoencoder.py ├── softmax.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled Lua sources 2 | luac.out 3 | 4 | # luarocks build files 5 | *.src.rock 6 | *.zip 7 | *.tar.gz 8 | 9 | # Object files 10 | *.o 11 | *.os 12 | *.ko 13 | *.obj 14 | *.elf 15 | 16 | # Precompiled Headers 17 | *.gch 18 | *.pch 19 | 20 | # Libraries 21 | *.lib 22 | *.a 23 | *.la 24 | *.lo 25 | *.def 26 | *.exp 27 | 28 | # Shared objects (inc. Windows DLLs) 29 | *.dll 30 | *.so 31 | *.so.* 32 | *.dylib 33 | 34 | # Executables 35 | *.exe 36 | *.out 37 | *.app 38 | *.i*86 39 | *.x86_64 40 | *.hex 41 | 42 | # Exclude Data 43 | .RData 44 | .Rhistory 45 | *.Rout 46 | data 47 | logs 48 | 49 | # Exclude old stuff 50 | old 51 | misc 52 | 53 | # Exclude training code 54 | training 55 | 56 | # Exclude MNIST stuff 57 | MNIST_data 58 | run_data 59 | 60 | # Exclude PyCharm 61 | .idea 62 | 63 | # Exclude Python stuff 64 | __pycache__ 65 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Ken Chen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # deep-learning 2 | Deep learning project in TensorFlow and Torch to analyze clinical health records and construct deep learning models to predict future patient complications. 3 | 4 | ## Background 5 | This project uses **Stacked Denoising Autoencoders (SDA)** [[P. Vincent]](http://jmlr.csail.mit.edu/papers/volume11/vincent10a/vincent10a.pdf) to perform feature learning on a given dataset. Two overall steps are necessary for fully configuring the network to encode the input data: **pre-training**, and **fine-tuning**. 
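Concretely, the pre-training step (described in the next paragraph) trains each layer as a denoising autoencoder in the sense of the referenced paper. As a rough sketch, with illustrative notation: an input $x$ is corrupted to $\tilde{x}$, encoded to a hidden representation $h$, decoded back to a reconstruction $\hat{x}$, and the layer's parameters are chosen to make $\hat{x}$ close to the clean $x$:

```latex
\tilde{x} \sim q_{\text{noise}}(\tilde{x} \mid x), \qquad
h = s(W\tilde{x} + b), \qquad
\hat{x} = s(W'h + b'), \qquad
\theta^{*} = \arg\min_{\theta = \{W,\, b,\, W',\, b'\}} L(x, \hat{x})
```

Here $s$ is the layer's activation (e.g. sigmoid or tanh) and $L$ is the reconstruction loss (RMSE or cross-entropy in this implementation).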
6 | 7 | During unsupervised pre-training, parameters in the neural network are learned greedily, layer by layer, by minimizing the reconstruction loss between each input and its decoded counterpart. A supervised softmax classifier on top of the network then fine-tunes all parameters of the network (the weights and biases of each autoencoder layer plus the softmax weights and biases). 8 | 9 | Once the network is configured, the input data can be read into the model and encoded into a new representation determined by the user's chosen parameters (layer dimensions, activations, noise level, etc.). For example, this technique can transform a sparse 30000-dimensional feature space into a dense 400-dimensional feature space to improve subsequent training performance. 10 | 11 | ## Usage 12 | The current working source code is located in `tf/sdautoencoder.py`. It currently reads train/test data from CSV files in batches. The following three datasets must be present for the SDA to output newly learned features: 13 | - X training values 14 | - Y training values 15 | - X testing values 16 | 17 | An additional dataset is needed if the output of SDA encoding is directly used for classification via the provided softmax classifier: 18 | - Y testing values 19 | 20 | 21 | In the future, a version of the program will be optimized to run on a multi-GPU (4-GPU) system. 22 | 23 | ```python 24 | # Start a TensorFlow session 25 | sess = tf.Session() 26 | 27 | # Initialize an unconfigured autoencoder with specified dimensions, etc. 28 | sda = SDAutoencoder(dims=[784, 256, 64, 32], 29 | activations=["sigmoid", "tanh", "sigmoid"], 30 | sess=sess, 31 | noise=0.1, 32 | loss="rmse") 33 | 34 | # Pretrain weights and biases of each layer in the network. 35 | sda.pretrain_network(X_TRAIN_PATH) 36 | 37 | # Fine-tune all parameters with the softmax classifier on the training labels. 38 | sda.finetune_parameters(X_TRAIN_PATH, Y_TRAIN_PATH, output_dim=10) 39 | 40 | # Write the newly learned feature representations to file. 41 | sda.write_encoded_input("../data/transformed.csv", X_TEST_PATH) 42 | ``` 43 | 44 | For an example of how training is performed and subsequent accuracy is evaluated, a basic procedure is implemented on the MNIST data set in `tf/mnist_sda.py`. 45 | 46 | ## Performance 47 | Testing on the MNIST data set, the softmax classifier trained on features extracted by the SDA achieves approximately **98.3%** accuracy in identifying the digits. To achieve this result, the model in `tf/mnist_sda.py` is set up with the following parameters (not necessarily optimal), using 500000 data points for layer-wise pretraining and 3000000 data points for fine-tuning: 48 | 49 | ```python 50 | sda = SDAutoencoder(dims=[784, 400, 200, 80], 51 | activations=["sigmoid", "sigmoid", "sigmoid"], 52 | sess=sess, 53 | noise=0.20, 54 | loss="cross-entropy", 55 | pretrain_lr=0.0001, 56 | finetune_lr=0.0001) 57 | ``` 58 | Total execution time for feature learning, training, and evaluation was just under 9 minutes on a 1.3 GHz MacBook Air (under a minute on a GPU machine with one GTX 1080). This result improves upon the 92% benchmark achieved by a [simple softmax classifier](https://www.tensorflow.org/versions/r0.9/tutorials/mnist/beginners/index.html#mnist-for-ml-beginners) without feature learning. It is also comparable to some simple 2D convolutional network models, which are designed to take advantage of the 2D structure of image data.
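For reference, here is a minimal end-to-end sketch of the MNIST experiment that combines the configuration above with the usage pattern from the previous section. The import path, CSV file paths, and output filename are illustrative placeholders, not the exact values used in the repository; the actual procedure, including MNIST loading and accuracy evaluation, lives in `tf/mnist_sda.py`.

```python
import tensorflow as tf

# Assumes the SDAutoencoder class from tf/sdautoencoder.py is importable.
from sdautoencoder import SDAutoencoder

# Placeholder CSV paths; substitute the real exported MNIST files.
X_TRAIN_PATH = "../data/mnist_train_x.csv"
Y_TRAIN_PATH = "../data/mnist_train_y.csv"
X_TEST_PATH = "../data/mnist_test_x.csv"

sess = tf.Session()

# Same hyperparameters as reported above: 784 -> 400 -> 200 -> 80.
sda = SDAutoencoder(dims=[784, 400, 200, 80],
                    activations=["sigmoid", "sigmoid", "sigmoid"],
                    sess=sess,
                    noise=0.20,
                    loss="cross-entropy",
                    pretrain_lr=0.0001,
                    finetune_lr=0.0001)

# Greedy layer-wise pre-training on the (unlabeled) training inputs.
sda.pretrain_network(X_TRAIN_PATH)

# Supervised fine-tuning with the softmax classifier (10 digit classes).
sda.finetune_parameters(X_TRAIN_PATH, Y_TRAIN_PATH, output_dim=10)

# Write the 80-dimensional encodings of the test inputs to disk.
sda.write_encoded_input("../data/mnist_encoded_test_x.csv", X_TEST_PATH)
```

Scoring the encoded test features against the test labels then follows the evaluation procedure in `tf/mnist_sda.py`.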
59 | 60 | In the future, we plan to do additional testing to optimize the model's hyperparameters and to speed up various parts of the execution. 61 | 62 | ## Current status 63 | - (Done) SDA implemented in `tf/sdautoencoder.py` in TensorFlow. 64 | - (Done) Implement softmax classifier. 65 | - (To do) Implement command-line execution of the program. 66 | - (WIP) Testing for any silent bugs. 67 | - (To do) Enable multi-GPU support in the architecture. 68 | - (WIP) Add compatibility for other data-loading methods. 69 | - (To do) Add pre-processing methods in TF. 70 | - (WIP) More documentation. 71 | -------------------------------------------------------------------------------- /preprocess/ami_preprocess_and_split.R: -------------------------------------------------------------------------------- 1 | library(data.table) # Must have data.table v1.9.7+ 2 | library(readr) 3 | library(DMwR) 4 | #library(ROSE) 5 | 6 | # Usage (must be run from command line) 7 | # Rscript ami_preprocess_and_split.R <path to SAM csv> <base name for output files> 8 | # Program will print steps of execution and write 5 different files to disk: 9 | # - Train x (saved as base_name_train_x.csv) 10 | # - Train y (saved as base_name_train_y.csv as one-hot vectors) 11 | # - Test x (saved as base_name_test_x.csv) 12 | # - Test y (saved as base_name_test_y.csv as one-hot vectors) 13 | # - Test ids (saved as base_name_test_ids.csv) (patient ids in order of all the test cases) 14 | 15 | # Parse command line arguments 16 | args <- commandArgs(trailingOnly = TRUE) 17 | path_sam <- args[1] 18 | base_name <- args[2] 19 | 20 | # Read in raw file: the SAM table 21 | print(paste("Reading", path_sam)) 22 | Sam <- fread(path_sam, header = T) 23 | print("Done reading files.") 24 | 25 | # Reset headers of data tables to get rid of BOM in case it's there 26 | # http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r 27 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding = "UTF-8-BOM")) 28 | 29 | names(Sam) <- Sam.names 30 | print("Removed BOM from text") 31 | 32 | # Pre-processing functions 33 | is.zero <- function(v) { 34 | return(v==0) 35 | } 36 | 37 | unitScale <- function(v) { 38 | if (is.factor(v)) { 39 | return(v) 40 | } 41 | range <- max(v) - min(v) 42 | if (range == 0) { 43 | return(0) 44 | } 45 | return((v - min(v)) / range) 46 | } 47 | 48 | print(str(Sam)) 49 | 50 | # Test min value of Sam 51 | # Sam.maxs <- Sam[, lapply(.SD, max)] 52 | # print(str(Sam.maxs)) 53 | # print(sum(Sam.maxs==0)) 54 | 55 | # Subcohort for AMI: age 35+ includes 95%? of cases, 60%?
of data set 56 | Sam <- Sam[Age >= 35] 57 | print("Subcohort str") 58 | print(str(Sam)) 59 | 60 | # Change y values of IP/ED to 1/0 depending on return or not (binarize) 61 | Sam$AMI1Y_YTD <- ifelse(Sam$AMI1Y_YTD > 0, 1, 0) 62 | 63 | # Change all necessary columns to factors to prevent scaling and 64 | # to assure SMOTE works 65 | Sam$StatePatientID <- as.factor(Sam$StatePatientID) 66 | Sam$AMI1Y_YTD <- as.factor(Sam$AMI1Y_YTD) 67 | 68 | # Scale all columns of Sam 69 | print("Starting to scale table.") 70 | Sam <- Sam[, lapply(.SD, unitScale)] 71 | print("Completed scaling of columns.") 72 | 73 | # Split into train and test 2500 74 | print("Starting to split into train and test sets.") 75 | prop_in_train <- 0.90 76 | cases <- which(Sam$AMI1Y_YTD == 1) 77 | controls <- which(Sam$AMI1Y_YTD == 0) 78 | train_cases <- sample(cases, floor(length(cases) * prop_in_train)) 79 | train_controls <- sample(controls, floor(length(controls) * prop_in_train)) 80 | test_cases <- setdiff(cases, train_cases) 81 | test_controls <- setdiff(controls, train_controls) 82 | print("Total cases:") 83 | print(sum(Sam$AMI1Y_YTD == 1)) 84 | print(str(cases)) 85 | print(str(controls)) 86 | print(str(train_cases)) 87 | print(str(train_controls)) 88 | print(str(test_cases)) 89 | print(str(test_controls)) 90 | 91 | print(length(train_cases)) 92 | print(length(test_cases)) 93 | 94 | Sam.train <- Sam[c(train_cases, train_controls)] 95 | Sam.test <- Sam[c(test_cases, test_controls)] 96 | 97 | rm(Sam) 98 | print("Finished splitting into train and test sets.") 99 | 100 | # SMOTE algorithm for balancing training data by interpolated over/undersampling 101 | #Smote parameters 102 | print("Beginning to apply SMOTE algorithm.") 103 | percent_to_oversample <- 600 104 | percent_ratio_major_to_minor <- 200 105 | Sam.train <- SMOTE(AMI1Y_YTD ~ . -StatePatientID, data = Sam.train, 106 | perc.over = percent_to_oversample, perc.under = percent_ratio_major_to_minor) 107 | print("Finished applying SMOTE algorithm.") 108 | 109 | # ROSE algorithm for balancing training data by over/undersampling 110 | #print("Beginning to apply ROSE algorithm.") 111 | #result_sample_size <- 100000 112 | #rare_proportion <- 0.5 113 | # Sam.train.without_factors <- Sam.train[, !c("StatePatientID", "ED_YTM"), with = FALSE] 114 | # Sam.train.factors <- Sam.train[, c("StatePatientID", "ED_YTM"), with = FALSE] 115 | #Sam.train <- ovun.sample(AMI1Y_YTD ~ . 
-StatePatientID, data = Sam.train, 116 | # method = "both", N = result_sample_size, p = rare_proportion)$data 117 | #Sam.train <- data.table(Sam.train) 118 | #print("Finished applying ROSE algorithm.") 119 | 120 | # Shuffle train data to homogenize 0/1 y values 121 | print("Begin shuffle.") 122 | Sam.train <- Sam.train[sample(nrow(Sam.train)),] 123 | print("Finished shuffle.") 124 | 125 | # Split into train.x, train.y, test.x, test.y 126 | print("Begin split into x/y.") 127 | Sam.train.x <- Sam.train[, !c("StatePatientID", "AMI1Y_YTD"), with = FALSE] 128 | Sam.train.y <- Sam.train[, c("AMI1Y_YTD"), with = FALSE] 129 | rm(Sam.train) 130 | Sam.test.x <- Sam.test[, !c("StatePatientID", "AMI1Y_YTD"), with = FALSE] 131 | Sam.test.y <- Sam.test[, c("AMI1Y_YTD"), with = FALSE] 132 | Sam.test.ids <- Sam.test[, c("StatePatientID"), with = FALSE] 133 | rm(Sam.test) 134 | print("Finished split into x/y.") 135 | 136 | # Change y to one-hot 137 | Sam.train.y[, zero := ifelse(AMI1Y_YTD == 0, 1, 0)] 138 | Sam.train.y[, one := AMI1Y_YTD] 139 | Sam.train.y[, AMI1Y_YTD := NULL] 140 | Sam.test.y[, zero := ifelse(AMI1Y_YTD == 0, 1, 0)] 141 | Sam.test.y[, one := AMI1Y_YTD] 142 | Sam.test.y[, AMI1Y_YTD := NULL] 143 | 144 | # Write all splits to file 145 | print("Begin write to file.") 146 | base_name <- ifelse(is.na(base_name), "SAMFull", base_name) 147 | fwrite(Sam.train.x, paste0(base_name, "_train_x", ".csv"), col.names = FALSE) 148 | fwrite(Sam.train.y, paste0(base_name, "_train_y", ".csv"), col.names = FALSE) 149 | fwrite(Sam.test.x, paste0(base_name, "_test_x", ".csv"), col.names = FALSE) 150 | fwrite(Sam.test.y, paste0(base_name, "_test_y", ".csv"), col.names = FALSE) 151 | fwrite(Sam.test.ids, paste0(base_name, "_test_ids", ".csv")) 152 | print("Finished write to file.") 153 | 154 | # Remove all columns with all zero entries 155 | # Sam <- Sam[,which(unlist(lapply(Sam, function(x)!all(is.zero(x))))),with=F] 156 | # print(str(Sam)) 157 | -------------------------------------------------------------------------------- /preprocess/ed_preprocess_and_split.R: -------------------------------------------------------------------------------- 1 | library(data.table) # Must have data.table v1.9.7+ 2 | library(readr) 3 | library(DMwR) 4 | library(ROSE) 5 | 6 | # Usage (must be run from command line) 7 | # Rscript 8 | # Program will print steps of execution and write 5 different files to disk: 9 | # - Train x (saved as base_name_train_x.csv) 10 | # - Train y (saved as base_name_train_y.csv as one-hot vectors) 11 | # - Test x (saved as base_name_test_x.csv) 12 | # - Test y (saved as base_name_test_y.csv as one-hot vectors) 13 | # - Test ids (saved as base_name_test_ids.csv) (patient ids in order of all the test cases) 14 | 15 | # Parse command line arguments 16 | args <- commandArgs(trailingOnly = TRUE) 17 | path_sam <- args[1] 18 | path_train_ids <- args[2] 19 | path_test_ids <- args[3] 20 | base_name <- args[4] 21 | 22 | # Read in raw files: SAM table, train case ids, and test case ids 23 | print(paste("Reading", path_sam)) 24 | Sam <- fread(path_sam, header = T) 25 | print(paste("Reading", path_train_ids)) 26 | Train_ids <- fread(path_train_ids, header = T) 27 | print(paste("Reading", path_test_ids)) 28 | Test_ids <- fread(path_test_ids, header = T) 29 | 30 | print("Done reading files.") 31 | 32 | # Reset headers of data tables to get rid of BOM in case it's there 33 | # http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r 34 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding 
= "UTF-8-BOM")) 35 | Train_ids.names <- names(read.csv(path_train_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 36 | Test_ids.names <- names(read.csv(path_test_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 37 | 38 | names(Sam) <- Sam.names 39 | names(Train_ids) <- Train_ids.names 40 | names(Test_ids) <- Test_ids.names 41 | 42 | print("Removed BOM from text") 43 | 44 | # Pre-processing functions 45 | is.zero <- function(v) { 46 | return(v==0) 47 | } 48 | 49 | unitScale <- function(v) { 50 | if (is.factor(v)) { 51 | return(v) 52 | } 53 | range <- max(v) - min(v) 54 | if (range == 0) { 55 | return(0) 56 | } 57 | return((v - min(v)) / range) 58 | } 59 | 60 | print(str(Sam)) 61 | 62 | # Test min value of Sam 63 | # Sam.maxs <- Sam[, lapply(.SD, max)] 64 | # print(str(Sam.maxs)) 65 | # print(sum(Sam.maxs==0)) 66 | 67 | # Change y values of IP/ED to 1/0 depending on return or not (binarize) 68 | Sam$ED_YTM <- ifelse(Sam$ED_YTM > 0, 1, 0) 69 | Sam$IP_YTM <- ifelse(Sam$IP_YTM > 0, 1, 0) 70 | 71 | # Change all necessary columns to factors to prevent scaling and 72 | # to assure SMOTE works 73 | Sam$StatePatientID <- as.factor(Sam$StatePatientID) 74 | Sam$ED_YTM <- as.factor(Sam$ED_YTM) 75 | Sam$IP_YTM <- as.factor(Sam$IP_YTM) 76 | 77 | # Scale all columns of Sam 78 | print("Starting to scale table.") 79 | Sam <- Sam[, lapply(.SD, unitScale)] 80 | print("Completed scaling of columns.") 81 | 82 | # Split into train and test 83 | print("Starting to split into train and test sets.") 84 | Sam.train <- Sam[StatePatientID %in% Train_ids[[1]]] 85 | Sam.test <- Sam[StatePatientID %in% Test_ids[[1]]] 86 | rm(Sam) 87 | print("Finished splitting into train and test sets.") 88 | 89 | # SMOTE algorithm for balancing training data by interpolated over/undersampling 90 | # Smote parameters 91 | # print("Beginning to apply SMOTE algorithm.") 92 | # percent_to_oversample <- 500 93 | # percent_ratio_major_to_minor <- 100 94 | # Sam.train <- SMOTE(IP_YTM ~ . -StatePatientID -ED_YTM, data = Sam.train, 95 | # perc.over = percent_to_oversample, perc.under = percent_ratio_major_to_minor) 96 | # print("Finished applying SMOTE algorithm.") 97 | 98 | # ROSE algorithm for balancing training data by over/undersampling 99 | print("Beginning to apply ROSE algorithm.") 100 | result_sample_size <- 300000 101 | rare_proportion <- 0.5 102 | # Sam.train.without_factors <- Sam.train[, !c("StatePatientID", "ED_YTM"), with = FALSE] 103 | # Sam.train.factors <- Sam.train[, c("StatePatientID", "ED_YTM"), with = FALSE] 104 | Sam.train <- ovun.sample(ED_YTM ~ . 
-StatePatientID -IP_YTM, data = Sam.train, 105 | method = "both", N = result_sample_size, p = rare_proportion)$data 106 | Sam.train <- data.table(Sam.train) 107 | print("Finished applying ROSE algorithm.") 108 | 109 | # Shuffle train data to homogenize 0/1 y values 110 | print("Begin shuffle.") 111 | Sam.train <- Sam.train[sample(nrow(Sam.train)),] 112 | print("Finished shuffle.") 113 | 114 | # Split into train.x, train.y, test.x, test.y 115 | print("Begin split into x/y.") 116 | Sam.train.x <- Sam.train[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 117 | Sam.train.y <- Sam.train[, c("ED_YTM"), with = FALSE] 118 | rm(Sam.train) 119 | Sam.test.x <- Sam.test[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 120 | Sam.test.y <- Sam.test[, c("ED_YTM"), with = FALSE] 121 | Sam.test.ids <- Sam.test[, c("StatePatientID"), with = FALSE] 122 | rm(Sam.test) 123 | print("Finished split into x/y.") 124 | 125 | # Change y to one-hot 126 | Sam.train.y[, zero := ifelse(ED_YTM == 0, 1, 0)] 127 | Sam.train.y[, one := ED_YTM] 128 | Sam.train.y[, ED_YTM := NULL] 129 | Sam.test.y[, zero := ifelse(ED_YTM == 0, 1, 0)] 130 | Sam.test.y[, one := ED_YTM] 131 | Sam.test.y[, ED_YTM := NULL] 132 | 133 | # Write all splits to file 134 | print("Begin write to file.") 135 | base_name <- ifelse(is.na(base_name), "SAMFull", base_name) 136 | fwrite(Sam.train.x, paste0(base_name, "_train_x", ".csv"), col.names = FALSE) 137 | fwrite(Sam.train.y, paste0(base_name, "_train_y", ".csv"), col.names = FALSE) 138 | fwrite(Sam.test.x, paste0(base_name, "_test_x", ".csv"), col.names = FALSE) 139 | fwrite(Sam.test.y, paste0(base_name, "_test_y", ".csv"), col.names = FALSE) 140 | fwrite(Sam.test.ids, paste0(base_name, "_test_ids", ".csv")) 141 | print("Finished write to file.") 142 | 143 | # Remove all columns with all zero entries 144 | # Sam <- Sam[,which(unlist(lapply(Sam, function(x)!all(is.zero(x))))),with=F] 145 | # print(str(Sam)) 146 | -------------------------------------------------------------------------------- /preprocess/ip_preprocess_and_split.R: -------------------------------------------------------------------------------- 1 | library(data.table) # Must have data.table v1.9.7+ 2 | library(readr) 3 | library(DMwR) 4 | library(ROSE) 5 | 6 | # Usage (must be run from command line) 7 | # Rscript 8 | # Program will print steps of execution and write 5 different files to disk: 9 | # - Train x (saved as base_name_train_x.csv) 10 | # - Train y (saved as base_name_train_y.csv as one-hot vectors) 11 | # - Test x (saved as base_name_test_x.csv) 12 | # - Test y (saved as base_name_test_y.csv as one-hot vectors) 13 | # - Test ids (saved as base_name_test_ids.csv) (patient ids in order of all the test cases) 14 | 15 | # Parse command line arguments 16 | args <- commandArgs(trailingOnly = TRUE) 17 | path_sam <- args[1] 18 | path_train_ids <- args[2] 19 | path_test_ids <- args[3] 20 | base_name <- args[4] 21 | 22 | # Read in raw files: SAM table, train case ids, and test case ids 23 | print(paste("Reading", path_sam)) 24 | Sam <- fread(path_sam, header = T) 25 | print(paste("Reading", path_train_ids)) 26 | Train_ids <- fread(path_train_ids, header = T) 27 | print(paste("Reading", path_test_ids)) 28 | Test_ids <- fread(path_test_ids, header = T) 29 | 30 | print("Done reading files.") 31 | 32 | # Reset headers of data tables to get rid of BOM in case it's there 33 | # http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r 34 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding = 
"UTF-8-BOM")) 35 | Train_ids.names <- names(read.csv(path_train_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 36 | Test_ids.names <- names(read.csv(path_test_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 37 | 38 | names(Sam) <- Sam.names 39 | names(Train_ids) <- Train_ids.names 40 | names(Test_ids) <- Test_ids.names 41 | 42 | print("Removed BOM from text") 43 | 44 | # Pre-processing functions 45 | is.zero <- function(v) { 46 | return(v==0) 47 | } 48 | 49 | unitScale <- function(v) { 50 | if (is.factor(v)) { 51 | return(v) 52 | } 53 | range <- max(v) - min(v) 54 | if (range == 0) { 55 | return(0) 56 | } 57 | return((v - min(v)) / range) 58 | } 59 | 60 | print(str(Sam)) 61 | 62 | # Test min value of Sam 63 | # Sam.maxs <- Sam[, lapply(.SD, max)] 64 | # print(str(Sam.maxs)) 65 | # print(sum(Sam.maxs==0)) 66 | 67 | # Change y values of IP/ED to 1/0 depending on return or not (binarize) 68 | Sam$ED_YTM <- ifelse(Sam$ED_YTM > 0, 1, 0) 69 | Sam$IP_YTM <- ifelse(Sam$IP_YTM > 0, 1, 0) 70 | 71 | # Change all necessary columns to factors to prevent scaling and 72 | # to assure SMOTE works 73 | Sam$StatePatientID <- as.factor(Sam$StatePatientID) 74 | Sam$ED_YTM <- as.factor(Sam$ED_YTM) 75 | Sam$IP_YTM <- as.factor(Sam$IP_YTM) 76 | 77 | # Scale all columns of Sam 78 | print("Starting to scale table.") 79 | Sam <- Sam[, lapply(.SD, unitScale)] 80 | print("Completed scaling of columns.") 81 | 82 | # Split into train and test 83 | print("Starting to split into train and test sets.") 84 | Sam.train <- Sam[StatePatientID %in% Train_ids[[1]]] 85 | Sam.test <- Sam[StatePatientID %in% Test_ids[[1]]] 86 | rm(Sam) 87 | print("Finished splitting into train and test sets.") 88 | 89 | # SMOTE algorithm for balancing training data by interpolated over/undersampling 90 | # Smote parameters 91 | print("Beginning to apply SMOTE algorithm.") 92 | percent_to_oversample <- 180 93 | percent_ratio_major_to_minor <- 200 94 | Sam.train <- SMOTE(IP_YTM ~ . -StatePatientID -ED_YTM, data = Sam.train, 95 | perc.over = percent_to_oversample, perc.under = percent_ratio_major_to_minor) 96 | print("Finished applying SMOTE algorithm.") 97 | 98 | # ROSE algorithm for balancing training data by over/undersampling 99 | # print("Beginning to apply ROSE algorithm.") 100 | # result_sample_size <- 200000 101 | # rare_proportion <- 0.4 102 | # Sam.train.without_factors <- Sam.train[, !c("StatePatientID", "ED_YTM"), with = FALSE] 103 | # Sam.train.factors <- Sam.train[, c("StatePatientID", "ED_YTM"), with = FALSE] 104 | # Sam.train <- ovun.sample(IP_YTM ~ . 
-StatePatientID -ED_YTM, data = Sam.train, 105 | # method = "both", N = result_sample_size, p = rare_proportion)$data 106 | # Sam.train <- data.table(Sam.train) 107 | # print("Finished applying ROSE algorithm.") 108 | 109 | # Shuffle train data to homogenize 0/1 y values 110 | print("Begin shuffle.") 111 | Sam.train <- Sam.train[sample(nrow(Sam.train)),] 112 | print("Finished shuffle.") 113 | 114 | # Split into train.x, train.y, test.x, test.y 115 | print("Begin split into x/y.") 116 | Sam.train.x <- Sam.train[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 117 | Sam.train.y <- Sam.train[, c("IP_YTM"), with = FALSE] 118 | rm(Sam.train) 119 | Sam.test.x <- Sam.test[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 120 | Sam.test.y <- Sam.test[, c("IP_YTM"), with = FALSE] 121 | Sam.test.ids <- Sam.test[, c("StatePatientID"), with = FALSE] 122 | rm(Sam.test) 123 | print("Finished split into x/y.") 124 | 125 | # Change y to one-hot 126 | Sam.train.y[, zero := ifelse(IP_YTM == 0, 1, 0)] 127 | Sam.train.y[, one := IP_YTM] 128 | Sam.train.y[, IP_YTM := NULL] 129 | Sam.test.y[, zero := ifelse(IP_YTM == 0, 1, 0)] 130 | Sam.test.y[, one := IP_YTM] 131 | Sam.test.y[, IP_YTM := NULL] 132 | 133 | # Write all splits to file 134 | print("Begin write to file.") 135 | base_name <- ifelse(is.na(base_name), "SAMFull", base_name) 136 | fwrite(Sam.train.x, paste0(base_name, "_train_x", ".csv")) 137 | fwrite(Sam.train.y, paste0(base_name, "_train_y", ".csv")) 138 | fwrite(Sam.test.x, paste0(base_name, "_test_x", ".csv")) 139 | fwrite(Sam.test.y, paste0(base_name, "_test_y", ".csv")) 140 | fwrite(Sam.test.ids, paste0(base_name, "_test_ids", ".csv")) 141 | print("Finished write to file.") 142 | 143 | # Remove all columns with all zero entries 144 | # Sam <- Sam[,which(unlist(lapply(Sam, function(x)!all(is.zero(x))))),with=F] 145 | # print(str(Sam)) 146 | -------------------------------------------------------------------------------- /preprocess/preprocess_and_split.R: -------------------------------------------------------------------------------- 1 | library(data.table) # Must have data.table v1.9.7+ 2 | library(readr) 3 | library(DMwR) 4 | library(ROSE) 5 | 6 | # Usage (must be run from command line) 7 | # Rscript 8 | # Program will print steps of execution and write 5 different files to disk: 9 | # - Train x (saved as base_name_train_x.csv) 10 | # - Train y (saved as base_name_train_y.csv as one-hot vectors) 11 | # - Test x (saved as base_name_test_x.csv) 12 | # - Test y (saved as base_name_test_y.csv as one-hot vectors) 13 | # - Test ids (saved as base_name_test_ids.csv) (patient ids in order of all the test cases) 14 | 15 | # Parse command line arguments 16 | args <- commandArgs(trailingOnly = TRUE) 17 | path_sam <- args[1] 18 | path_train_ids <- args[2] 19 | path_test_ids <- args[3] 20 | base_name <- args[4] 21 | 22 | # Read in raw files: SAM table, train case ids, and test case ids 23 | print(paste("Reading", path_sam)) 24 | Sam <- fread(path_sam, header = T) 25 | print(paste("Reading", path_train_ids)) 26 | Train_ids <- fread(path_train_ids, header = T) 27 | print(paste("Reading", path_test_ids)) 28 | Test_ids <- fread(path_test_ids, header = T) 29 | 30 | print("Done reading files.") 31 | 32 | # Reset headers of data tables to get rid of BOM in case it's there 33 | # http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r 34 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding = "UTF-8-BOM")) 35 | Train_ids.names <- names(read.csv(path_train_ids, nrows = 
1, fileEncoding = "UTF-8-BOM")) 36 | Test_ids.names <- names(read.csv(path_test_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 37 | 38 | names(Sam) <- Sam.names 39 | names(Train_ids) <- Train_ids.names 40 | names(Test_ids) <- Test_ids.names 41 | 42 | print("Removed BOM from text") 43 | 44 | # Pre-processing functions 45 | is.zero <- function(v) { 46 | return(v==0) 47 | } 48 | 49 | unitScale <- function(v) { 50 | if (is.factor(v)) { 51 | return(v) 52 | } 53 | range <- max(v) - min(v) 54 | if (range == 0) { 55 | return(0) 56 | } 57 | return((v - min(v)) / range) 58 | } 59 | 60 | print(str(Sam)) 61 | 62 | # Test min value of Sam 63 | # Sam.maxs <- Sam[, lapply(.SD, max)] 64 | # print(str(Sam.maxs)) 65 | # print(sum(Sam.maxs==0)) 66 | 67 | # Change y values of IP/ED to 1/0 depending on return or not (binarize) 68 | Sam$ED_YTM <- ifelse(Sam$ED_YTM > 0, 1, 0) 69 | Sam$IP_YTM <- ifelse(Sam$IP_YTM > 0, 1, 0) 70 | 71 | # Change all necessary columns to factors to prevent scaling and 72 | # to assure SMOTE works 73 | Sam$StatePatientID <- as.factor(Sam$StatePatientID) 74 | Sam$ED_YTM <- as.factor(Sam$ED_YTM) 75 | Sam$IP_YTM <- as.factor(Sam$IP_YTM) 76 | 77 | # Scale all columns of Sam 78 | print("Starting to scale table.") 79 | Sam <- Sam[, lapply(.SD, unitScale)] 80 | print("Completed scaling of columns.") 81 | 82 | # Split into train and test 83 | print("Starting to split into train and test sets.") 84 | Sam.train <- Sam[StatePatientID %in% Train_ids[[1]]] 85 | Sam.test <- Sam[StatePatientID %in% Test_ids[[1]]] 86 | rm(Sam) 87 | print("Finished splitting into train and test sets.") 88 | 89 | # SMOTE algorithm for balancing training data by interpolated over/undersampling 90 | # Smote parameters 91 | # print("Beginning to apply SMOTE algorithm.") 92 | # percent_to_oversample <- 500 93 | # percent_ratio_major_to_minor <- 100 94 | # Sam.train <- SMOTE(IP_YTM ~ . -StatePatientID -ED_YTM, data = Sam.train, 95 | # perc.over = percent_to_oversample, perc.under = percent_ratio_major_to_minor) 96 | # print("Finished applying SMOTE algorithm.") 97 | 98 | # ROSE algorithm for balancing training data by over/undersampling 99 | print("Beginning to apply ROSE algorithm.") 100 | result_sample_size <- 200000 101 | rare_proportion <- 0.4 102 | # Sam.train.without_factors <- Sam.train[, !c("StatePatientID", "ED_YTM"), with = FALSE] 103 | # Sam.train.factors <- Sam.train[, c("StatePatientID", "ED_YTM"), with = FALSE] 104 | Sam.train <- ovun.sample(IP_YTM ~ . 
-StatePatientID -ED_YTM, data = Sam.train, 105 | method = "both", N = result_sample_size, p = rare_proportion)$data 106 | Sam.train <- data.table(Sam.train) 107 | print("Finished applying ROSE algorithm.") 108 | 109 | # Shuffle train data to homogenize 0/1 y values 110 | print("Begin shuffle.") 111 | Sam.train <- Sam.train[sample(nrow(Sam.train)),] 112 | print("Finished shuffle.") 113 | 114 | # Split into train.x, train.y, test.x, test.y 115 | print("Begin split into x/y.") 116 | Sam.train.x <- Sam.train[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 117 | Sam.train.y <- Sam.train[, c("IP_YTM"), with = FALSE] 118 | rm(Sam.train) 119 | Sam.test.x <- Sam.test[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 120 | Sam.test.y <- Sam.test[, c("IP_YTM"), with = FALSE] 121 | Sam.test.ids <- Sam.test[, c("StatePatientID"), with = FALSE] 122 | rm(Sam.test) 123 | print("Finished split into x/y.") 124 | 125 | # Change y to one-hot 126 | Sam.train.y[, zero := ifelse(IP_YTM == 0, 1, 0)] 127 | Sam.train.y[, one := IP_YTM] 128 | Sam.train.y[, IP_YTM := NULL] 129 | Sam.test.y[, zero := ifelse(IP_YTM == 0, 1, 0)] 130 | Sam.test.y[, one := IP_YTM] 131 | Sam.test.y[, IP_YTM := NULL] 132 | 133 | # Write all splits to file 134 | print("Begin write to file.") 135 | base_name <- ifelse(is.na(base_name), "SAMFull", base_name) 136 | fwrite(Sam.train.x, paste0(base_name, "_train_x", ".csv"), col.names = FALSE) 137 | fwrite(Sam.train.y, paste0(base_name, "_train_y", ".csv"), col.names = FALSE) 138 | fwrite(Sam.test.x, paste0(base_name, "_test_x", ".csv"), col.names = FALSE) 139 | fwrite(Sam.test.y, paste0(base_name, "_test_y", ".csv"), col.names = FALSE) 140 | fwrite(Sam.test.ids, paste0(base_name, "_test_ids", ".csv")) 141 | print("Finished write to file.") 142 | 143 | # Remove all columns with all zero entries 144 | # Sam <- Sam[,which(unlist(lapply(Sam, function(x)!all(is.zero(x))))),with=F] 145 | # print(str(Sam)) 146 | -------------------------------------------------------------------------------- /preprocess/reduce_columns.R: -------------------------------------------------------------------------------- 1 | library(data.table) 2 | 3 | # For use with command line Rscript 4 | 5 | args <- commandArgs(trailingOnly = TRUE) 6 | path_sam <- args[1] 7 | path_columns <- args[2] 8 | dest_filename <- args[3] 9 | 10 | print("Reading files") 11 | Sam <- fread(path_sam) 12 | Columns <- fread(path_columns, header = FALSE) 13 | print("Finished reading files") 14 | 15 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding = "UTF-8-BOM")) 16 | Column.names <- c("features") 17 | 18 | names(Sam) <- Sam.names 19 | names(Columns) <- Column.names 20 | 21 | print("Removed BOM from text") 22 | print(str(Sam)) 23 | print(str(Columns)) 24 | 25 | Columns.vec <- Columns$features # first column 26 | # print("Before") 27 | # print(Columns.vec) 28 | Columns.vec <- Columns.vec[which(Columns.vec %in% colnames(Sam))] 29 | print("Reduced Columns vec") 30 | # print("After") 31 | # print(Columns.vec) 32 | 33 | print("Filtering columns") 34 | Sam <- Sam[, Columns.vec, with=F] 35 | fwrite(Sam, file.path = dest_filename) 36 | print("Done filtering columns") -------------------------------------------------------------------------------- /preprocess/roc.R: -------------------------------------------------------------------------------- 1 | library("ROCR") 2 | 3 | args <- commandArgs(trailingOnly = TRUE) 4 | pred_path <- args[1] 5 | labels_path <- args[2] 6 | 7 | pred <- read.csv(pred_path, header = FALSE)[,2] 8 | labels 
<- read.csv(labels_path, header = FALSE)[,2] 9 | 10 | pred <- prediction(pred, labels) 11 | perf <- performance(pred, measure = "tpr", x.measure = "fpr") # ROC 12 | pdf("ROC.pdf") 13 | plot(perf, col=rainbow(10)) 14 | dev.off() -------------------------------------------------------------------------------- /rf/rf2.py: -------------------------------------------------------------------------------- 1 | """ 2 | a random forest classifier 3 | with muilti-GPU utilization 4 | 5 | Tiffany.Fu 6 | 7 | """ 8 | 9 | 10 | from __future__ import absolute_import 11 | from __future__ import division 12 | from __future__ import print_function 13 | from __future__ import absolute_import 14 | from __future__ import division 15 | from __future__ import print_function 16 | 17 | from sklearn import datasets, metrics, cross_validation 18 | import tensorflow as tf 19 | from tensorflow.contrib import skflow 20 | 21 | 22 | 23 | import tensorflow as tf 24 | 25 | 26 | class TensorForestTrainer (tf.test.TestCase): 27 | 28 | def Classification(self): 29 | """classification using matrix data as input.""" 30 | hparams = tf.contrib.tensor_forest.python.tensor_forest.ForestHParams( 31 | num_trees=300, max_nodes=1000, num_classes=2, num_features=4) 32 | classifier = tf.contrib.learn.TensorForestEstimator(hparams) 33 | 34 | 35 | classifier.fit(x=, y=, steps=100) 36 | classifier.evaluate(x=, y=, steps=10) 37 | 38 | 39 | 40 | if __name__ == '__main__': 41 | tf.test.main() 42 | -------------------------------------------------------------------------------- /rf/rf3.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import time 6 | 7 | import numpy as np 8 | import six 9 | 10 | from tensorflow.contrib import framework as contrib_framework 11 | from tensorflow.contrib.learn.python.learn import monitors as mon 12 | 13 | from tensorflow.contrib.learn.python.learn.estimators import estimator 14 | from tensorflow.contrib.learn.python.learn.estimators import run_config 15 | 16 | from tensorflow.contrib.tensor_forest.client import eval_metrics 17 | from tensorflow.contrib.tensor_forest.data import data_ops 18 | from tensorflow.contrib.tensor_forest.python import tensor_forest 19 | 20 | from tensorflow.python.ops import array_ops 21 | from tensorflow.python.ops import control_flow_ops 22 | from tensorflow.python.ops import math_ops 23 | from tensorflow.python.ops import state_ops 24 | 25 | 26 | class LossMonitor(mon.EveryN): 27 | """Terminates training when training loss stops decreasing.""" 28 | 29 | def __init__(self, 30 | early_stopping_rounds, 31 | every_n_steps): 32 | super(LossMonitor, self).__init__(every_n_steps=every_n_steps) 33 | self.early_stopping_rounds = early_stopping_rounds 34 | self.min_loss = None 35 | self.min_loss_step = 0 36 | 37 | def set_estimator(self, est): 38 | """This function gets called in the same graph as _get_train_ops.""" 39 | super(LossMonitor, self).set_estimator(est) 40 | self._loss_op_name = est.training_loss.name 41 | 42 | def every_n_step_end(self, step, outputs): 43 | super(LossMonitor, self).every_n_step_end(step, outputs) 44 | current_loss = outputs[self._loss_op_name] 45 | if self.min_loss is None or current_loss < self.min_loss: 46 | self.min_loss = current_loss 47 | self.min_loss_step = step 48 | return step - self.min_loss_step >= self.early_stopping_rounds 49 | 50 | 51 | class TensorForestEstimator(estimator.BaseEstimator): 52 | 
"""An estimator that can train and evaluate a random forest.""" 53 | 54 | def __init__(self, params, device_assigner=None, model_dir=None, 55 | graph_builder_class=tensor_forest.RandomForestGraphs, 56 | master='', accuracy_metric=None, 57 | tf_random_seed=None, config=None): 58 | self.params = params.fill() 59 | self.accuracy_metric = (accuracy_metric or 60 | ('r2' if self.params.regression else 'accuracy')) 61 | self.data_feeder = None 62 | self.device_assigner = ( 63 | device_assigner or tensor_forest.RandomForestDeviceAssigner()) 64 | self.graph_builder_class = graph_builder_class 65 | self.training_args = {} 66 | self.construction_args = {} 67 | 68 | super(TensorForestEstimator, self).__init__(model_dir=model_dir, 69 | config=config) 70 | 71 | def predict_proba(self, x=None, input_fn=None, batch_size=None): 72 | """Returns prediction probabilities for given features (classification). 73 | Args: 74 | x: features. 75 | input_fn: Input function. If set, x and y must be None. 76 | batch_size: Override default batch size. 77 | Returns: 78 | Numpy array of predicted probabilities. 79 | Raises: 80 | ValueError: If both or neither of x and input_fn were given. 81 | """ 82 | return super(TensorForestEstimator, self).predict( 83 | x=x, input_fn=input_fn, batch_size=batch_size) 84 | 85 | def predict(self, x=None, input_fn=None, axis=None, batch_size=None): 86 | """Returns predictions for given features. 87 | Args: 88 | x: features. 89 | input_fn: Input function. If set, x must be None. 90 | axis: Axis on which to argmax (for classification). 91 | Last axis is used by default. 92 | batch_size: Override default batch size. 93 | Returns: 94 | Numpy array of predicted classes or regression values. 95 | """ 96 | probabilities = self.predict_proba(x, input_fn, batch_size) 97 | if self.params.regression: 98 | return probabilities 99 | else: 100 | return np.argmax(probabilities, axis=1) 101 | 102 | def _get_train_ops(self, features, targets): 103 | """Method that builds model graph and returns trainer ops. 104 | Args: 105 | features: `Tensor` or `dict` of `Tensor` objects. 106 | targets: `Tensor` or `dict` of `Tensor` objects. 107 | Returns: 108 | Tuple of train `Operation` and loss `Tensor`. 
109 | """ 110 | features, spec = data_ops.ParseDataTensorOrDict(features) 111 | labels = data_ops.ParseLabelTensorOrDict(targets) 112 | 113 | graph_builder = self.graph_builder_class( 114 | self.params, device_assigner=self.device_assigner, 115 | **self.construction_args) 116 | 117 | epoch = None 118 | if self.data_feeder: 119 | epoch = self.data_feeder.make_epoch_variable() 120 | 121 | train = control_flow_ops.group( 122 | graph_builder.training_graph( 123 | features, labels, data_spec=spec, epoch=epoch, 124 | **self.training_args), 125 | state_ops.assign_add(contrib_framework.get_global_step(), 1)) 126 | 127 | self.training_loss = graph_builder.training_loss(features, targets) 128 | 129 | return train, self.training_loss 130 | 131 | def _get_predict_ops(self, features): 132 | graph_builder = self.graph_builder_class( 133 | self.params, device_assigner=self.device_assigner, training=False, 134 | **self.construction_args) 135 | features, spec = data_ops.ParseDataTensorOrDict(features) 136 | return graph_builder.inference_graph(features, data_spec=spec) 137 | 138 | def _get_eval_ops(self, features, targets, metrics): 139 | features, spec = data_ops.ParseDataTensorOrDict(features) 140 | labels = data_ops.ParseLabelTensorOrDict(targets) 141 | 142 | graph_builder = self.graph_builder_class( 143 | self.params, device_assigner=self.device_assigner, training=False, 144 | **self.construction_args) 145 | 146 | probabilities = graph_builder.inference_graph(features, data_spec=spec) 147 | 148 | # One-hot the labels. 149 | if not self.params.regression: 150 | labels = math_ops.to_int64(array_ops.one_hot(math_ops.to_int64( 151 | array_ops.squeeze(labels)), self.params.num_classes, 1, 0)) 152 | 153 | if metrics is None: 154 | metrics = {self.accuracy_metric: 155 | eval_metrics.get_metric(self.accuracy_metric)} 156 | 157 | result = {} 158 | for name, metric in six.iteritems(metrics): 159 | result[name] = metric(probabilities, labels) 160 | 161 | return result 162 | -------------------------------------------------------------------------------- /rf/tensor_forest.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Extremely random forest graph builder. 
go/brain-tree.""" 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | 20 | import math 21 | import random 22 | 23 | from tensorflow.contrib.tensor_forest.python import constants 24 | from tensorflow.contrib.tensor_forest.python.ops import inference_ops 25 | from tensorflow.contrib.tensor_forest.python.ops import training_ops 26 | 27 | from tensorflow.python.framework import constant_op 28 | from tensorflow.python.framework import dtypes 29 | from tensorflow.python.framework import ops 30 | from tensorflow.python.ops import array_ops 31 | from tensorflow.python.ops import control_flow_ops 32 | from tensorflow.python.ops import init_ops 33 | from tensorflow.python.ops import math_ops 34 | from tensorflow.python.ops import random_ops 35 | from tensorflow.python.ops import state_ops 36 | from tensorflow.python.ops import variable_scope 37 | from tensorflow.python.ops import variables as tf_variables 38 | from tensorflow.python.platform import tf_logging as logging 39 | 40 | 41 | # A convenience class for holding random forest hyperparameters. 42 | # 43 | # To just get some good default parameters, use: 44 | # hparams = ForestHParams(num_classes=2, num_features=40).fill() 45 | # 46 | # Note that num_classes can not be inferred and so must always be specified. 47 | # Also, either num_splits_to_consider or num_features should be set. 48 | # 49 | # To override specific values, pass them to the constructor: 50 | # hparams = ForestHParams(num_classes=5, num_trees=10, num_features=5).fill() 51 | # 52 | # TODO(thomaswc): Inherit from tf.HParams when that is publicly available. 53 | class ForestHParams(object): 54 | """A base class for holding hyperparameters and calculating good defaults.""" 55 | 56 | def __init__(self, 57 | num_trees=100, 58 | max_nodes=10000, 59 | bagging_fraction=1.0, 60 | num_splits_to_consider=0, 61 | feature_bagging_fraction=1.0, 62 | max_fertile_nodes=0, 63 | split_after_samples=250, 64 | min_split_samples=5, 65 | valid_leaf_threshold=1, 66 | **kwargs): 67 | self.num_trees = num_trees 68 | self.max_nodes = max_nodes 69 | self.bagging_fraction = bagging_fraction 70 | self.feature_bagging_fraction = feature_bagging_fraction 71 | self.num_splits_to_consider = num_splits_to_consider 72 | self.max_fertile_nodes = max_fertile_nodes 73 | self.split_after_samples = split_after_samples 74 | self.min_split_samples = min_split_samples 75 | self.valid_leaf_threshold = valid_leaf_threshold 76 | 77 | for name, value in kwargs.items(): 78 | setattr(self, name, value) 79 | 80 | def values(self): 81 | return self.__dict__ 82 | 83 | def fill(self): 84 | """Intelligently sets any non-specific parameters.""" 85 | # Fail fast if num_classes or num_features isn't set. 86 | _ = getattr(self, 'num_classes') 87 | _ = getattr(self, 'num_features') 88 | 89 | self.bagged_num_features = int(self.feature_bagging_fraction * 90 | self.num_features) 91 | 92 | self.bagged_features = None 93 | if self.feature_bagging_fraction < 1.0: 94 | self.bagged_features = [random.sample( 95 | range(self.num_features), 96 | self.bagged_num_features) for _ in range(self.num_trees)] 97 | 98 | self.regression = getattr(self, 'regression', False) 99 | 100 | # Num_outputs is the actual number of outputs (a single prediction for 101 | # classification, a N-dimenensional point for regression). 
102 | self.num_outputs = self.num_classes if self.regression else 1 103 | 104 | # Add an extra column to classes for storing counts, which is needed for 105 | # regression and avoids having to recompute sums for classification. 106 | self.num_output_columns = self.num_classes + 1 107 | 108 | # The Random Forest literature recommends sqrt(# features) for 109 | # classification problems, and p/3 for regression problems. 110 | # TODO(thomaswc): Consider capping this for large number of features. 111 | self.num_splits_to_consider = ( 112 | self.num_splits_to_consider or 113 | max(10, int(math.ceil(math.sqrt(self.num_features))))) 114 | 115 | # max_fertile_nodes doesn't effect performance, only training speed. 116 | # We therefore set it primarily based upon space considerations. 117 | # Each fertile node takes up num_splits_to_consider times as much 118 | # as space as a non-fertile node. We want the fertile nodes to in 119 | # total only take up as much space as the non-fertile nodes, so 120 | num_fertile = int(math.ceil(self.max_nodes / self.num_splits_to_consider)) 121 | # But always use at least 1000 accumulate slots. 122 | num_fertile = max(num_fertile, 1000) 123 | self.max_fertile_nodes = self.max_fertile_nodes or num_fertile 124 | # But it also never needs to be larger than the number of leaves, 125 | # which is max_nodes / 2. 126 | self.max_fertile_nodes = min(self.max_fertile_nodes, 127 | int(math.ceil(self.max_nodes / 2.0))) 128 | 129 | # We have num_splits_to_consider slots to fill, and we want to spend 130 | # approximately split_after_samples samples initializing them. 131 | num_split_initializiations_per_input = max(1, int(math.floor( 132 | self.num_splits_to_consider / self.split_after_samples))) 133 | self.split_initializations_per_input = getattr( 134 | self, 'split_initializations_per_input', 135 | num_split_initializiations_per_input) 136 | 137 | # If base_random_seed is 0, the current time will be used to seed the 138 | # random number generators for each tree. If non-zero, the i-th tree 139 | # will be seeded with base_random_seed + i. 140 | self.base_random_seed = getattr(self, 'base_random_seed', 0) 141 | 142 | return self 143 | 144 | 145 | # A simple container to hold the training variables for a single tree. 146 | class TreeTrainingVariables(object): 147 | """Stores tf.Variables for training a single random tree. 148 | 149 | Uses tf.get_variable to get tree-specific names so that this can be used 150 | with a tf.learn-style implementation (one that trains a model, saves it, 151 | then relies on restoring that model to evaluate). 
152 | """ 153 | 154 | def __init__(self, params, tree_num, training): 155 | self.tree = variable_scope.get_variable( 156 | name=self.get_tree_name('tree', tree_num), dtype=dtypes.int32, 157 | shape=[params.max_nodes, 2], 158 | initializer=init_ops.constant_initializer(-2)) 159 | self.tree_thresholds = variable_scope.get_variable( 160 | name=self.get_tree_name('tree_thresholds', tree_num), 161 | shape=[params.max_nodes], 162 | initializer=init_ops.constant_initializer(-1.0)) 163 | self.end_of_tree = variable_scope.get_variable( 164 | name=self.get_tree_name('end_of_tree', tree_num), 165 | dtype=dtypes.int32, 166 | initializer=constant_op.constant([1])) 167 | self.start_epoch = tf_variables.Variable( 168 | [0] * (params.max_nodes), name='start_epoch') 169 | 170 | if training: 171 | self.node_to_accumulator_map = variable_scope.get_variable( 172 | name=self.get_tree_name('node_to_accumulator_map', tree_num), 173 | shape=[params.max_nodes], 174 | dtype=dtypes.int32, 175 | initializer=init_ops.constant_initializer(-1)) 176 | 177 | self.candidate_split_features = variable_scope.get_variable( 178 | name=self.get_tree_name('candidate_split_features', tree_num), 179 | shape=[params.max_fertile_nodes, params.num_splits_to_consider], 180 | dtype=dtypes.int32, 181 | initializer=init_ops.constant_initializer(-1)) 182 | self.candidate_split_thresholds = variable_scope.get_variable( 183 | name=self.get_tree_name('candidate_split_thresholds', tree_num), 184 | shape=[params.max_fertile_nodes, params.num_splits_to_consider], 185 | initializer=init_ops.constant_initializer(0.0)) 186 | 187 | # Statistics shared by classification and regression. 188 | self.node_sums = variable_scope.get_variable( 189 | name=self.get_tree_name('node_sums', tree_num), 190 | shape=[params.max_nodes, params.num_output_columns], 191 | initializer=init_ops.constant_initializer(0.0)) 192 | 193 | if training: 194 | self.candidate_split_sums = variable_scope.get_variable( 195 | name=self.get_tree_name('candidate_split_sums', tree_num), 196 | shape=[params.max_fertile_nodes, params.num_splits_to_consider, 197 | params.num_output_columns], 198 | initializer=init_ops.constant_initializer(0.0)) 199 | self.accumulator_sums = variable_scope.get_variable( 200 | name=self.get_tree_name('accumulator_sums', tree_num), 201 | shape=[params.max_fertile_nodes, params.num_output_columns], 202 | initializer=init_ops.constant_initializer(-1.0)) 203 | 204 | # Regression also tracks second order stats. 
205 | if params.regression: 206 | self.node_squares = variable_scope.get_variable( 207 | name=self.get_tree_name('node_squares', tree_num), 208 | shape=[params.max_nodes, params.num_output_columns], 209 | initializer=init_ops.constant_initializer(0.0)) 210 | 211 | self.candidate_split_squares = variable_scope.get_variable( 212 | name=self.get_tree_name('candidate_split_squares', tree_num), 213 | shape=[params.max_fertile_nodes, params.num_splits_to_consider, 214 | params.num_output_columns], 215 | initializer=init_ops.constant_initializer(0.0)) 216 | 217 | self.accumulator_squares = variable_scope.get_variable( 218 | name=self.get_tree_name('accumulator_squares', tree_num), 219 | shape=[params.max_fertile_nodes, params.num_output_columns], 220 | initializer=init_ops.constant_initializer(-1.0)) 221 | 222 | else: 223 | self.node_squares = constant_op.constant( 224 | 0.0, name=self.get_tree_name('node_squares', tree_num)) 225 | 226 | self.candidate_split_squares = constant_op.constant( 227 | 0.0, name=self.get_tree_name('candidate_split_squares', tree_num)) 228 | 229 | self.accumulator_squares = constant_op.constant( 230 | 0.0, name=self.get_tree_name('accumulator_squares', tree_num)) 231 | 232 | def get_tree_name(self, name, num): 233 | return '{0}-{1}'.format(name, num) 234 | 235 | 236 | class ForestStats(object): 237 | 238 | def __init__(self, tree_stats, params): 239 | """A simple container for stats about a forest.""" 240 | self.tree_stats = tree_stats 241 | self.params = params 242 | 243 | def get_average(self, thing): 244 | val = 0.0 245 | for i in range(self.params.num_trees): 246 | val += getattr(self.tree_stats[i], thing) 247 | 248 | return val / self.params.num_trees 249 | 250 | 251 | class TreeStats(object): 252 | 253 | def __init__(self, num_nodes, num_leaves): 254 | self.num_nodes = num_nodes 255 | self.num_leaves = num_leaves 256 | 257 | 258 | class ForestTrainingVariables(object): 259 | """A container for a forests training data, consisting of multiple trees. 260 | 261 | Instantiates a TreeTrainingVariables object for each tree. We override the 262 | __getitem__ and __setitem__ function so that usage looks like this: 263 | 264 | forest_variables = ForestTrainingVariables(params) 265 | 266 | ... forest_variables.tree ... 267 | """ 268 | 269 | def __init__(self, params, device_assigner, training=True, 270 | tree_variables_class=TreeTrainingVariables): 271 | self.variables = [] 272 | for i in range(params.num_trees): 273 | with ops.device(device_assigner.get_device(i)): 274 | self.variables.append(tree_variables_class(params, i, training)) 275 | 276 | def __setitem__(self, t, val): 277 | self.variables[t] = val 278 | 279 | def __getitem__(self, t): 280 | return self.variables[t] 281 | 282 | 283 | class RandomForestDeviceAssigner(object): 284 | """A device assigner that uses the default device. 285 | 286 | Write subclasses that implement get_device for control over how trees 287 | get assigned to devices. This assumes that whole trees are assigned 288 | to a device. 
289 | """ 290 | 291 | def __init__(self): 292 | self.cached = None 293 | 294 | def get_device(self, unused_tree_num): 295 | if not self.cached: 296 | dummy = constant_op.constant(0) 297 | self.cached = dummy.device 298 | 299 | return self.cached 300 | 301 | 302 | class RandomForestGraphs(object): 303 | """Builds TF graphs for random forest training and inference.""" 304 | 305 | def __init__(self, params, device_assigner=None, 306 | variables=None, tree_variables_class=TreeTrainingVariables, 307 | tree_graphs=None, training=True, 308 | t_ops=training_ops, 309 | i_ops=inference_ops): 310 | self.params = params 311 | self.device_assigner = device_assigner or RandomForestDeviceAssigner() 312 | logging.info('Constructing forest with params = ') 313 | logging.info(self.params.__dict__) 314 | self.variables = variables or ForestTrainingVariables( 315 | self.params, device_assigner=self.device_assigner, training=training, 316 | tree_variables_class=tree_variables_class) 317 | tree_graph_class = tree_graphs or RandomTreeGraphs 318 | self.trees = [ 319 | tree_graph_class( 320 | self.variables[i], self.params, 321 | t_ops.Load(), i_ops.Load(), i) 322 | for i in range(self.params.num_trees)] 323 | 324 | def _bag_features(self, tree_num, input_data): 325 | split_data = array_ops.split(1, self.params.num_features, input_data) 326 | return array_ops.concat( 327 | 1, [split_data[ind] for ind in self.params.bagged_features[tree_num]]) 328 | 329 | def training_graph(self, input_data, input_labels, data_spec=None, 330 | epoch=None, **tree_kwargs): 331 | """Constructs a TF graph for training a random forest. 332 | 333 | Args: 334 | input_data: A tensor or SparseTensor or placeholder for input data. 335 | input_labels: A tensor or placeholder for labels associated with 336 | input_data. 337 | data_spec: A list of tf.dtype values specifying the original types of 338 | each column. 339 | epoch: A tensor or placeholder for the epoch the training data comes from. 340 | **tree_kwargs: Keyword arguments passed to each tree's training_graph. 341 | 342 | Returns: 343 | The last op in the random forest training graph. 344 | """ 345 | data_spec = [constants.DATA_FLOAT] if data_spec is None else data_spec 346 | tree_graphs = [] 347 | for i in range(self.params.num_trees): 348 | with ops.device(self.device_assigner.get_device(i)): 349 | seed = self.params.base_random_seed 350 | if seed != 0: 351 | seed += i 352 | # If using bagging, randomly select some of the input. 353 | tree_data = input_data 354 | tree_labels = input_labels 355 | if self.params.bagging_fraction < 1.0: 356 | # TODO(thomaswc): This does sampling without replacment. Consider 357 | # also allowing sampling with replacement as an option. 358 | batch_size = array_ops.slice(array_ops.shape(input_data), [0], [1]) 359 | r = random_ops.random_uniform(batch_size, seed=seed) 360 | mask = math_ops.less( 361 | r, array_ops.ones_like(r) * self.params.bagging_fraction) 362 | gather_indices = array_ops.squeeze( 363 | array_ops.where(mask), squeeze_dims=[1]) 364 | # TODO(thomaswc): Calculate out-of-bag data and labels, and store 365 | # them for use in calculating statistics later. 
366 | tree_data = array_ops.gather(input_data, gather_indices) 367 | tree_labels = array_ops.gather(input_labels, gather_indices) 368 | if self.params.bagged_features: 369 | tree_data = self._bag_features(i, tree_data) 370 | 371 | initialization = self.trees[i].tree_initialization() 372 | 373 | with ops.control_dependencies([initialization]): 374 | tree_graphs.append( 375 | self.trees[i].training_graph( 376 | tree_data, tree_labels, seed, data_spec=data_spec, 377 | epoch=([0] if epoch is None else epoch), 378 | **tree_kwargs)) 379 | 380 | return control_flow_ops.group(*tree_graphs, name='train') 381 | 382 | def inference_graph(self, input_data, data_spec=None): 383 | """Constructs a TF graph for evaluating a random forest. 384 | 385 | Args: 386 | input_data: A tensor or SparseTensor or placeholder for input data. 387 | data_spec: A list of tf.dtype values specifying the original types of 388 | each column. 389 | 390 | Returns: 391 | The last op in the random forest inference graph. 392 | """ 393 | data_spec = [constants.DATA_FLOAT] if data_spec is None else data_spec 394 | probabilities = [] 395 | for i in range(self.params.num_trees): 396 | with ops.device(self.device_assigner.get_device(i)): 397 | tree_data = input_data 398 | if self.params.bagged_features: 399 | tree_data = self._bag_features(i, input_data) 400 | probabilities.append(self.trees[i].inference_graph(tree_data, 401 | data_spec)) 402 | with ops.device(self.device_assigner.get_device(0)): 403 | all_predict = array_ops.pack(probabilities) 404 | return math_ops.div( 405 | math_ops.reduce_sum(all_predict, 0), self.params.num_trees, 406 | name='probabilities') 407 | 408 | def average_size(self): 409 | """Constructs a TF graph for evaluating the average size of a forest. 410 | 411 | Returns: 412 | The average number of nodes over the trees. 413 | """ 414 | sizes = [] 415 | for i in range(self.params.num_trees): 416 | with ops.device(self.device_assigner.get_device(i)): 417 | sizes.append(self.trees[i].size()) 418 | return math_ops.reduce_mean(array_ops.pack(sizes)) 419 | 420 | # pylint: disable=unused-argument 421 | def training_loss(self, features, labels): 422 | return math_ops.neg(self.average_size()) 423 | 424 | # pylint: disable=unused-argument 425 | def validation_loss(self, features, labels): 426 | return math_ops.neg(self.average_size()) 427 | 428 | def average_impurity(self): 429 | """Constructs a TF graph for evaluating the leaf impurity of a forest. 430 | 431 | Returns: 432 | The last op in the graph. 
433 | """ 434 | impurities = [] 435 | for i in range(self.params.num_trees): 436 | with ops.device(self.device_assigner.get_device(i)): 437 | impurities.append(self.trees[i].average_impurity()) 438 | return math_ops.reduce_mean(array_ops.pack(impurities)) 439 | 440 | def get_stats(self, session): 441 | tree_stats = [] 442 | for i in range(self.params.num_trees): 443 | with ops.device(self.device_assigner.get_device(i)): 444 | tree_stats.append(self.trees[i].get_stats(session)) 445 | return ForestStats(tree_stats, self.params) 446 | 447 | 448 | class RandomTreeGraphs(object): 449 | """Builds TF graphs for random tree training and inference.""" 450 | 451 | def __init__(self, variables, params, t_ops, i_ops, tree_num): 452 | self.training_ops = t_ops 453 | self.inference_ops = i_ops 454 | self.variables = variables 455 | self.params = params 456 | self.tree_num = tree_num 457 | 458 | def tree_initialization(self): 459 | def _init_tree(): 460 | return state_ops.scatter_update(self.variables.tree, [0], [[-1, -1]]).op 461 | 462 | def _nothing(): 463 | return control_flow_ops.no_op() 464 | 465 | return control_flow_ops.cond( 466 | math_ops.equal(array_ops.squeeze(array_ops.slice( 467 | self.variables.tree, [0, 0], [1, 1])), -2), 468 | _init_tree, _nothing) 469 | 470 | def _gini(self, class_counts): 471 | """Calculate the Gini impurity. 472 | 473 | If c(i) denotes the i-th class count and c = sum_i c(i) then 474 | score = 1 - sum_i ( c(i) / c )^2 475 | 476 | Args: 477 | class_counts: A 2-D tensor of per-class counts, usually a slice or 478 | gather from variables.node_sums. 479 | 480 | Returns: 481 | A 1-D tensor of the Gini impurities for each row in the input. 482 | """ 483 | smoothed = 1.0 + array_ops.slice(class_counts, [0, 1], [-1, -1]) 484 | sums = math_ops.reduce_sum(smoothed, 1) 485 | sum_squares = math_ops.reduce_sum(math_ops.square(smoothed), 1) 486 | 487 | return 1.0 - sum_squares / (sums * sums) 488 | 489 | def _weighted_gini(self, class_counts): 490 | """Our split score is the Gini impurity times the number of examples. 491 | 492 | If c(i) denotes the i-th class count and c = sum_i c(i) then 493 | score = c * (1 - sum_i ( c(i) / c )^2 ) 494 | = c - sum_i c(i)^2 / c 495 | Args: 496 | class_counts: A 2-D tensor of per-class counts, usually a slice or 497 | gather from variables.node_sums. 498 | 499 | Returns: 500 | A 1-D tensor of the Gini impurities for each row in the input. 501 | """ 502 | smoothed = 1.0 + array_ops.slice(class_counts, [0, 1], [-1, -1]) 503 | sums = math_ops.reduce_sum(smoothed, 1) 504 | sum_squares = math_ops.reduce_sum(math_ops.square(smoothed), 1) 505 | 506 | return sums - sum_squares / sums 507 | 508 | def _variance(self, sums, squares): 509 | """Calculate the variance for each row of the input tensors. 510 | 511 | Variance is V = E[x^2] - (E[x])^2. 512 | 513 | Args: 514 | sums: A tensor containing output sums, usually a slice from 515 | variables.node_sums. Should contain the number of examples seen 516 | in index 0 so we can calculate expected value. 517 | squares: Same as sums, but sums of squares. 518 | 519 | Returns: 520 | A 1-D tensor of the variances for each row in the input. 521 | """ 522 | total_count = array_ops.slice(sums, [0, 0], [-1, 1]) 523 | e_x = sums / total_count 524 | e_x2 = squares / total_count 525 | 526 | return math_ops.reduce_sum(e_x2 - math_ops.square(e_x), 1) 527 | 528 | def training_graph(self, input_data, input_labels, random_seed, 529 | data_spec, epoch=None): 530 | 531 | """Constructs a TF graph for training a random tree. 
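
    The returned op groups the per-batch steps implemented below: counting node
    and candidate-split statistics, sampling new candidate split features and
    thresholds, finding finished fertile nodes and their best splits, growing
    the tree, and reallocating fertile slots (with their accumulators reset).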
532 | 533 | Args: 534 | input_data: A tensor or SparseTensor or placeholder for input data. 535 | input_labels: A tensor or placeholder for labels associated with 536 | input_data. 537 | random_seed: The random number generator seed to use for this tree. 0 538 | means use the current time as the seed. 539 | data_spec: A list of tf.dtype values specifying the original types of 540 | each column. 541 | epoch: A tensor or placeholder for the epoch the training data comes from. 542 | 543 | Returns: 544 | The last op in the random tree training graph. 545 | """ 546 | epoch = [0] if epoch is None else epoch 547 | 548 | sparse_indices = [] 549 | sparse_values = [] 550 | sparse_shape = [] 551 | if isinstance(input_data, ops.SparseTensor): 552 | sparse_indices = input_data.indices 553 | sparse_values = input_data.values 554 | sparse_shape = input_data.shape 555 | input_data = [] 556 | 557 | # Count extremely random stats. 558 | (node_sums, node_squares, splits_indices, splits_sums, 559 | splits_squares, totals_indices, totals_sums, 560 | totals_squares, input_leaves) = ( 561 | self.training_ops.count_extremely_random_stats( 562 | input_data, sparse_indices, sparse_values, sparse_shape, 563 | data_spec, input_labels, self.variables.tree, 564 | self.variables.tree_thresholds, 565 | self.variables.node_to_accumulator_map, 566 | self.variables.candidate_split_features, 567 | self.variables.candidate_split_thresholds, 568 | self.variables.start_epoch, epoch, 569 | num_classes=self.params.num_output_columns, 570 | regression=self.params.regression)) 571 | node_update_ops = [] 572 | node_update_ops.append( 573 | state_ops.assign_add(self.variables.node_sums, node_sums)) 574 | 575 | splits_update_ops = [] 576 | splits_update_ops.append(self.training_ops.scatter_add_ndim( 577 | self.variables.candidate_split_sums, 578 | splits_indices, splits_sums)) 579 | splits_update_ops.append(self.training_ops.scatter_add_ndim( 580 | self.variables.accumulator_sums, totals_indices, 581 | totals_sums)) 582 | 583 | if self.params.regression: 584 | node_update_ops.append(state_ops.assign_add(self.variables.node_squares, 585 | node_squares)) 586 | splits_update_ops.append(self.training_ops.scatter_add_ndim( 587 | self.variables.candidate_split_squares, 588 | splits_indices, splits_squares)) 589 | splits_update_ops.append(self.training_ops.scatter_add_ndim( 590 | self.variables.accumulator_squares, totals_indices, 591 | totals_squares)) 592 | 593 | # Sample inputs. 594 | update_indices, feature_updates, threshold_updates = ( 595 | self.training_ops.sample_inputs( 596 | input_data, sparse_indices, sparse_values, sparse_shape, 597 | self.variables.node_to_accumulator_map, 598 | input_leaves, self.variables.candidate_split_features, 599 | self.variables.candidate_split_thresholds, 600 | split_initializations_per_input=( 601 | self.params.split_initializations_per_input), 602 | split_sampling_random_seed=random_seed)) 603 | update_features_op = state_ops.scatter_update( 604 | self.variables.candidate_split_features, update_indices, 605 | feature_updates) 606 | update_thresholds_op = state_ops.scatter_update( 607 | self.variables.candidate_split_thresholds, update_indices, 608 | threshold_updates) 609 | 610 | # Calculate finished nodes. 
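    # Side note on the scoring helper used for the leaf scores further below:
    # _weighted_gini reduces to simple row-wise arithmetic on the (+1 smoothed)
    # per-class counts. A NumPy sketch (illustrative only; counts here exclude
    # the leading total-count column):
    #
    #   smoothed = 1.0 + np.asarray(class_counts, dtype=np.float64)
    #   sums = smoothed.sum(axis=1)
    #   score = sums - (smoothed ** 2).sum(axis=1) / sums
    #
    # e.g. a pure node [[10, 0]] scores ~1.83 while a mixed node [[5, 5]]
    # scores 6.0, so lower is better.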
611 | with ops.control_dependencies(splits_update_ops): 612 | children = array_ops.squeeze(array_ops.slice( 613 | self.variables.tree, [0, 0], [-1, 1]), squeeze_dims=[1]) 614 | is_leaf = math_ops.equal(constants.LEAF_NODE, children) 615 | leaves = math_ops.to_int32(array_ops.squeeze(array_ops.where(is_leaf), 616 | squeeze_dims=[1])) 617 | finished, stale = self.training_ops.finished_nodes( 618 | leaves, self.variables.node_to_accumulator_map, 619 | self.variables.candidate_split_sums, 620 | self.variables.candidate_split_squares, 621 | self.variables.accumulator_sums, 622 | self.variables.accumulator_squares, 623 | self.variables.start_epoch, epoch, 624 | num_split_after_samples=self.params.split_after_samples, 625 | min_split_samples=self.params.min_split_samples) 626 | 627 | # Update leaf scores. 628 | non_fertile_leaves = array_ops.boolean_mask( 629 | leaves, math_ops.less(array_ops.gather( 630 | self.variables.node_to_accumulator_map, leaves), 0)) 631 | 632 | # TODO(gilberth): It should be possible to limit the number of non 633 | # fertile leaves we calculate scores for, especially since we can only take 634 | # at most array_ops.shape(finished)[0] of them. 635 | with ops.control_dependencies(node_update_ops): 636 | sums = array_ops.gather(self.variables.node_sums, non_fertile_leaves) 637 | if self.params.regression: 638 | squares = array_ops.gather(self.variables.node_squares, 639 | non_fertile_leaves) 640 | non_fertile_leaf_scores = self._variance(sums, squares) 641 | else: 642 | non_fertile_leaf_scores = self._weighted_gini(sums) 643 | 644 | # Calculate best splits. 645 | with ops.control_dependencies(splits_update_ops): 646 | split_indices = self.training_ops.best_splits( 647 | finished, self.variables.node_to_accumulator_map, 648 | self.variables.candidate_split_sums, 649 | self.variables.candidate_split_squares, 650 | self.variables.accumulator_sums, 651 | self.variables.accumulator_squares, 652 | regression=self.params.regression) 653 | 654 | # Grow tree. 655 | with ops.control_dependencies([update_features_op, update_thresholds_op]): 656 | (tree_update_indices, tree_children_updates, tree_threshold_updates, 657 | new_eot) = (self.training_ops.grow_tree( 658 | self.variables.end_of_tree, self.variables.node_to_accumulator_map, 659 | finished, split_indices, self.variables.candidate_split_features, 660 | self.variables.candidate_split_thresholds)) 661 | tree_update_op = state_ops.scatter_update( 662 | self.variables.tree, tree_update_indices, tree_children_updates) 663 | thresholds_update_op = state_ops.scatter_update( 664 | self.variables.tree_thresholds, tree_update_indices, 665 | tree_threshold_updates) 666 | # TODO(thomaswc): Only update the epoch on the new leaves. 667 | new_epoch_updates = epoch * array_ops.ones_like(tree_threshold_updates, 668 | dtype=dtypes.int32) 669 | epoch_update_op = state_ops.scatter_update( 670 | self.variables.start_epoch, tree_update_indices, 671 | new_epoch_updates) 672 | 673 | # Update fertile slots. 674 | with ops.control_dependencies([tree_update_op]): 675 | (node_map_updates, accumulators_cleared, accumulators_allocated) = ( 676 | self.training_ops.update_fertile_slots( 677 | finished, 678 | non_fertile_leaves, 679 | non_fertile_leaf_scores, 680 | self.variables.end_of_tree, 681 | self.variables.accumulator_sums, 682 | self.variables.node_to_accumulator_map, 683 | stale, 684 | regression=self.params.regression)) 685 | 686 | # Ensure end_of_tree doesn't get updated until UpdateFertileSlots has 687 | # used it to calculate new leaves. 
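    # (control_flow_ops.tuple only returns its tensors after all of its inputs
    # and control_inputs have run, so the assign below is ordered after the
    # node map update.)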
688 | gated_new_eot, = control_flow_ops.tuple([new_eot], 689 | control_inputs=[node_map_updates]) 690 | eot_update_op = state_ops.assign(self.variables.end_of_tree, gated_new_eot) 691 | 692 | updates = [] 693 | updates.append(eot_update_op) 694 | updates.append(tree_update_op) 695 | updates.append(thresholds_update_op) 696 | updates.append(epoch_update_op) 697 | 698 | updates.append(state_ops.scatter_update( 699 | self.variables.node_to_accumulator_map, 700 | array_ops.squeeze(array_ops.slice(node_map_updates, [0, 0], [1, -1]), 701 | squeeze_dims=[0]), 702 | array_ops.squeeze(array_ops.slice(node_map_updates, [1, 0], [1, -1]), 703 | squeeze_dims=[0]))) 704 | 705 | cleared_and_allocated_accumulators = array_ops.concat( 706 | 0, [accumulators_cleared, accumulators_allocated]) 707 | # Calculate values to put into scatter update for candidate counts. 708 | # Candidate split counts are always reset back to 0 for both cleared 709 | # and allocated accumulators. This means some accumulators might be doubly 710 | # reset to 0 if the were released and not allocated, then later allocated. 711 | split_values = array_ops.tile( 712 | array_ops.expand_dims(array_ops.expand_dims( 713 | array_ops.zeros_like(cleared_and_allocated_accumulators, 714 | dtype=dtypes.float32), 1), 2), 715 | [1, self.params.num_splits_to_consider, self.params.num_output_columns]) 716 | updates.append(state_ops.scatter_update( 717 | self.variables.candidate_split_sums, 718 | cleared_and_allocated_accumulators, split_values)) 719 | if self.params.regression: 720 | updates.append(state_ops.scatter_update( 721 | self.variables.candidate_split_squares, 722 | cleared_and_allocated_accumulators, split_values)) 723 | 724 | # Calculate values to put into scatter update for total counts. 725 | total_cleared = array_ops.tile( 726 | array_ops.expand_dims( 727 | math_ops.neg(array_ops.ones_like(accumulators_cleared, 728 | dtype=dtypes.float32)), 1), 729 | [1, self.params.num_output_columns]) 730 | total_reset = array_ops.tile( 731 | array_ops.expand_dims( 732 | array_ops.zeros_like(accumulators_allocated, 733 | dtype=dtypes.float32), 1), 734 | [1, self.params.num_output_columns]) 735 | accumulator_updates = array_ops.concat(0, [total_cleared, total_reset]) 736 | updates.append(state_ops.scatter_update( 737 | self.variables.accumulator_sums, 738 | cleared_and_allocated_accumulators, accumulator_updates)) 739 | if self.params.regression: 740 | updates.append(state_ops.scatter_update( 741 | self.variables.accumulator_squares, 742 | cleared_and_allocated_accumulators, accumulator_updates)) 743 | 744 | # Calculate values to put into scatter update for candidate splits. 745 | split_features_updates = array_ops.tile( 746 | array_ops.expand_dims( 747 | math_ops.neg(array_ops.ones_like( 748 | cleared_and_allocated_accumulators)), 1), 749 | [1, self.params.num_splits_to_consider]) 750 | updates.append(state_ops.scatter_update( 751 | self.variables.candidate_split_features, 752 | cleared_and_allocated_accumulators, split_features_updates)) 753 | 754 | updates += self.finish_iteration() 755 | 756 | return control_flow_ops.group(*updates) 757 | 758 | def finish_iteration(self): 759 | """Perform any operations that should be done at the end of an iteration. 760 | 761 | This is mostly useful for subclasses that need to reset variables after 762 | an iteration, such as ones that are used to finish nodes. 763 | 764 | Returns: 765 | A list of operations. 
766 | """ 767 | return [] 768 | 769 | def inference_graph(self, input_data, data_spec): 770 | """Constructs a TF graph for evaluating a random tree. 771 | 772 | Args: 773 | input_data: A tensor or SparseTensor or placeholder for input data. 774 | data_spec: A list of tf.dtype values specifying the original types of 775 | each column. 776 | 777 | Returns: 778 | The last op in the random tree inference graph. 779 | """ 780 | sparse_indices = [] 781 | sparse_values = [] 782 | sparse_shape = [] 783 | if isinstance(input_data, ops.SparseTensor): 784 | sparse_indices = input_data.indices 785 | sparse_values = input_data.values 786 | sparse_shape = input_data.shape 787 | input_data = [] 788 | return self.inference_ops.tree_predictions( 789 | input_data, sparse_indices, sparse_values, sparse_shape, data_spec, 790 | self.variables.tree, 791 | self.variables.tree_thresholds, 792 | self.variables.node_sums, 793 | valid_leaf_threshold=self.params.valid_leaf_threshold) 794 | 795 | def average_impurity(self): 796 | """Constructs a TF graph for evaluating the average leaf impurity of a tree. 797 | 798 | If in regression mode, this is the leaf variance. If in classification mode, 799 | this is the gini impurity. 800 | 801 | Returns: 802 | The last op in the graph. 803 | """ 804 | children = array_ops.squeeze(array_ops.slice( 805 | self.variables.tree, [0, 0], [-1, 1]), squeeze_dims=[1]) 806 | is_leaf = math_ops.equal(constants.LEAF_NODE, children) 807 | leaves = math_ops.to_int32(array_ops.squeeze(array_ops.where(is_leaf), 808 | squeeze_dims=[1])) 809 | counts = array_ops.gather(self.variables.node_sums, leaves) 810 | gini = self._weighted_gini(counts) 811 | # Guard against step 1, when there often are no leaves yet. 812 | def impurity(): 813 | return gini 814 | # Since average impurity can be used for loss, when there's no data just 815 | # return a big number so that loss always decreases. 816 | def big(): 817 | return array_ops.ones_like(gini, dtype=dtypes.float32) * 10000000. 818 | return control_flow_ops.cond(math_ops.greater( 819 | array_ops.shape(leaves)[0], 0), impurity, big) 820 | 821 | def size(self): 822 | """Constructs a TF graph for evaluating the current number of nodes. 823 | 824 | Returns: 825 | The current number of nodes in the tree. 826 | """ 827 | return self.variables.end_of_tree - 1 828 | 829 | def get_stats(self, session): 830 | num_nodes = self.variables.end_of_tree.eval(session=session) - 1 831 | num_leaves = array_ops.where( 832 | math_ops.equal(array_ops.squeeze(array_ops.slice( 833 | self.variables.tree, [0, 0], [-1, 1])), constants.LEAF_NODE) 834 | ).eval(session=session).shape[0] 835 | return TreeStats(num_nodes, num_leaves) 836 | -------------------------------------------------------------------------------- /rf/tensor_forest_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================== 15 | """Tests for tf.contrib.tensor_forest.ops.tensor_forest.""" 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | 20 | import tensorflow as tf 21 | 22 | from tensorflow.contrib.tensor_forest.python import tensor_forest 23 | 24 | from tensorflow.python.framework import test_util 25 | from tensorflow.python.platform import googletest 26 | 27 | 28 | class TensorForestTest(test_util.TensorFlowTestCase): 29 | 30 | def testForestHParams(self): 31 | hparams = tensor_forest.ForestHParams( 32 | num_classes=2, num_trees=100, max_nodes=1000, 33 | split_after_samples=25, num_features=60).fill() 34 | self.assertEquals(2, hparams.num_classes) 35 | self.assertEquals(3, hparams.num_output_columns) 36 | # sqrt(num_features) < 10, so num_splits_to_consider should be 10. 37 | self.assertEquals(10, hparams.num_splits_to_consider) 38 | # Don't have more fertile nodes than max # leaves, which is 500. 39 | self.assertEquals(500, hparams.max_fertile_nodes) 40 | # Default value of valid_leaf_threshold 41 | self.assertEquals(1, hparams.valid_leaf_threshold) 42 | # split_after_samples is larger than 10 43 | self.assertEquals(1, hparams.split_initializations_per_input) 44 | self.assertEquals(0, hparams.base_random_seed) 45 | 46 | def testForestHParamsBigTree(self): 47 | hparams = tensor_forest.ForestHParams( 48 | num_classes=2, num_trees=100, max_nodes=1000000, 49 | split_after_samples=25, 50 | num_features=1000).fill() 51 | # sqrt(1000) = 31.63... 52 | self.assertEquals(32, hparams.num_splits_to_consider) 53 | # 1000000 / 32 = 31250 54 | self.assertEquals(31250, hparams.max_fertile_nodes) 55 | # floor(31.63 / 25) = 1 56 | self.assertEquals(1, hparams.split_initializations_per_input) 57 | 58 | def testTrainingConstructionClassification(self): 59 | input_data = [[-1., 0.], [-1., 2.], # node 1 60 | [1., 0.], [1., -2.]] # node 2 61 | input_labels = [0, 1, 2, 3] 62 | 63 | params = tensor_forest.ForestHParams( 64 | num_classes=4, num_features=2, num_trees=10, max_nodes=1000, 65 | split_after_samples=25).fill() 66 | 67 | graph_builder = tensor_forest.RandomForestGraphs(params) 68 | graph = graph_builder.training_graph(input_data, input_labels) 69 | self.assertTrue(isinstance(graph, tf.Operation)) 70 | 71 | def testTrainingConstructionRegression(self): 72 | input_data = [[-1., 0.], [-1., 2.], # node 1 73 | [1., 0.], [1., -2.]] # node 2 74 | input_labels = [0, 1, 2, 3] 75 | 76 | params = tensor_forest.ForestHParams( 77 | num_classes=4, num_features=2, num_trees=10, max_nodes=1000, 78 | split_after_samples=25, regression=True).fill() 79 | 80 | graph_builder = tensor_forest.RandomForestGraphs(params) 81 | graph = graph_builder.training_graph(input_data, input_labels) 82 | self.assertTrue(isinstance(graph, tf.Operation)) 83 | 84 | def testInferenceConstruction(self): 85 | input_data = [[-1., 0.], [-1., 2.], # node 1 86 | [1., 0.], [1., -2.]] # node 2 87 | 88 | params = tensor_forest.ForestHParams( 89 | num_classes=4, num_features=2, num_trees=10, max_nodes=1000, 90 | split_after_samples=25).fill() 91 | 92 | graph_builder = tensor_forest.RandomForestGraphs(params) 93 | graph = graph_builder.inference_graph(input_data) 94 | self.assertTrue(isinstance(graph, tf.Tensor)) 95 | 96 | def testImpurityConstruction(self): 97 | params = tensor_forest.ForestHParams( 98 | num_classes=4, num_features=2, num_trees=10, max_nodes=1000, 99 | split_after_samples=25).fill() 100 | 101 
| graph_builder = tensor_forest.RandomForestGraphs(params) 102 | graph = graph_builder.average_impurity() 103 | self.assertTrue(isinstance(graph, tf.Tensor)) 104 | 105 | def testTrainingConstructionClassificationSparse(self): 106 | input_data = tf.SparseTensor( 107 | indices=[[0, 0], [0, 3], 108 | [1, 0], [1, 7], 109 | [2, 1], 110 | [3, 9]], 111 | values=[-1.0, 0.0, 112 | -1., 2., 113 | 1., 114 | -2.0], 115 | shape=[4, 10]) 116 | input_labels = [0, 1, 2, 3] 117 | 118 | params = tensor_forest.ForestHParams( 119 | num_classes=4, num_features=10, num_trees=10, max_nodes=1000, 120 | split_after_samples=25).fill() 121 | 122 | graph_builder = tensor_forest.RandomForestGraphs(params) 123 | graph = graph_builder.training_graph(input_data, input_labels) 124 | self.assertTrue(isinstance(graph, tf.Operation)) 125 | 126 | def testInferenceConstructionSparse(self): 127 | input_data = tf.SparseTensor( 128 | indices=[[0, 0], [0, 3], 129 | [1, 0], [1, 7], 130 | [2, 1], 131 | [3, 9]], 132 | values=[-1.0, 0.0, 133 | -1., 2., 134 | 1., 135 | -2.0], 136 | shape=[4, 10]) 137 | 138 | params = tensor_forest.ForestHParams( 139 | num_classes=4, num_features=10, num_trees=10, max_nodes=1000, 140 | split_after_samples=25).fill() 141 | 142 | graph_builder = tensor_forest.RandomForestGraphs(params) 143 | graph = graph_builder.inference_graph(input_data) 144 | self.assertTrue(isinstance(graph, tf.Tensor)) 145 | 146 | 147 | if __name__ == '__main__': 148 | googletest.main() 149 | -------------------------------------------------------------------------------- /tf/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lbkchen/deep-learning/ee2dee949d545d9b7cc1997998ee49e5d9bb2642/tf/__init__.py -------------------------------------------------------------------------------- /tf/mnist_sda.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example testing SDA model on MNIST digits. 
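
Pretrains a 784-500 autoencoder on MNIST batches, fine-tunes all layer
parameters together with a softmax output layer, writes the encoded test set
and its labels to csv, and then evaluates the tuned softmax parameters with
softmax.test_model.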
3 | """ 4 | 5 | from sdautoencoder import SDAutoencoder 6 | from softmax import test_model 7 | from tensorflow.examples.tutorials.mnist import input_data 8 | import tensorflow as tf 9 | 10 | mnist = input_data.read_data_sets("MNIST_data/", one_hot=True) 11 | 12 | 13 | def get_mnist_batch_generator(is_train, batch_size, batch_limit=100): 14 | if is_train: 15 | for _ in range(batch_limit): 16 | yield mnist.train.next_batch(batch_size) 17 | else: 18 | for _ in range(batch_limit): 19 | yield mnist.test.next_batch(batch_size) 20 | 21 | 22 | def get_mnist_batch_xs_generator(is_train, batch_size, batch_limit=100): 23 | for x, _ in get_mnist_batch_generator(is_train, batch_size, batch_limit): 24 | yield x 25 | 26 | 27 | def main(): 28 | sess = tf.Session() 29 | sda = SDAutoencoder(dims=[784, 500], 30 | activations=["sigmoid"], 31 | sess=sess, 32 | noise=0.40, 33 | loss="cross-entropy") 34 | 35 | mnist_train_gen_f = lambda: get_mnist_batch_xs_generator(True, batch_size=100, batch_limit=12000) 36 | 37 | sda.pretrain_network_gen(mnist_train_gen_f) 38 | trained_parameters = sda.finetune_parameters_gen(get_mnist_batch_generator(True, batch_size=100, batch_limit=18000), 39 | output_dim=10) 40 | transformed_filepath = "../data/mnist_test_transformed.csv" 41 | test_ys_filepath = "../data/mnist_test_ys.csv" 42 | output_filepath = "../data/mnist_pred_ys.csv" 43 | 44 | sda.write_encoded_input_with_ys(transformed_filepath, test_ys_filepath, 45 | get_mnist_batch_generator(False, batch_size=100, batch_limit=100)) 46 | sess.close() 47 | 48 | test_model(parameters_dict=trained_parameters, 49 | input_dim=sda.output_dim, 50 | output_dim=10, 51 | x_test_filepath=transformed_filepath, 52 | y_test_filepath=test_ys_filepath, 53 | output_filepath=output_filepath) 54 | 55 | if __name__ == "__main__": 56 | main() 57 | -------------------------------------------------------------------------------- /tf/sdautoencoder.py: -------------------------------------------------------------------------------- 1 | """Stacked Denoising Autoencoder Implementation""" 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | from math import sqrt 6 | from utils import * 7 | 8 | __author__ = "Ken Chen" 9 | __copyright__ = "Copyright (C) 2016 Ken Chen, HBI Solutions, Inc." 
10 | __version__ = "1.0" 11 | 12 | 13 | """ 14 | ########################### 15 | ### SETUP AND CONSTANTS ### 16 | ########################### 17 | """ 18 | 19 | 20 | ALLOWED_ACTIVATIONS = ["sigmoid", "tanh", "relu"] 21 | ALLOWED_LOSSES = ["rmse", "cross-entropy"] 22 | 23 | TENSORBOARD_LOGDIR = "../logs/tensorboard" 24 | TENSORBOARD_LOG_STEP = 100 25 | 26 | DEBUG = False 27 | 28 | 29 | """ 30 | ################### 31 | ### TENSORBOARD ### 32 | ################### 33 | """ 34 | 35 | 36 | def attach_variable_summaries(var, name, summ_list): 37 | """Attach statistical summaries to a tensor for tensorboard visualization.""" 38 | with tf.name_scope("summaries"): 39 | mean = tf.reduce_mean(var) 40 | summ_mean = tf.scalar_summary("mean/" + name, mean) 41 | with tf.name_scope('stddev'): 42 | stddev = tf.sqrt(tf.reduce_sum(tf.square(tf.sub(var, mean)))) 43 | summ_std = tf.scalar_summary('stddev/' + name, stddev) 44 | summ_max = tf.scalar_summary('max/' + name, tf.reduce_max(var)) 45 | summ_min = tf.scalar_summary('min/' + name, tf.reduce_min(var)) 46 | summ_hist = tf.histogram_summary(name, var) 47 | summ_list.extend([summ_mean, summ_std, summ_max, summ_min, summ_hist]) 48 | 49 | 50 | def attach_scalar_summary(var, name, summ_list): 51 | """Attach scalar summaries to a scalar.""" 52 | summ = tf.scalar_summary(tags=name, values=var) 53 | summ_list.append(summ) 54 | 55 | 56 | """ 57 | ############################ 58 | ### TENSORFLOW UTILITIES ### 59 | ############################ 60 | """ 61 | 62 | 63 | def weight_variable(input_dim, output_dim, name=None, stretch_factor=1, dtype=tf.float32): 64 | """Creates a weight variable with initial weights as recommended by Bengio. 65 | Reference: http://arxiv.org/pdf/1206.5533v2.pdf. If sigmoid is used as the activation 66 | function, then a stretch_factor of 4 is recommended.""" 67 | limit = sqrt(6 / (input_dim + output_dim)) 68 | initial = tf.random_uniform(shape=[input_dim, output_dim], 69 | minval=-(stretch_factor * limit), 70 | maxval=stretch_factor * limit, 71 | dtype=dtype) 72 | return tf.Variable(initial, name=name) 73 | 74 | 75 | def bias_variable(dim, initial_value=0.0, name=None, dtype=tf.float32): 76 | """Creates a bias variable with an initial constant value.""" 77 | return tf.Variable(tf.constant(value=initial_value, dtype=dtype, shape=[dim]), name=name) 78 | 79 | 80 | def corrupt(tensor, corruption_level=0.05): 81 | """Uses the masking noise algorithm to mask corruption_level proportion 82 | of the input. 83 | 84 | :param tensor: A tensor whose values are to be corrupted. 85 | :param corruption_level: An int [0, 1] specifying the probability to corrupt each value. 86 | :return: The corrupted tensor. 87 | """ 88 | total_samples = tf.reduce_prod(tf.shape(tensor)) 89 | corruption_matrix = tf.multinomial(tf.log([[corruption_level, 1 - corruption_level]]), total_samples) 90 | corruption_matrix = tf.cast(tf.reshape(corruption_matrix, shape=tf.shape(tensor)), dtype=tf.float32) 91 | return tf.mul(tensor, corruption_matrix) 92 | 93 | 94 | """ 95 | ############################ 96 | ### NEURAL NETWORK LAYER ### 97 | ############################ 98 | """ 99 | 100 | 101 | class NNLayer: 102 | """A container class to represent a hidden layer in the autoencoder network.""" 103 | 104 | def __init__(self, input_dim, output_dim, name="hidden_layer", activation=None, weights=None, biases=None): 105 | """Initializes an NNLayer with empty weights/biases (default). Weights/biases 106 | are meant to be updated during pre-training with set_wb. 
Also has methods to 107 | transform an input_tensor to an encoded representation via the weights/biases 108 | of the layer. 109 | 110 | :param input_dim: An int representing the dimension of input to this layer. 111 | :param output_dim: An int representing the dimension of the encoded output. 112 | :param activation: A function to transform the inputs to this layer (sigmoid, etc.). 113 | :param weights: A tensor with shape [input_dim, output_dim] 114 | :param biases: A tensor with shape [output_dim] 115 | """ 116 | self.input_dim = input_dim 117 | self.output_dim = output_dim 118 | self.name = name 119 | self.activation = activation 120 | self.weights = weights # Evaluated numpy array, static 121 | self.biases = biases # Evaluated numpy array, static 122 | self._weights = None # Weights Variable, dynamic 123 | self._biases = None # Biases Variable, dynamic 124 | 125 | @property 126 | def is_pretrained(self): 127 | return self.weights is not None and self.biases is not None 128 | 129 | def set_wb(self, weights, biases): 130 | """Used during pre-training for convenience.""" 131 | self.weights = weights # Evaluated numpy array 132 | self.biases = biases # Evaluated numpy array 133 | 134 | print("Set weights of layer with shape", weights.shape) 135 | print("Set biases of layer with shape", biases.shape) 136 | 137 | def set_wb_variables(self, summ_list): 138 | """This function is called at the beginning of supervised fine tuning to create new 139 | variables with initial values based on their static parameter counterparts. These 140 | variables can then all be adjusted simultaneously during the fine tune optimization.""" 141 | assert self.is_pretrained, "Cannot set Variables when not pretrained." 142 | with tf.name_scope(self.name): 143 | self._weights = tf.Variable(self.weights, dtype=tf.float32, name="weights") 144 | self._biases = tf.Variable(self.biases, dtype=tf.float32, name="biases") 145 | attach_variable_summaries(self._weights, name=self._weights.name, summ_list=summ_list) 146 | attach_variable_summaries(self._biases, name=self._biases.name, summ_list=summ_list) 147 | print("Created new weights and bias variables from current values.") 148 | 149 | def update_wb(self, sess): 150 | """This function is called at the end of supervised fine tuning to update the static 151 | weight and bias values to the newest snapshot of their dynamic variable counterparts.""" 152 | assert self._weights is not None and self._biases is not None, "Weights and biases Variables not set." 153 | self.weights = sess.run(self._weights) 154 | self.biases = sess.run(self._biases) 155 | print("Updated weights and biases with corresponding evaluated variable values.") 156 | 157 | def get_weight_variable(self): 158 | return self._weights 159 | 160 | def get_bias_variable(self): 161 | return self._biases 162 | 163 | def encode(self, input_tensor, use_variables=False): 164 | """Performs this layer's encoding on the input_tensor. use_variables is set to true 165 | during the fine-tuning stage, when all parameters of each layer need to be adjusted.""" 166 | assert self.is_pretrained, "Cannot encode when not pre-trained." 
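        # encode(x) = activation(x @ W + b). During fine-tuning
        # (use_variables=True) the tf.Variable copies are used so gradients can
        # update them; otherwise the static numpy weights/biases from
        # pretraining are used.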
167 | if use_variables: 168 | return self.activate(tf.matmul(input_tensor, self._weights) + self._biases) 169 | else: 170 | return self.activate(tf.matmul(input_tensor, self.weights) + self.biases) 171 | 172 | def activate(self, input_tensor, name=None): 173 | """Applies the activation function for this layer based on self.activation.""" 174 | if self.activation == "sigmoid": 175 | return tf.nn.sigmoid(input_tensor, name=name) 176 | if self.activation == "tanh": 177 | return tf.nn.tanh(input_tensor, name=name) 178 | if self.activation == "relu": 179 | return tf.nn.relu(input_tensor, name=name) 180 | else: 181 | print("Activation function not valid. Using the identity.") 182 | return input_tensor 183 | 184 | 185 | """ 186 | ##################################### 187 | ### STACKED DENOISING AUTOENCODER ### 188 | ##################################### 189 | """ 190 | 191 | 192 | class SDAutoencoder: 193 | """A stacked denoising autoencoder.""" 194 | 195 | def check_assertions(self): 196 | assert 0 <= self.noise <= 1, "Invalid noise value given: %s" % self.noise 197 | assert self.loss in ALLOWED_LOSSES 198 | 199 | def __init__(self, dims, activations, sess, noise=0.0, loss="cross-entropy", 200 | pretrain_lr=0.001, finetune_lr=0.001, batch_size=100, print_step=100): 201 | """Initializes a Stacked Denoising Autoencoder based on the dimension of each 202 | layer in the neural network and the activation function of each layer. SDA only 203 | undergoes parameter setup at initialization. Main functions to utilize the SDA are: 204 | 205 | pretrain_network: (unsupervised) Greedily pre-trains every layer of the neural network, 206 | beginning with feeding the raw data input to the first layer, and getting an encoded 207 | version from the output of the first layer. Adjusts parameters of the network (weights and 208 | biases of each layer) during training, via a stochastic Adam optimization method. 209 | 210 | finetune_parameters: (supervised) Adds a layer of fine-tuning to the network, adjusting 211 | the weights and biases of all layers simultaneously via a softmax classifier with test 212 | y-values. Also prints batch accuracy during each print step. 213 | 214 | write_encoded_input: Reads the x-values from a test data source and transforms them 215 | accordingly through the network (which has all parameters optimized from pre-training). 216 | Writes the newly represented features to a specified file. 217 | 218 | (Example usage) 219 | sda = SDAutoencoder([784, 400, 200, 10], ["relu", "relu", "relu"], noise=0.05) 220 | sda.pretrain_network(X_TRAIN_PATH) 221 | sda.finetune_parameters(X_TRAIN_PATH, Y_TRAIN_PATH) 222 | sda.write_encoded_input(your_filename, X_TEST_PATH) 223 | 224 | :param dims: A list of ints containing the dimensions of the x-values at each step of 225 | the network. The first entry is the overall input_dim, and the last entry is the 226 | overall output_dim from the network. 227 | :param activations: A list of activation functions for each layer in the network. 228 | :param sess: A tf.Session to be used by the autoencoder 229 | :param noise: A double from 0 to 1 representing the amount of masking on the input (noise). 230 | :param loss: A string representing the loss function used. 231 | :param pretrain_lr: A double representing the learning rate of the pretrain op. 232 | :param finetune_lr: A double representing the learning rate of the finetune op. 233 | :param batch_size: The number of cases fed to the network in each batch from file. 
234 | :param print_step: The number of batches processed before each print progress step. 235 | """ 236 | self.input_dim = dims[0] # The dimension of the raw input 237 | self.output_dim = dims[-1] # The output dimension of the last layer: fully encoded input 238 | self.hidden_layers = self.create_new_layers(dims, activations) 239 | self.sess = sess 240 | 241 | self.noise = noise 242 | self.loss = loss 243 | self.pretrain_lr = pretrain_lr 244 | self.finetune_lr = finetune_lr 245 | self.batch_size = batch_size 246 | self.print_step = print_step 247 | 248 | self.check_assertions() 249 | print("Initialized SDA network with dims %s, activations %s, noise %s, " 250 | "loss %s, pretraining learning rate %s, finetuning learning rate %s, and batch size %s." 251 | % (dims, activations, self.noise, self.loss, self.pretrain_lr, self.finetune_lr, self.batch_size)) 252 | 253 | @property 254 | def is_pretrained(self): 255 | """Returns whether the whole autoencoder network (all layers) is pre-trained.""" 256 | return all([layer.is_pretrained for layer in self.hidden_layers]) 257 | 258 | ########################## 259 | # VARIABLE CONFIGURATION # 260 | ########################## 261 | 262 | def get_all_variables(self, additional_vars=None): 263 | """Returns all trainable variables of the neural network.""" 264 | all_vars = [] 265 | for layer in self.hidden_layers: 266 | all_vars.extend([layer.get_weight_variable(), layer.get_bias_variable()]) 267 | if additional_vars: 268 | all_vars.extend(additional_vars) 269 | return all_vars 270 | 271 | def setup_all_variables(self, summ_list): 272 | """See NNLayer.set_wb_variables. Performs layer method on all hidden layers.""" 273 | for layer in self.hidden_layers: 274 | layer.set_wb_variables(summ_list) 275 | 276 | def finalize_all_variables(self): 277 | """See NNLayer.finalize_all_variables. Performs layer method on all hidden layers.""" 278 | for layer in self.hidden_layers: 279 | layer.update_wb(self.sess) 280 | 281 | def save_variables(self, filepath): 282 | """Saves all Tensorflow variables in the desired filepath.""" 283 | saver = tf.train.Saver() 284 | save_path = saver.save(self.sess, filepath) 285 | print("Model saved in file: %s" % save_path) 286 | 287 | ################ 288 | # WRITING DATA # 289 | ################ 290 | 291 | @staticmethod 292 | def write_data(data, filename): 293 | """Writes data in data_tensor and appends to the end of filename in csv format. 294 | 295 | :param data: A 2-dimensional numpy array. 296 | :param filename: A string representing the save filepath. 297 | :return: None 298 | """ 299 | with open(filename, "ab") as file: 300 | np.savetxt(file, data, delimiter=",") 301 | 302 | @stopwatch 303 | def write_encoded_input(self, filepath, x_test_path): 304 | """Reads from x_test_path and encodes the input through the entire model. Then 305 | writes the encoded result to filepath. Call this function after pretraining and 306 | fine-tuning to get the newly learned features. 307 | """ 308 | x_test = get_batch_generator(x_test_path, self.batch_size) 309 | self.write_encoded_input_gen(filepath, x_test_gen=x_test) 310 | 311 | @stopwatch 312 | def write_encoded_input_gen(self, filepath, x_test_gen): 313 | """Get encoded feature representation and writes to filepath. 314 | 315 | :param filepath: A string specifying the file path/name to write the encoded input to. 316 | :param x_test_gen: A generator that iterates through the x-test values. 
317 | :return: None 318 | """ 319 | sess = self.sess 320 | x_input = tf.placeholder(tf.float32, shape=[None, self.input_dim]) 321 | x_encoded = self.get_encoded_input(x_input, depth=-1, use_variables=False) 322 | 323 | print("Beginning to write to file.") 324 | for x_batch in x_test_gen: 325 | self.write_data(sess.run(x_encoded, feed_dict={x_input: x_batch}), filepath) 326 | print("Written encoded input to file %s" % filepath) 327 | 328 | def write_encoded_input_with_ys(self, filepath_x, filepath_y, xy_test_gen): 329 | """For use in testing MNIST. Writes the encoded x values along with their corresponding 330 | y values to file. 331 | 332 | :param filepath_x: A string, the filepath to store the encoded x values. 333 | :param filepath_y: A string, the filepath to store the y values. 334 | :param xy_test_gen: A generator that yields tuples of x and y test values. 335 | :return: None 336 | """ 337 | sess = self.sess 338 | x_input = tf.placeholder(tf.float32, shape=[None, self.input_dim]) 339 | x_encoded = self.get_encoded_input(x_input, depth=-1, use_variables=False) 340 | 341 | print("Beginning to write to file encoded x with ys.") 342 | for x_batch, y_batch in xy_test_gen: 343 | self.write_data(sess.run(x_encoded, feed_dict={x_input: x_batch}), filepath_x) 344 | self.write_data(y_batch, filepath_y) 345 | print("Written encoded input to file %s and test ys to %s" % (filepath_x, filepath_y)) 346 | 347 | ################### 348 | # GENERAL UTILITY # 349 | ################### 350 | 351 | def get_encoded_input(self, input_tensor, depth, use_variables=False): 352 | """Performs an encoding on input_tensor through the neural network depending on depth. 353 | If depth is 0, then input_tensor is simply returned. If depth is 3, then input_tensor 354 | will be encoded through the first three layers of the network. If depth is -1, then 355 | input_tensor will be encoded through the entire network. 356 | 357 | :param input_tensor: A tensor to encode. 358 | :param depth: The number of layers through which input_tensor will be encoded. If -1, 359 | then the full network encoding will be used. 360 | :param use_variables: A boolean representing whether to use tf.Variable representations 361 | of layer parameters. This is set to True only during the fine-tuning stage. 362 | :return: The encoded input_tensor. 363 | """ 364 | depth = len(self.hidden_layers) if depth == -1 else depth 365 | for i in range(depth): 366 | input_tensor = self.hidden_layers[i].encode(input_tensor, use_variables=use_variables) 367 | return input_tensor 368 | 369 | def get_loss(self, labels, values, epsilon=1e-10): 370 | """Returns the loss value between labels and values based on the method, either rmse 371 | or cross-entropy. 372 | 373 | Note: cross-entropy should only be used when the values are between 0 and 1.""" 374 | if self.loss == "rmse": 375 | return tf.sqrt(tf.reduce_mean(tf.square(tf.sub(labels, values)))) 376 | elif self.loss == "cross-entropy": 377 | return tf.reduce_mean(-tf.reduce_sum( 378 | labels * tf.log(values + epsilon) + (1 - labels) * tf.log(1 - values + epsilon), reduction_indices=[1] 379 | )) 380 | 381 | @staticmethod 382 | def create_new_layers(dims, activations): 383 | """Creates and sets up template layers (un-pretrained) for the network based on dimensions 384 | and activation functions. 385 | 386 | :param dims: Ex. [784, 200, 10] 387 | :param activations: Ex. 
['relu', 'relu'] 388 | :return: [NNLayer(input_dim=784, output_dim=200), NNLayer(input_dim=200, output_dim=10)] 389 | """ 390 | assert len(dims) >= 2 and len(activations) >= 1, "Invalid number of layers given by `dims` and `activations`." 391 | assert set(activations + ALLOWED_ACTIVATIONS) == set(ALLOWED_ACTIVATIONS), "Incorrect activation(s) given." 392 | assert len(dims) == len(activations) + 1, "Incorrect number of layers/activations." 393 | return [NNLayer(dims[i], dims[i + 1], "hidden_layer_" + str(i), activations[i]) 394 | for i in range(len(activations))] 395 | 396 | ############### 397 | # PRETRAINING # 398 | ############### 399 | 400 | @stopwatch 401 | def pretrain_network(self, x_train_path, epochs=1, batch_method="random"): 402 | """Pretrains the network using x-train values from a csv file. 403 | 404 | :param x_train_path: A string: the filepath to the train data. 405 | :param epochs: The number of epochs to iterate through the train data. 406 | :param batch_method: A string, either "random" or "sequential", indicating the method to 407 | use for batch generation (get_random_batch_generator vs. get_batch_generator). 408 | """ 409 | print("Starting to pretrain autoencoder network.") 410 | for i in range(len(self.hidden_layers)): 411 | if batch_method == "random": 412 | x_train = get_random_batch_generator(self.batch_size, x_train_path, repeat=epochs - 1) 413 | else: 414 | x_train = get_batch_generator(x_train_path, self.batch_size, repeat=epochs-1) 415 | self.pretrain_layer(i, x_train) 416 | print("Finished pretraining of autoencoder network.") 417 | 418 | @stopwatch 419 | def pretrain_network_gen(self, x_train_gen_f): 420 | """Pretrains the network with a generator supplying input. Use for testing MNIST. 421 | 422 | :param x_train_gen_f: A function that when called with no arguments returns a generator 423 | that iterates through the entire train dataset. 424 | :return: None 425 | """ 426 | print("Starting to pretrain autoencoder network.") 427 | for i in range(len(self.hidden_layers)): 428 | x_train_gen = x_train_gen_f() 429 | self.pretrain_layer(i, x_train_gen) 430 | print("Finished pretraining of autoencoder network.") 431 | 432 | def pretrain_layer(self, depth, batch_generator): 433 | """Pretrains the layer at depth `depth` feeding data from batch_generator. Do not call 434 | this method externally unless specific pretraining of a particular layer is required. 435 | Use `pretrain_network` instead.""" 436 | sess = self.sess 437 | 438 | print("Starting to pretrain layer %d." 
% depth) 439 | hidden_layer = self.hidden_layers[depth] 440 | summary_list = [] 441 | 442 | with tf.name_scope(hidden_layer.name): 443 | input_dim, output_dim = hidden_layer.input_dim, hidden_layer.output_dim 444 | 445 | with tf.name_scope("x_values"): 446 | x_original = tf.placeholder(tf.float32, shape=[None, self.input_dim]) 447 | x_latent = self.get_encoded_input(x_original, depth, use_variables=False) 448 | x_corrupt = corrupt(x_latent, corruption_level=self.noise) 449 | 450 | with tf.name_scope("encoding_vars"): 451 | stretch_factor = 4 if self.loss == "sigmoid" else 1 452 | encode = { 453 | "weights": weight_variable(input_dim, output_dim, name="weights", stretch_factor=stretch_factor), 454 | "biases": bias_variable(output_dim, initial_value=0, name="biases") 455 | } 456 | attach_variable_summaries(encode["weights"], encode["weights"].name, summ_list=summary_list) 457 | attach_variable_summaries(encode["biases"], encode["biases"].name, summ_list=summary_list) 458 | 459 | with tf.name_scope("decoding_vars"): 460 | decode = { 461 | "weights": tf.transpose(encode["weights"], name="transposed_weights"), # Tied weights 462 | "biases": bias_variable(input_dim, initial_value=0, name="decode_biases") 463 | } 464 | attach_variable_summaries(decode["weights"], decode["weights"].name, summ_list=summary_list) 465 | attach_variable_summaries(decode["biases"], decode["biases"].name, summ_list=summary_list) 466 | 467 | with tf.name_scope("encoded_and_decoded"): 468 | encoded = hidden_layer.activate(tf.matmul(x_corrupt, encode["weights"]) + encode["biases"]) 469 | decoded = hidden_layer.activate(tf.matmul(encoded, decode["weights"]) + decode["biases"]) 470 | attach_variable_summaries(encoded, "encoded", summ_list=summary_list) 471 | attach_variable_summaries(decoded, "decoded", summ_list=summary_list) 472 | 473 | # Reconstruction loss 474 | with tf.name_scope("reconstruction_loss"): 475 | loss = self.get_loss(x_latent, decoded) 476 | attach_scalar_summary(loss, "%s_loss" % self.loss, summ_list=summary_list) 477 | 478 | trainable_vars = [encode["weights"], encode["biases"], decode["biases"]] 479 | # Only optimize variables for this layer ("greedy") 480 | with tf.name_scope("train_step"): 481 | train_op = tf.train.AdamOptimizer(learning_rate=self.pretrain_lr).minimize( 482 | loss, var_list=trainable_vars) 483 | sess.run(tf.initialize_all_variables()) 484 | 485 | # Merge summaries and get a summary writer 486 | merged = tf.merge_summary(summary_list) 487 | pretrain_writer = tf.train.SummaryWriter(TENSORBOARD_LOGDIR + "/train/" + hidden_layer.name, sess.graph) 488 | 489 | step = 0 490 | for batch_x_original in batch_generator: 491 | sess.run(train_op, feed_dict={x_original: batch_x_original}) 492 | 493 | if step % self.print_step == 0: 494 | loss_value = sess.run(loss, feed_dict={x_original: batch_x_original}) 495 | print("Step %s, batch %s loss = %s" % (step, self.loss, loss_value)) 496 | 497 | if step % TENSORBOARD_LOG_STEP == 0: 498 | summary = sess.run(merged, feed_dict={x_original: batch_x_original}) 499 | pretrain_writer.add_summary(summary, global_step=step) 500 | 501 | # Break for debugging purposes 502 | if DEBUG and step > 5: 503 | break 504 | 505 | step += 1 506 | 507 | # Set the weights and biases of pretrained hidden layer 508 | hidden_layer.set_wb(weights=sess.run(encode["weights"]), biases=sess.run(encode["biases"])) 509 | print("Finished pretraining of layer %d. Updated layer weights and biases." 
% depth) 510 | 511 | ############## 512 | # FINETUNING # 513 | ############## 514 | 515 | @stopwatch 516 | def finetune_parameters(self, x_train_path, y_train_path, output_dim, epochs=1, batch_method="random"): 517 | """Performs fine tuning on all parameters of the neural network plus two additional softmax 518 | variables. Call this method after `pretrain_network` is complete. Y values should be represented 519 | in one-hot format. 520 | 521 | :param x_train_path: A string, the path to the x train values. 522 | :param y_train_path: A string, the path to the y train values. 523 | :param output_dim: An int, the number of classes in the target classification problem. Ex: 10 for MNIST. 524 | :param epochs: An int, the number of iterations to tune through the entire dataset. 525 | :param batch_method: A string, either 'random' or 'sequential', to indicate how batches are retrieved. 526 | :return: The tuned softmax parameters (weights and biases) of the classification layer. 527 | """ 528 | if batch_method == "random": 529 | xy_train = get_random_batch_generator(self.batch_size, x_train_path, y_train_path, repeat=epochs - 1) 530 | else: 531 | x_train = get_batch_generator(x_train_path, self.batch_size, repeat=epochs - 1) 532 | y_train = get_batch_generator(y_train_path, self.batch_size, repeat=epochs - 1) 533 | xy_train = merge_generators(x_train, y_train) 534 | return self.finetune_parameters_gen(xy_train_gen=xy_train, output_dim=output_dim) 535 | 536 | @stopwatch 537 | def finetune_parameters_gen(self, xy_train_gen, output_dim): 538 | """An implementation of finetuning to support data feeding from generators.""" 539 | sess = self.sess 540 | summary_list = [] 541 | 542 | print("Starting to fine tune parameters of network.") 543 | with tf.name_scope("finetuning"): 544 | self.setup_all_variables(summary_list) 545 | 546 | with tf.name_scope("inputs"): 547 | x = tf.placeholder(tf.float32, shape=[None, self.input_dim], name="raw_input") 548 | with tf.name_scope("fully_encoded"): 549 | x_encoded = self.get_encoded_input(x, depth=-1, use_variables=True) # Full depth encoding 550 | 551 | """Note on W below: The difference between self.output_dim and output_dim is that the former 552 | is the output dimension of the autoencoder stack, which is the dimension of the new feature 553 | space. The latter is the dimension of the y value space for classification. 
Ex: If the output 554 | should be binary, then the output_dim = 2.""" 555 | with tf.name_scope("softmax_variables"): 556 | W = weight_variable(self.output_dim, output_dim, name="weights") 557 | b = bias_variable(output_dim, initial_value=0, name="biases") 558 | attach_variable_summaries(W, W.name, summ_list=summary_list) 559 | attach_variable_summaries(b, b.name, summ_list=summary_list) 560 | 561 | with tf.name_scope("outputs"): 562 | y_logits = tf.matmul(x_encoded, W) + b 563 | with tf.name_scope("predicted"): 564 | y_pred = tf.nn.softmax(y_logits, name="y_pred") 565 | attach_variable_summaries(y_pred, y_pred.name, summ_list=summary_list) 566 | with tf.name_scope("actual"): 567 | y_actual = tf.placeholder(tf.float32, shape=[None, output_dim], name="y_actual") 568 | attach_variable_summaries(y_actual, y_actual.name, summ_list=summary_list) 569 | 570 | with tf.name_scope("cross_entropy"): 571 | cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_logits, y_actual)) 572 | attach_scalar_summary(cross_entropy, "cross_entropy", summ_list=summary_list) 573 | 574 | trainable_vars = self.get_all_variables(additional_vars=[W, b]) 575 | with tf.name_scope("train_step"): 576 | train_step = tf.train.AdamOptimizer(learning_rate=self.finetune_lr).minimize( 577 | cross_entropy, var_list=trainable_vars) 578 | 579 | with tf.name_scope("evaluation"): 580 | correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_actual, 1)) 581 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 582 | attach_scalar_summary(accuracy, "finetune_accuracy", summ_list=summary_list) 583 | 584 | sess.run(tf.initialize_all_variables()) 585 | 586 | # Merge summaries and get a summary writer 587 | merged = tf.merge_summary(summary_list) 588 | train_writer = tf.train.SummaryWriter(TENSORBOARD_LOGDIR + "/train/finetune", sess.graph) 589 | 590 | step = 0 591 | for batch_xs, batch_ys in xy_train_gen: 592 | if step % self.print_step == 0: 593 | print("Step %s, batch accuracy: " % step, 594 | sess.run(accuracy, feed_dict={x: batch_xs, y_actual: batch_ys})) 595 | 596 | # For debugging predicted y values 597 | if step % (self.print_step * 10) == 0: 598 | print("Predicted y-value:", sess.run(y_pred, feed_dict={x: batch_xs})[0]) 599 | print("Actual y-value:", batch_ys[0]) 600 | 601 | if step % TENSORBOARD_LOG_STEP == 0: 602 | summary = sess.run(merged, feed_dict={x: batch_xs, y_actual: batch_ys}) 603 | train_writer.add_summary(summary, global_step=step) 604 | 605 | # For debugging, break early. 
606 | if DEBUG and step > 5: 607 | break 608 | 609 | sess.run(train_step, feed_dict={x: batch_xs, y_actual: batch_ys}) 610 | step += 1 611 | 612 | self.finalize_all_variables() 613 | print("Completed fine-tuning of parameters.") 614 | tuned_params = {"weights": sess.run(W), "biases": sess.run(b)} 615 | 616 | return tuned_params 617 | -------------------------------------------------------------------------------- /tf/softmax.py: -------------------------------------------------------------------------------- 1 | from sdautoencoder import SDAutoencoder, get_batch_generator, merge_generators, stopwatch, DEBUG 2 | import tensorflow as tf 3 | import numpy as np 4 | 5 | 6 | # X_TRAIN_PATH = "../data/x_train_transformed_SAM_2.csv" 7 | # Y_TRAIN_PATH = "../data/splits/OPYTrainSAM.csv" 8 | # X_TEST_PATH = "../data/x_test_transformed_SAM_2.csv" 9 | # Y_TEST_PATH = "../data/splits/OPYTestSAM.csv" 10 | 11 | # NEED TO RENAME FOR EVERY TRIAL 12 | OUTPUT_PATH = "../data/ami/smote4k/outputs/pred_ys_8_10.csv" 13 | TRANSFORMED_PATH = "../data/ami/smote4k/outputs/x_test_transformed_8_10.csv" 14 | 15 | X_TRAIN_PATH = "../data/ami/smote4k/AMI_SAM_train_x.csv" 16 | Y_TRAIN_PATH = "../data/ami/smote4k/AMI_SAM_train_y.csv" 17 | X_TEST_PATH = "../data/ami/smote4k/AMI_SAM_test_x.csv" 18 | Y_TEST_PATH = "../data/ami/smote4k/AMI_SAM_test_y.csv" 19 | 20 | VARIABLE_SAVE_PATH = "../data/ami/smote4k/vars/last_vars.ckpt" 21 | 22 | 23 | def average(lst): 24 | return sum(lst) / len(lst) 25 | 26 | 27 | def append_with_limit(lst, val, limit=10): 28 | """Non-destructive function that returns a copy of the original list with the appended value and limit.""" 29 | lst_copy = lst[:] 30 | lst_copy.append(val) 31 | return lst_copy[-limit:] 32 | 33 | 34 | def write_data(data, filename): # FIXME: Copied from sda, should refactor to static 35 | """Writes data in data_tensor and appends to the end of filename in csv format. 36 | 37 | :param data: A 2-dimensional numpy array. 38 | :param filename: A string representing the save filepath. 
39 |     :return: None
40 |     """
41 |     with open(filename, "ab") as file:
42 |         np.savetxt(file, data, delimiter=",")
43 |
44 |
45 | # @stopwatch
46 | # def train_softmax(input_dim, output_dim, x_train_filepath, y_train_filepath, lr=0.001, batch_size=100,
47 | #                   print_step=50, epochs=1):
48 | #     """Trains a softmax model for prediction."""
49 | #     # Model input and parameters
50 | #     x = tf.placeholder(tf.float32, [None, input_dim])
51 | #     weights = tf.Variable(tf.truncated_normal(shape=[input_dim, output_dim], stddev=0.1))
52 | #     biases = tf.Variable(tf.constant(0.1, shape=[output_dim]))
53 | #
54 | #     # Outputs and true y-values
55 | #     y_logits = tf.matmul(x, weights) + biases
56 | #     y_pred = tf.nn.softmax(y_logits)
57 | #     y_actual = tf.placeholder(tf.float32, [None, output_dim])
58 | #
59 | #     # Cross entropy and training step
60 | #     cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_logits, labels=y_actual))
61 | #     train_step = tf.train.AdamOptimizer(learning_rate=lr).minimize(cross_entropy)
62 | #
63 | #     # Start session and run batches based on number of epochs
64 | #     sess = tf.Session()
65 | #     sess.run(tf.initialize_all_variables())
66 | #     x_train = get_batch_generator(filename=x_train_filepath, batch_size=batch_size,
67 | #                                   repeat=epochs - 1)
68 | #     y_train = get_batch_generator(filename=y_train_filepath, batch_size=batch_size,
69 | #                                   repeat=epochs - 1)
70 | #     step = 0
71 | #     accuracy_history = []
72 | #     for batch_xs, batch_ys in zip(x_train, y_train):
73 | #         sess.run(train_step, feed_dict={x: batch_xs, y_actual: batch_ys})
74 | #
75 | #         # Debug
76 | #         # if step == 100:
77 | #         #     break
78 | #
79 | #         # Assess training accuracy for current batch
80 | #         if step % print_step == 0:
81 | #             correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_actual, 1))
82 | #             accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
83 | #             accuracy_val = sess.run(accuracy, feed_dict={x: batch_xs, y_actual: batch_ys})
84 | #             print("Step %s, current batch training accuracy: %s" % (step, accuracy_val))
85 | #             accuracy_history = append_with_limit(accuracy_history, accuracy_val)
86 | #
87 | #         # Assess training accuracy for last 10 batches
88 | #         if step > 0 and step % (print_step * 10) == 0:
89 | #             print("Predicted y-values:\n", sess.run(y_pred, feed_dict={x: batch_xs}))
90 | #             print("Overall batch training accuracy for steps %s to %s: %s" % (step - 10 * print_step,
91 | #                                                                               step,
92 | #                                                                               average(accuracy_history)))
93 | #
94 | #         step += 1
95 | #
96 | #     parameters_dict = {
97 | #         "weights": sess.run(weights),
98 | #         "biases": sess.run(biases)
99 | #     }
100 | #     sess.close()
101 | #     return parameters_dict
102 |
103 |
104 | @stopwatch
105 | def test_model(parameters_dict, input_dim, output_dim, x_test_filepath, y_test_filepath, output_filepath,
106 |                batch_size=100, print_step=100):
107 |     x_test = get_batch_generator(filename=x_test_filepath, batch_size=batch_size)
108 |     y_test = get_batch_generator(filename=y_test_filepath, batch_size=batch_size)  # FIXME: Check if headers
109 |     xy_test_gen = merge_generators(x_test, y_test)
110 |     test_model_gen(parameters_dict, input_dim, output_dim, xy_test_gen, output_filepath, print_step)
111 |
112 |
113 | @stopwatch
114 | def test_model_gen(parameters_dict, input_dim, output_dim, xy_test_gen, output_filepath, print_step=100):
115 |     """Evaluates the softmax model defined by `parameters_dict` on batches of test data drawn
116 |     from `xy_test_gen`, appending the predicted y-values to `output_filepath` in csv format
117 |     and printing the running batch accuracy.
118 |
119 |     :param parameters_dict: Must contain keys 'weights' and 'biases' with their respective values
120 |     :param input_dim: An int, the dimension of each input x row; must match the shape of 'weights'.
121 |     :param output_dim: An int, the number of output classes.
122 |     :param xy_test_gen: A generator that yields (x_batch, y_batch) tuples of test data.
123 |     :param output_filepath: A string, the csv filepath to which predicted y-values are appended.
124 |     :param print_step: An int, the interval (in batches) at which batch accuracy is printed.
125 |     :return: None
126 |     """
127 |     # Model input and parameters
128 |     x = tf.placeholder(tf.float32, [None, input_dim])
129 |     weights = parameters_dict["weights"]
130 |     biases = parameters_dict["biases"]
131 |
132 |     # Outputs and true y-values
133 |     y_pred = tf.nn.softmax(tf.matmul(x, weights) + biases)
134 |     y_actual = tf.placeholder(tf.float32, [None, output_dim])
135 |
136 |     # Evaluate testing accuracy
137 |     correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_actual, 1))
138 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
139 |     sess = tf.Session()
140 |
141 |     step = 0
142 |     accuracy_history = []
143 |     for batch_xs, batch_ys in xy_test_gen:
144 |         write_data(data=sess.run(y_pred, feed_dict={x: batch_xs}), filename=output_filepath)
145 |
146 |         # Break early if debug
147 |         if DEBUG and step == 10:
148 |             break
149 |
150 |         accuracy_val = sess.run(accuracy, feed_dict={x: batch_xs, y_actual: batch_ys})
151 |         accuracy_history.append(accuracy_val)
152 |
153 |         if step % print_step == 0:
154 |             print("Step %s, current batch testing accuracy: %s" % (step, accuracy_val))
155 |             # print("Predicted y-values:\n", sess.run(y_pred, feed_dict={x: batch_xs}))
156 |
157 |         step += 1
158 |     sess.close()
159 |     print("Testing complete and written to %s, overall accuracy: %s" % (output_filepath, average(accuracy_history)))
160 |
161 |
162 | @stopwatch
163 | def unsupervised():
164 |     sess = tf.Session()
165 |     sda = SDAutoencoder(dims=[4000, 1000, 500, 200],
166 |                         activations=["sigmoid", "sigmoid", "sigmoid"],
167 |                         sess=sess,
168 |                         noise=0.05,
169 |                         loss="rmse",
170 |                         batch_size=100,
171 |                         print_step=50)
172 |
173 |     layer_1_weights_path = "../data/outputs/last_weights"
174 |     layer_1_biases_path = "../data/outputs/last_biases"
175 |
176 |     sda.pretrain_network(X_TRAIN_PATH, epochs=8)
177 |     sda.write_data(sda.hidden_layers[1].weights, layer_1_weights_path)
178 |     sda.write_data(sda.hidden_layers[1].biases, layer_1_biases_path)
179 |     sda.write_encoded_input(TRANSFORMED_PATH, X_TEST_PATH)
180 |     sda.save_variables(VARIABLE_SAVE_PATH)
181 |     sess.close()
182 |
183 |
184 | @stopwatch
185 | def full_test():
186 |     sess = tf.Session()
187 |     sda = SDAutoencoder(dims=[4000, 400, 400, 400],
188 |                         activations=["sigmoid", "sigmoid", "sigmoid"],
189 |                         sess=sess,
190 |                         noise=0.20,
191 |                         loss="cross-entropy",
192 |                         pretrain_lr=1e-6,
193 |                         finetune_lr=1e-5,
194 |                         batch_size=50,
195 |                         print_step=500)
196 |
197 |     sda.pretrain_network(X_TRAIN_PATH, epochs=50)
198 |     trained_parameters = sda.finetune_parameters(X_TRAIN_PATH, Y_TRAIN_PATH, output_dim=2, epochs=80)
199 |     sda.write_encoded_input(TRANSFORMED_PATH, X_TEST_PATH)
200 |     sda.save_variables(VARIABLE_SAVE_PATH)
201 |     sess.close()
202 |
203 |     test_model(parameters_dict=trained_parameters,
204 |                input_dim=sda.output_dim,
205 |                output_dim=2,
206 |                x_test_filepath=TRANSFORMED_PATH,
207 |                y_test_filepath=Y_TEST_PATH,
208 |                output_filepath=OUTPUT_PATH)
209 |
210 |
211 | @stopwatch
212 | def main():
213 |     full_test()
214 |
215 |
216 | if __name__ == "__main__":
217 |     main()
218 |
--------------------------------------------------------------------------------
/tf/utils.py:
--------------------------------------------------------------------------------
1 | """
2 | Utility functions for SDA
3 |
4 | Includes batch generation methods, and generator repeating/merging.
5 |
6 | Ken Chen
7 | """
8 |
9 | import random
10 | import csv
11 | import time
12 | from math import ceil
13 | from functools import wraps
14 |
15 |
16 | def stopwatch(f):
17 |     """Simple decorator that prints the execution time of a function."""
18 |
19 |     @wraps(f)
20 |     def wrapped(*args, **kwargs):
21 |         start_time = time.time()
22 |         result = f(*args, **kwargs)
23 |         elapsed_time = time.time() - start_time
24 |         print("Total seconds elapsed for execution of %s:" % f.__name__, elapsed_time)
25 |         return result
26 |
27 |     return wrapped
28 |
29 |
30 | def file_len(filename):
31 |     """Returns the number of lines in a file."""
32 |     i = 0
33 |     with open(filename) as f:
34 |         for i, line in enumerate(f):
35 |             pass
36 |     return i + 1
37 |
38 |
39 | def get_batch_generator(filename, batch_size, repeat=0):
40 |     """Generator that sequentially gets batches of batch_size x or y values
41 |     from the given file.
42 |
43 |     :param filename: A string, the path of the csv file to read batches from.
44 |     :param batch_size: An int, the number of lines to include in each batch.
45 |     :param repeat: An int specifying the number of times to repeat going through
46 |         the file. Repeat of 2 will return a generator that iterates through the
47 |         full file three times before stopping iteration.
48 |     :return: A generator.
49 |     """
50 |     assert repeat < 1000, "Recursion depth will be exceeded."
51 |     with open(filename, "rt") as file:
52 |         reader = csv.reader(file)
53 |
54 |         index = 0
55 |         this_batch = []
56 |         for row in reader:
57 |             this_batch.append(row)
58 |             index += 1
59 |
60 |             if index % batch_size == 0:
61 |                 yield this_batch
62 |                 this_batch = []
63 |
64 |         # Catch any remainders in current data set
65 |         if this_batch:
66 |             yield this_batch
67 |
68 |     print("Finished a batch iteration through %s" % filename)
69 |     if repeat > 0:
70 |         for item in get_batch_generator(filename, batch_size, repeat - 1):
71 |             yield item
72 |
73 |
74 | def get_random_batch_generator(batch_size, filename, paired_filename=None, repeat=0):
75 |     """Given a csv file `filename` and a specified batch_size, returns a generator that randomly
76 |     yields `batch_size` cases from the file at a time and repeats its entire set of rows for
77 |     `repeat` number of times.
78 |
79 |     Note: use only for smaller files, as this process will consume significant memory.
80 |
81 |     :param batch_size: An int, the number of lines to include in each batch.
82 |     :param filename: A string, the path to the file to be batched.
83 |     :param paired_filename: A string (optional), the path to another file to be batched together
84 |         with `filename`.
85 |     :param repeat: An int, the number of times to repeat batching of the entire dataset.
86 |     :return: If `paired_filename` is not None, returns a generator that yields corresponding tuples
87 |         of batches from both datasets. If `paired_filename` is None, returns a generator that yields
88 |         just batches from `filename`.
89 |     """
90 |     def batch_list(lst):
91 |         return [lst[j*batch_size:(j+1)*batch_size] for j in range(int(ceil(len(lst) / batch_size)))]
92 |
93 |     for _ in range(repeat + 1):
94 |         with open(filename, "rt") as file:
95 |             if paired_filename:
96 |                 with open(paired_filename, "rt") as paired:
97 |                     paired = list(zip(list(csv.reader(file)), list(csv.reader(paired))))
98 |                     random.shuffle(paired)
99 |                     lines_0, lines_1 = list(zip(*paired))
100 |                     lines_0, lines_1 = batch_list(lines_0), batch_list(lines_1)
101 |                     for batch_0, batch_1 in zip(lines_0, lines_1):
102 |                         yield batch_0, batch_1
103 |             else:
104 |                 lines = list(csv.reader(file))
105 |                 random.shuffle(lines)
106 |                 lines = batch_list(lines)
107 |                 for batch in lines:
108 |                     yield batch
109 |
110 |
111 | def repeat_generator(f_gen, multiple=2):
112 |     """Repeats a generator.
113 |
114 |     :param f_gen: A function that when called with no arguments returns a generator
115 |         to be repeated.
116 |     :param multiple: The number of times the generator should be iterated through.
117 |     :return: A generator that iterates through the original generator `multiple`
118 |         number of times.
119 |     """
120 |     for _ in range(multiple):
121 |         gen = f_gen()
122 |         for item in gen:
123 |             yield item
124 |
125 |
126 | def merge_generators(gen_1, gen_2):
127 |     """Returns a generator that yields combined tuples of the results of `gen_1` and `gen_2`."""
128 |     for x, y in zip(gen_1, gen_2):
129 |         yield x, y
130 |
--------------------------------------------------------------------------------
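A usage sketch for `get_random_batch_generator` in `tf/utils.py` (the csv paths below are hypothetical placeholders, not files shipped with this repository): when `paired_filename` is supplied, the x and y rows are shuffled together, so each yielded pair of batches stays row-aligned.

```python
# Sketch only: the example_train_*.csv paths are stand-ins for real data files.
from utils import get_random_batch_generator

xy_train_gen = get_random_batch_generator(batch_size=100,
                                          filename="../data/example_train_x.csv",
                                          paired_filename="../data/example_train_y.csv",
                                          repeat=1)  # shuffles and batches the full files twice

for batch_xs, batch_ys in xy_train_gen:
    # Each batch is a list of csv rows (lists of strings); x and y rows remain aligned.
    assert len(batch_xs) == len(batch_ys)
```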
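Likewise, a minimal sketch of how `get_batch_generator`, `repeat_generator`, and `merge_generators` compose into the kind of repeated (x, y) stream the fine-tuning and testing routines consume; note that `repeat_generator` expects a zero-argument function that builds a fresh generator, not a generator object. The file paths are again placeholders.

```python
from utils import get_batch_generator, merge_generators, repeat_generator


def fresh_x_gen():  # placeholder path
    return get_batch_generator("../data/example_train_x.csv", batch_size=50)


def fresh_y_gen():  # placeholder path
    return get_batch_generator("../data/example_train_y.csv", batch_size=50)


# Three sequential passes over each file, zipped into (x_batch, y_batch) tuples.
xy_gen = merge_generators(repeat_generator(fresh_x_gen, multiple=3),
                          repeat_generator(fresh_y_gen, multiple=3))
```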