├── .gitignore ├── LICENSE ├── README.md ├── preprocess ├── ami_preprocess_and_split.R ├── ed_preprocess_and_split.R ├── ip_preprocess_and_split.R ├── preprocess_and_split.R ├── reduce_columns.R └── roc.R ├── rf ├── rf2.py ├── rf3.py ├── tensor_forest.py └── tensor_forest_test.py └── tf ├── __init__.py ├── mnist_sda.py ├── sdautoencoder.py ├── softmax.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled Lua sources 2 | luac.out 3 | 4 | # luarocks build files 5 | *.src.rock 6 | *.zip 7 | *.tar.gz 8 | 9 | # Object files 10 | *.o 11 | *.os 12 | *.ko 13 | *.obj 14 | *.elf 15 | 16 | # Precompiled Headers 17 | *.gch 18 | *.pch 19 | 20 | # Libraries 21 | *.lib 22 | *.a 23 | *.la 24 | *.lo 25 | *.def 26 | *.exp 27 | 28 | # Shared objects (inc. Windows DLLs) 29 | *.dll 30 | *.so 31 | *.so.* 32 | *.dylib 33 | 34 | # Executables 35 | *.exe 36 | *.out 37 | *.app 38 | *.i*86 39 | *.x86_64 40 | *.hex 41 | 42 | # Exclude Data 43 | .RData 44 | .Rhistory 45 | *.Rout 46 | data 47 | logs 48 | 49 | # Exclude old stuff 50 | old 51 | misc 52 | 53 | # Exclude training code 54 | training 55 | 56 | # Exclude MNIST stuff 57 | MNIST_data 58 | run_data 59 | 60 | # Exclude PyCharm 61 | .idea 62 | 63 | # Exclude Python stuff 64 | __pycache__ 65 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Ken Chen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # deep-learning 2 | Deep learning project in TensorFlow and Torch to analyze clinical health records and construct deep learning models to predict future patient complications. 3 | 4 | ## Background 5 | This project uses **Stacked Denoising Autoencoders (SDA)** [[P. Vincent]](http://jmlr.csail.mit.edu/papers/volume11/vincent10a/vincent10a.pdf) to perform feature learning on a given dataset. Two overall steps are necessary for fully configuring the network to encode the input data: **pre-training**, and **fine-tuning**. 
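Concretely, the pre-training step (described in the next paragraph) trains each layer as a denoising autoencoder in the sense of the referenced paper. As a rough sketch, with illustrative notation: an input $x$ is corrupted to $\tilde{x}$, encoded to a hidden representation $h$, decoded back to a reconstruction $\hat{x}$, and the layer's parameters are chosen to make $\hat{x}$ close to the clean $x$:

```latex
\tilde{x} \sim q_{\text{noise}}(\tilde{x} \mid x), \qquad
h = s(W\tilde{x} + b), \qquad
\hat{x} = s(W'h + b'), \qquad
\theta^{*} = \arg\min_{\theta = \{W,\, b,\, W',\, b'\}} L(x, \hat{x})
```

Here $s$ is the layer's activation (e.g. sigmoid or tanh) and $L$ is the reconstruction loss (RMSE or cross-entropy in this implementation).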
6 | 7 | During unsupervised pre-training, parameters in the neural network are learned greedily, layer by layer, by minimizing the reconstruction loss between each input and its decoded counterpart. A supervised softmax classifier on top of the network then fine-tunes all parameters of the network (the weights and biases of each autoencoder layer plus the softmax weights and biases). 8 | 9 | Once the network is configured, the input data can be read into the model and encoded into a new representation determined by the user's chosen parameters (layer dimensions, activations, noise level, etc.). For example, this technique can transform a sparse 30000-dimensional feature space into a dense 400-dimensional feature space to improve subsequent training performance. 10 | 11 | ## Usage 12 | The current working source code is located in `tf/sdautoencoder.py`. It currently reads train/test data from CSV files in batches. The following three datasets must be present for the SDA to output newly learned features: 13 | - X training values 14 | - Y training values 15 | - X testing values 16 | 17 | An additional dataset is needed if the output of SDA encoding is directly used for classification via the provided softmax classifier: 18 | - Y testing values 19 | 20 | 21 | In the future, a version of the program will be optimized to run on a multi-GPU (4-GPU) system. 22 | 23 | ```python 24 | # Start a TensorFlow session 25 | sess = tf.Session() 26 | 27 | # Initialize an unconfigured autoencoder with specified dimensions, etc. 28 | sda = SDAutoencoder(dims=[784, 256, 64, 32], 29 | activations=["sigmoid", "tanh", "sigmoid"], 30 | sess=sess, 31 | noise=0.1, 32 | loss="rmse") 33 | 34 | # Pretrain weights and biases of each layer in the network. 35 | sda.pretrain_network(X_TRAIN_PATH) 36 | 37 | # Fine-tune all parameters with the softmax classifier on the training labels. 38 | sda.finetune_parameters(X_TRAIN_PATH, Y_TRAIN_PATH, output_dim=10) 39 | 40 | # Write the newly learned feature representations to file. 41 | sda.write_encoded_input("../data/transformed.csv", X_TEST_PATH) 42 | ``` 43 | 44 | For an example of how training is performed and subsequent accuracy is evaluated, a basic procedure is implemented on the MNIST data set in `tf/mnist_sda.py`. 45 | 46 | ## Performance 47 | Testing on the MNIST data set, the softmax classifier trained on features extracted by the SDA achieves approximately **98.3%** accuracy in identifying the digits. To achieve this result, the model in `tf/mnist_sda.py` is set up with the following parameters (not necessarily optimal), using 500000 data points for layer-wise pretraining and 3000000 data points for fine-tuning: 48 | 49 | ```python 50 | sda = SDAutoencoder(dims=[784, 400, 200, 80], 51 | activations=["sigmoid", "sigmoid", "sigmoid"], 52 | sess=sess, 53 | noise=0.20, 54 | loss="cross-entropy", 55 | pretrain_lr=0.0001, 56 | finetune_lr=0.0001) 57 | ``` 58 | Total execution time for feature learning, training, and evaluation was just under 9 minutes on a 1.3 GHz MacBook Air (under a minute on a GPU machine with one GTX 1080). This result improves upon the 92% benchmark achieved by a [simple softmax classifier](https://www.tensorflow.org/versions/r0.9/tutorials/mnist/beginners/index.html#mnist-for-ml-beginners) without feature learning. It is also comparable to some simple 2D convolutional network models, which are designed to take advantage of the 2D structure of image data.
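For reference, here is a minimal end-to-end sketch of the MNIST experiment that combines the configuration above with the usage pattern from the previous section. The import path, CSV file paths, and output filename are illustrative placeholders, not the exact values used in the repository; the actual procedure, including MNIST loading and accuracy evaluation, lives in `tf/mnist_sda.py`.

```python
import tensorflow as tf

# Assumes the SDAutoencoder class from tf/sdautoencoder.py is importable.
from sdautoencoder import SDAutoencoder

# Placeholder CSV paths; substitute the real exported MNIST files.
X_TRAIN_PATH = "../data/mnist_train_x.csv"
Y_TRAIN_PATH = "../data/mnist_train_y.csv"
X_TEST_PATH = "../data/mnist_test_x.csv"

sess = tf.Session()

# Same hyperparameters as reported above: 784 -> 400 -> 200 -> 80.
sda = SDAutoencoder(dims=[784, 400, 200, 80],
                    activations=["sigmoid", "sigmoid", "sigmoid"],
                    sess=sess,
                    noise=0.20,
                    loss="cross-entropy",
                    pretrain_lr=0.0001,
                    finetune_lr=0.0001)

# Greedy layer-wise pre-training on the (unlabeled) training inputs.
sda.pretrain_network(X_TRAIN_PATH)

# Supervised fine-tuning with the softmax classifier (10 digit classes).
sda.finetune_parameters(X_TRAIN_PATH, Y_TRAIN_PATH, output_dim=10)

# Write the 80-dimensional encodings of the test inputs to disk.
sda.write_encoded_input("../data/mnist_encoded_test_x.csv", X_TEST_PATH)
```

Scoring the encoded test features against the test labels then follows the evaluation procedure in `tf/mnist_sda.py`.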
59 | 60 | In the future, we plan to do additional testing to optimize the model's hyperparameters and to speed up various parts of the execution. 61 | 62 | ## Current status 63 | - (Done) SDA implemented in `tf/sdautoencoder.py` in TensorFlow. 64 | - (Done) Implement softmax classifier. 65 | - (To do) Implement command-line execution of the program. 66 | - (WIP) Testing for any silent bugs. 67 | - (To do) Enable multi-GPU support in the architecture. 68 | - (WIP) Add compatibility for other data-loading methods. 69 | - (To do) Add pre-processing methods in TF. 70 | - (WIP) More documentation. 71 | -------------------------------------------------------------------------------- /preprocess/ami_preprocess_and_split.R: -------------------------------------------------------------------------------- 1 | library(data.table) # Must have data.table v1.9.7+ 2 | library(readr) 3 | library(DMwR) 4 | #library(ROSE) 5 | 6 | # Usage (must be run from command line) 7 | # Rscript ami_preprocess_and_split.R <path to SAM csv> <base name for output files> 8 | # Program will print steps of execution and write 5 different files to disk: 9 | # - Train x (saved as base_name_train_x.csv) 10 | # - Train y (saved as base_name_train_y.csv as one-hot vectors) 11 | # - Test x (saved as base_name_test_x.csv) 12 | # - Test y (saved as base_name_test_y.csv as one-hot vectors) 13 | # - Test ids (saved as base_name_test_ids.csv) (patient ids in order of all the test cases) 14 | 15 | # Parse command line arguments 16 | args <- commandArgs(trailingOnly = TRUE) 17 | path_sam <- args[1] 18 | base_name <- args[2] 19 | 20 | # Read in raw file: the SAM table 21 | print(paste("Reading", path_sam)) 22 | Sam <- fread(path_sam, header = T) 23 | print("Done reading files.") 24 | 25 | # Reset headers of data tables to get rid of BOM in case it's there 26 | # http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r 27 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding = "UTF-8-BOM")) 28 | 29 | names(Sam) <- Sam.names 30 | print("Removed BOM from text") 31 | 32 | # Pre-processing functions 33 | is.zero <- function(v) { 34 | return(v==0) 35 | } 36 | 37 | unitScale <- function(v) { 38 | if (is.factor(v)) { 39 | return(v) 40 | } 41 | range <- max(v) - min(v) 42 | if (range == 0) { 43 | return(0) 44 | } 45 | return((v - min(v)) / range) 46 | } 47 | 48 | print(str(Sam)) 49 | 50 | # Test min value of Sam 51 | # Sam.maxs <- Sam[, lapply(.SD, max)] 52 | # print(str(Sam.maxs)) 53 | # print(sum(Sam.maxs==0)) 54 | 55 | # Subcohort for AMI: age 35+ includes 95%? of cases, 60%?
of data set 56 | Sam <- Sam[Age >= 35] 57 | print("Subcohort str") 58 | print(str(Sam)) 59 | 60 | # Change y values of IP/ED to 1/0 depending on return or not (binarize) 61 | Sam$AMI1Y_YTD <- ifelse(Sam$AMI1Y_YTD > 0, 1, 0) 62 | 63 | # Change all necessary columns to factors to prevent scaling and 64 | # to assure SMOTE works 65 | Sam$StatePatientID <- as.factor(Sam$StatePatientID) 66 | Sam$AMI1Y_YTD <- as.factor(Sam$AMI1Y_YTD) 67 | 68 | # Scale all columns of Sam 69 | print("Starting to scale table.") 70 | Sam <- Sam[, lapply(.SD, unitScale)] 71 | print("Completed scaling of columns.") 72 | 73 | # Split into train and test 2500 74 | print("Starting to split into train and test sets.") 75 | prop_in_train <- 0.90 76 | cases <- which(Sam$AMI1Y_YTD == 1) 77 | controls <- which(Sam$AMI1Y_YTD == 0) 78 | train_cases <- sample(cases, floor(length(cases) * prop_in_train)) 79 | train_controls <- sample(controls, floor(length(controls) * prop_in_train)) 80 | test_cases <- setdiff(cases, train_cases) 81 | test_controls <- setdiff(controls, train_controls) 82 | print("Total cases:") 83 | print(sum(Sam$AMI1Y_YTD == 1)) 84 | print(str(cases)) 85 | print(str(controls)) 86 | print(str(train_cases)) 87 | print(str(train_controls)) 88 | print(str(test_cases)) 89 | print(str(test_controls)) 90 | 91 | print(length(train_cases)) 92 | print(length(test_cases)) 93 | 94 | Sam.train <- Sam[c(train_cases, train_controls)] 95 | Sam.test <- Sam[c(test_cases, test_controls)] 96 | 97 | rm(Sam) 98 | print("Finished splitting into train and test sets.") 99 | 100 | # SMOTE algorithm for balancing training data by interpolated over/undersampling 101 | #Smote parameters 102 | print("Beginning to apply SMOTE algorithm.") 103 | percent_to_oversample <- 600 104 | percent_ratio_major_to_minor <- 200 105 | Sam.train <- SMOTE(AMI1Y_YTD ~ . -StatePatientID, data = Sam.train, 106 | perc.over = percent_to_oversample, perc.under = percent_ratio_major_to_minor) 107 | print("Finished applying SMOTE algorithm.") 108 | 109 | # ROSE algorithm for balancing training data by over/undersampling 110 | #print("Beginning to apply ROSE algorithm.") 111 | #result_sample_size <- 100000 112 | #rare_proportion <- 0.5 113 | # Sam.train.without_factors <- Sam.train[, !c("StatePatientID", "ED_YTM"), with = FALSE] 114 | # Sam.train.factors <- Sam.train[, c("StatePatientID", "ED_YTM"), with = FALSE] 115 | #Sam.train <- ovun.sample(AMI1Y_YTD ~ . 
-StatePatientID, data = Sam.train, 116 | # method = "both", N = result_sample_size, p = rare_proportion)$data 117 | #Sam.train <- data.table(Sam.train) 118 | #print("Finished applying ROSE algorithm.") 119 | 120 | # Shuffle train data to homogenize 0/1 y values 121 | print("Begin shuffle.") 122 | Sam.train <- Sam.train[sample(nrow(Sam.train)),] 123 | print("Finished shuffle.") 124 | 125 | # Split into train.x, train.y, test.x, test.y 126 | print("Begin split into x/y.") 127 | Sam.train.x <- Sam.train[, !c("StatePatientID", "AMI1Y_YTD"), with = FALSE] 128 | Sam.train.y <- Sam.train[, c("AMI1Y_YTD"), with = FALSE] 129 | rm(Sam.train) 130 | Sam.test.x <- Sam.test[, !c("StatePatientID", "AMI1Y_YTD"), with = FALSE] 131 | Sam.test.y <- Sam.test[, c("AMI1Y_YTD"), with = FALSE] 132 | Sam.test.ids <- Sam.test[, c("StatePatientID"), with = FALSE] 133 | rm(Sam.test) 134 | print("Finished split into x/y.") 135 | 136 | # Change y to one-hot 137 | Sam.train.y[, zero := ifelse(AMI1Y_YTD == 0, 1, 0)] 138 | Sam.train.y[, one := AMI1Y_YTD] 139 | Sam.train.y[, AMI1Y_YTD := NULL] 140 | Sam.test.y[, zero := ifelse(AMI1Y_YTD == 0, 1, 0)] 141 | Sam.test.y[, one := AMI1Y_YTD] 142 | Sam.test.y[, AMI1Y_YTD := NULL] 143 | 144 | # Write all splits to file 145 | print("Begin write to file.") 146 | base_name <- ifelse(is.na(base_name), "SAMFull", base_name) 147 | fwrite(Sam.train.x, paste0(base_name, "_train_x", ".csv"), col.names = FALSE) 148 | fwrite(Sam.train.y, paste0(base_name, "_train_y", ".csv"), col.names = FALSE) 149 | fwrite(Sam.test.x, paste0(base_name, "_test_x", ".csv"), col.names = FALSE) 150 | fwrite(Sam.test.y, paste0(base_name, "_test_y", ".csv"), col.names = FALSE) 151 | fwrite(Sam.test.ids, paste0(base_name, "_test_ids", ".csv")) 152 | print("Finished write to file.") 153 | 154 | # Remove all columns with all zero entries 155 | # Sam <- Sam[,which(unlist(lapply(Sam, function(x)!all(is.zero(x))))),with=F] 156 | # print(str(Sam)) 157 | -------------------------------------------------------------------------------- /preprocess/ed_preprocess_and_split.R: -------------------------------------------------------------------------------- 1 | library(data.table) # Must have data.table v1.9.7+ 2 | library(readr) 3 | library(DMwR) 4 | library(ROSE) 5 | 6 | # Usage (must be run from command line) 7 | # Rscript 8 | # Program will print steps of execution and write 5 different files to disk: 9 | # - Train x (saved as base_name_train_x.csv) 10 | # - Train y (saved as base_name_train_y.csv as one-hot vectors) 11 | # - Test x (saved as base_name_test_x.csv) 12 | # - Test y (saved as base_name_test_y.csv as one-hot vectors) 13 | # - Test ids (saved as base_name_test_ids.csv) (patient ids in order of all the test cases) 14 | 15 | # Parse command line arguments 16 | args <- commandArgs(trailingOnly = TRUE) 17 | path_sam <- args[1] 18 | path_train_ids <- args[2] 19 | path_test_ids <- args[3] 20 | base_name <- args[4] 21 | 22 | # Read in raw files: SAM table, train case ids, and test case ids 23 | print(paste("Reading", path_sam)) 24 | Sam <- fread(path_sam, header = T) 25 | print(paste("Reading", path_train_ids)) 26 | Train_ids <- fread(path_train_ids, header = T) 27 | print(paste("Reading", path_test_ids)) 28 | Test_ids <- fread(path_test_ids, header = T) 29 | 30 | print("Done reading files.") 31 | 32 | # Reset headers of data tables to get rid of BOM in case it's there 33 | # http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r 34 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding 
= "UTF-8-BOM")) 35 | Train_ids.names <- names(read.csv(path_train_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 36 | Test_ids.names <- names(read.csv(path_test_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 37 | 38 | names(Sam) <- Sam.names 39 | names(Train_ids) <- Train_ids.names 40 | names(Test_ids) <- Test_ids.names 41 | 42 | print("Removed BOM from text") 43 | 44 | # Pre-processing functions 45 | is.zero <- function(v) { 46 | return(v==0) 47 | } 48 | 49 | unitScale <- function(v) { 50 | if (is.factor(v)) { 51 | return(v) 52 | } 53 | range <- max(v) - min(v) 54 | if (range == 0) { 55 | return(0) 56 | } 57 | return((v - min(v)) / range) 58 | } 59 | 60 | print(str(Sam)) 61 | 62 | # Test min value of Sam 63 | # Sam.maxs <- Sam[, lapply(.SD, max)] 64 | # print(str(Sam.maxs)) 65 | # print(sum(Sam.maxs==0)) 66 | 67 | # Change y values of IP/ED to 1/0 depending on return or not (binarize) 68 | Sam$ED_YTM <- ifelse(Sam$ED_YTM > 0, 1, 0) 69 | Sam$IP_YTM <- ifelse(Sam$IP_YTM > 0, 1, 0) 70 | 71 | # Change all necessary columns to factors to prevent scaling and 72 | # to assure SMOTE works 73 | Sam$StatePatientID <- as.factor(Sam$StatePatientID) 74 | Sam$ED_YTM <- as.factor(Sam$ED_YTM) 75 | Sam$IP_YTM <- as.factor(Sam$IP_YTM) 76 | 77 | # Scale all columns of Sam 78 | print("Starting to scale table.") 79 | Sam <- Sam[, lapply(.SD, unitScale)] 80 | print("Completed scaling of columns.") 81 | 82 | # Split into train and test 83 | print("Starting to split into train and test sets.") 84 | Sam.train <- Sam[StatePatientID %in% Train_ids[[1]]] 85 | Sam.test <- Sam[StatePatientID %in% Test_ids[[1]]] 86 | rm(Sam) 87 | print("Finished splitting into train and test sets.") 88 | 89 | # SMOTE algorithm for balancing training data by interpolated over/undersampling 90 | # Smote parameters 91 | # print("Beginning to apply SMOTE algorithm.") 92 | # percent_to_oversample <- 500 93 | # percent_ratio_major_to_minor <- 100 94 | # Sam.train <- SMOTE(IP_YTM ~ . -StatePatientID -ED_YTM, data = Sam.train, 95 | # perc.over = percent_to_oversample, perc.under = percent_ratio_major_to_minor) 96 | # print("Finished applying SMOTE algorithm.") 97 | 98 | # ROSE algorithm for balancing training data by over/undersampling 99 | print("Beginning to apply ROSE algorithm.") 100 | result_sample_size <- 300000 101 | rare_proportion <- 0.5 102 | # Sam.train.without_factors <- Sam.train[, !c("StatePatientID", "ED_YTM"), with = FALSE] 103 | # Sam.train.factors <- Sam.train[, c("StatePatientID", "ED_YTM"), with = FALSE] 104 | Sam.train <- ovun.sample(ED_YTM ~ . 
-StatePatientID -IP_YTM, data = Sam.train, 105 | method = "both", N = result_sample_size, p = rare_proportion)$data 106 | Sam.train <- data.table(Sam.train) 107 | print("Finished applying ROSE algorithm.") 108 | 109 | # Shuffle train data to homogenize 0/1 y values 110 | print("Begin shuffle.") 111 | Sam.train <- Sam.train[sample(nrow(Sam.train)),] 112 | print("Finished shuffle.") 113 | 114 | # Split into train.x, train.y, test.x, test.y 115 | print("Begin split into x/y.") 116 | Sam.train.x <- Sam.train[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 117 | Sam.train.y <- Sam.train[, c("ED_YTM"), with = FALSE] 118 | rm(Sam.train) 119 | Sam.test.x <- Sam.test[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 120 | Sam.test.y <- Sam.test[, c("ED_YTM"), with = FALSE] 121 | Sam.test.ids <- Sam.test[, c("StatePatientID"), with = FALSE] 122 | rm(Sam.test) 123 | print("Finished split into x/y.") 124 | 125 | # Change y to one-hot 126 | Sam.train.y[, zero := ifelse(ED_YTM == 0, 1, 0)] 127 | Sam.train.y[, one := ED_YTM] 128 | Sam.train.y[, ED_YTM := NULL] 129 | Sam.test.y[, zero := ifelse(ED_YTM == 0, 1, 0)] 130 | Sam.test.y[, one := ED_YTM] 131 | Sam.test.y[, ED_YTM := NULL] 132 | 133 | # Write all splits to file 134 | print("Begin write to file.") 135 | base_name <- ifelse(is.na(base_name), "SAMFull", base_name) 136 | fwrite(Sam.train.x, paste0(base_name, "_train_x", ".csv"), col.names = FALSE) 137 | fwrite(Sam.train.y, paste0(base_name, "_train_y", ".csv"), col.names = FALSE) 138 | fwrite(Sam.test.x, paste0(base_name, "_test_x", ".csv"), col.names = FALSE) 139 | fwrite(Sam.test.y, paste0(base_name, "_test_y", ".csv"), col.names = FALSE) 140 | fwrite(Sam.test.ids, paste0(base_name, "_test_ids", ".csv")) 141 | print("Finished write to file.") 142 | 143 | # Remove all columns with all zero entries 144 | # Sam <- Sam[,which(unlist(lapply(Sam, function(x)!all(is.zero(x))))),with=F] 145 | # print(str(Sam)) 146 | -------------------------------------------------------------------------------- /preprocess/ip_preprocess_and_split.R: -------------------------------------------------------------------------------- 1 | library(data.table) # Must have data.table v1.9.7+ 2 | library(readr) 3 | library(DMwR) 4 | library(ROSE) 5 | 6 | # Usage (must be run from command line) 7 | # Rscript 8 | # Program will print steps of execution and write 5 different files to disk: 9 | # - Train x (saved as base_name_train_x.csv) 10 | # - Train y (saved as base_name_train_y.csv as one-hot vectors) 11 | # - Test x (saved as base_name_test_x.csv) 12 | # - Test y (saved as base_name_test_y.csv as one-hot vectors) 13 | # - Test ids (saved as base_name_test_ids.csv) (patient ids in order of all the test cases) 14 | 15 | # Parse command line arguments 16 | args <- commandArgs(trailingOnly = TRUE) 17 | path_sam <- args[1] 18 | path_train_ids <- args[2] 19 | path_test_ids <- args[3] 20 | base_name <- args[4] 21 | 22 | # Read in raw files: SAM table, train case ids, and test case ids 23 | print(paste("Reading", path_sam)) 24 | Sam <- fread(path_sam, header = T) 25 | print(paste("Reading", path_train_ids)) 26 | Train_ids <- fread(path_train_ids, header = T) 27 | print(paste("Reading", path_test_ids)) 28 | Test_ids <- fread(path_test_ids, header = T) 29 | 30 | print("Done reading files.") 31 | 32 | # Reset headers of data tables to get rid of BOM in case it's there 33 | # http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r 34 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding = 
"UTF-8-BOM")) 35 | Train_ids.names <- names(read.csv(path_train_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 36 | Test_ids.names <- names(read.csv(path_test_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 37 | 38 | names(Sam) <- Sam.names 39 | names(Train_ids) <- Train_ids.names 40 | names(Test_ids) <- Test_ids.names 41 | 42 | print("Removed BOM from text") 43 | 44 | # Pre-processing functions 45 | is.zero <- function(v) { 46 | return(v==0) 47 | } 48 | 49 | unitScale <- function(v) { 50 | if (is.factor(v)) { 51 | return(v) 52 | } 53 | range <- max(v) - min(v) 54 | if (range == 0) { 55 | return(0) 56 | } 57 | return((v - min(v)) / range) 58 | } 59 | 60 | print(str(Sam)) 61 | 62 | # Test min value of Sam 63 | # Sam.maxs <- Sam[, lapply(.SD, max)] 64 | # print(str(Sam.maxs)) 65 | # print(sum(Sam.maxs==0)) 66 | 67 | # Change y values of IP/ED to 1/0 depending on return or not (binarize) 68 | Sam$ED_YTM <- ifelse(Sam$ED_YTM > 0, 1, 0) 69 | Sam$IP_YTM <- ifelse(Sam$IP_YTM > 0, 1, 0) 70 | 71 | # Change all necessary columns to factors to prevent scaling and 72 | # to assure SMOTE works 73 | Sam$StatePatientID <- as.factor(Sam$StatePatientID) 74 | Sam$ED_YTM <- as.factor(Sam$ED_YTM) 75 | Sam$IP_YTM <- as.factor(Sam$IP_YTM) 76 | 77 | # Scale all columns of Sam 78 | print("Starting to scale table.") 79 | Sam <- Sam[, lapply(.SD, unitScale)] 80 | print("Completed scaling of columns.") 81 | 82 | # Split into train and test 83 | print("Starting to split into train and test sets.") 84 | Sam.train <- Sam[StatePatientID %in% Train_ids[[1]]] 85 | Sam.test <- Sam[StatePatientID %in% Test_ids[[1]]] 86 | rm(Sam) 87 | print("Finished splitting into train and test sets.") 88 | 89 | # SMOTE algorithm for balancing training data by interpolated over/undersampling 90 | # Smote parameters 91 | print("Beginning to apply SMOTE algorithm.") 92 | percent_to_oversample <- 180 93 | percent_ratio_major_to_minor <- 200 94 | Sam.train <- SMOTE(IP_YTM ~ . -StatePatientID -ED_YTM, data = Sam.train, 95 | perc.over = percent_to_oversample, perc.under = percent_ratio_major_to_minor) 96 | print("Finished applying SMOTE algorithm.") 97 | 98 | # ROSE algorithm for balancing training data by over/undersampling 99 | # print("Beginning to apply ROSE algorithm.") 100 | # result_sample_size <- 200000 101 | # rare_proportion <- 0.4 102 | # Sam.train.without_factors <- Sam.train[, !c("StatePatientID", "ED_YTM"), with = FALSE] 103 | # Sam.train.factors <- Sam.train[, c("StatePatientID", "ED_YTM"), with = FALSE] 104 | # Sam.train <- ovun.sample(IP_YTM ~ . 
-StatePatientID -ED_YTM, data = Sam.train, 105 | # method = "both", N = result_sample_size, p = rare_proportion)$data 106 | # Sam.train <- data.table(Sam.train) 107 | # print("Finished applying ROSE algorithm.") 108 | 109 | # Shuffle train data to homogenize 0/1 y values 110 | print("Begin shuffle.") 111 | Sam.train <- Sam.train[sample(nrow(Sam.train)),] 112 | print("Finished shuffle.") 113 | 114 | # Split into train.x, train.y, test.x, test.y 115 | print("Begin split into x/y.") 116 | Sam.train.x <- Sam.train[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 117 | Sam.train.y <- Sam.train[, c("IP_YTM"), with = FALSE] 118 | rm(Sam.train) 119 | Sam.test.x <- Sam.test[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 120 | Sam.test.y <- Sam.test[, c("IP_YTM"), with = FALSE] 121 | Sam.test.ids <- Sam.test[, c("StatePatientID"), with = FALSE] 122 | rm(Sam.test) 123 | print("Finished split into x/y.") 124 | 125 | # Change y to one-hot 126 | Sam.train.y[, zero := ifelse(IP_YTM == 0, 1, 0)] 127 | Sam.train.y[, one := IP_YTM] 128 | Sam.train.y[, IP_YTM := NULL] 129 | Sam.test.y[, zero := ifelse(IP_YTM == 0, 1, 0)] 130 | Sam.test.y[, one := IP_YTM] 131 | Sam.test.y[, IP_YTM := NULL] 132 | 133 | # Write all splits to file 134 | print("Begin write to file.") 135 | base_name <- ifelse(is.na(base_name), "SAMFull", base_name) 136 | fwrite(Sam.train.x, paste0(base_name, "_train_x", ".csv")) 137 | fwrite(Sam.train.y, paste0(base_name, "_train_y", ".csv")) 138 | fwrite(Sam.test.x, paste0(base_name, "_test_x", ".csv")) 139 | fwrite(Sam.test.y, paste0(base_name, "_test_y", ".csv")) 140 | fwrite(Sam.test.ids, paste0(base_name, "_test_ids", ".csv")) 141 | print("Finished write to file.") 142 | 143 | # Remove all columns with all zero entries 144 | # Sam <- Sam[,which(unlist(lapply(Sam, function(x)!all(is.zero(x))))),with=F] 145 | # print(str(Sam)) 146 | -------------------------------------------------------------------------------- /preprocess/preprocess_and_split.R: -------------------------------------------------------------------------------- 1 | library(data.table) # Must have data.table v1.9.7+ 2 | library(readr) 3 | library(DMwR) 4 | library(ROSE) 5 | 6 | # Usage (must be run from command line) 7 | # Rscript 8 | # Program will print steps of execution and write 5 different files to disk: 9 | # - Train x (saved as base_name_train_x.csv) 10 | # - Train y (saved as base_name_train_y.csv as one-hot vectors) 11 | # - Test x (saved as base_name_test_x.csv) 12 | # - Test y (saved as base_name_test_y.csv as one-hot vectors) 13 | # - Test ids (saved as base_name_test_ids.csv) (patient ids in order of all the test cases) 14 | 15 | # Parse command line arguments 16 | args <- commandArgs(trailingOnly = TRUE) 17 | path_sam <- args[1] 18 | path_train_ids <- args[2] 19 | path_test_ids <- args[3] 20 | base_name <- args[4] 21 | 22 | # Read in raw files: SAM table, train case ids, and test case ids 23 | print(paste("Reading", path_sam)) 24 | Sam <- fread(path_sam, header = T) 25 | print(paste("Reading", path_train_ids)) 26 | Train_ids <- fread(path_train_ids, header = T) 27 | print(paste("Reading", path_test_ids)) 28 | Test_ids <- fread(path_test_ids, header = T) 29 | 30 | print("Done reading files.") 31 | 32 | # Reset headers of data tables to get rid of BOM in case it's there 33 | # http://stackoverflow.com/questions/21624796/read-the-text-file-with-bom-in-r 34 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding = "UTF-8-BOM")) 35 | Train_ids.names <- names(read.csv(path_train_ids, nrows = 
1, fileEncoding = "UTF-8-BOM")) 36 | Test_ids.names <- names(read.csv(path_test_ids, nrows = 1, fileEncoding = "UTF-8-BOM")) 37 | 38 | names(Sam) <- Sam.names 39 | names(Train_ids) <- Train_ids.names 40 | names(Test_ids) <- Test_ids.names 41 | 42 | print("Removed BOM from text") 43 | 44 | # Pre-processing functions 45 | is.zero <- function(v) { 46 | return(v==0) 47 | } 48 | 49 | unitScale <- function(v) { 50 | if (is.factor(v)) { 51 | return(v) 52 | } 53 | range <- max(v) - min(v) 54 | if (range == 0) { 55 | return(0) 56 | } 57 | return((v - min(v)) / range) 58 | } 59 | 60 | print(str(Sam)) 61 | 62 | # Test min value of Sam 63 | # Sam.maxs <- Sam[, lapply(.SD, max)] 64 | # print(str(Sam.maxs)) 65 | # print(sum(Sam.maxs==0)) 66 | 67 | # Change y values of IP/ED to 1/0 depending on return or not (binarize) 68 | Sam$ED_YTM <- ifelse(Sam$ED_YTM > 0, 1, 0) 69 | Sam$IP_YTM <- ifelse(Sam$IP_YTM > 0, 1, 0) 70 | 71 | # Change all necessary columns to factors to prevent scaling and 72 | # to assure SMOTE works 73 | Sam$StatePatientID <- as.factor(Sam$StatePatientID) 74 | Sam$ED_YTM <- as.factor(Sam$ED_YTM) 75 | Sam$IP_YTM <- as.factor(Sam$IP_YTM) 76 | 77 | # Scale all columns of Sam 78 | print("Starting to scale table.") 79 | Sam <- Sam[, lapply(.SD, unitScale)] 80 | print("Completed scaling of columns.") 81 | 82 | # Split into train and test 83 | print("Starting to split into train and test sets.") 84 | Sam.train <- Sam[StatePatientID %in% Train_ids[[1]]] 85 | Sam.test <- Sam[StatePatientID %in% Test_ids[[1]]] 86 | rm(Sam) 87 | print("Finished splitting into train and test sets.") 88 | 89 | # SMOTE algorithm for balancing training data by interpolated over/undersampling 90 | # Smote parameters 91 | # print("Beginning to apply SMOTE algorithm.") 92 | # percent_to_oversample <- 500 93 | # percent_ratio_major_to_minor <- 100 94 | # Sam.train <- SMOTE(IP_YTM ~ . -StatePatientID -ED_YTM, data = Sam.train, 95 | # perc.over = percent_to_oversample, perc.under = percent_ratio_major_to_minor) 96 | # print("Finished applying SMOTE algorithm.") 97 | 98 | # ROSE algorithm for balancing training data by over/undersampling 99 | print("Beginning to apply ROSE algorithm.") 100 | result_sample_size <- 200000 101 | rare_proportion <- 0.4 102 | # Sam.train.without_factors <- Sam.train[, !c("StatePatientID", "ED_YTM"), with = FALSE] 103 | # Sam.train.factors <- Sam.train[, c("StatePatientID", "ED_YTM"), with = FALSE] 104 | Sam.train <- ovun.sample(IP_YTM ~ . 
-StatePatientID -ED_YTM, data = Sam.train, 105 | method = "both", N = result_sample_size, p = rare_proportion)$data 106 | Sam.train <- data.table(Sam.train) 107 | print("Finished applying ROSE algorithm.") 108 | 109 | # Shuffle train data to homogenize 0/1 y values 110 | print("Begin shuffle.") 111 | Sam.train <- Sam.train[sample(nrow(Sam.train)),] 112 | print("Finished shuffle.") 113 | 114 | # Split into train.x, train.y, test.x, test.y 115 | print("Begin split into x/y.") 116 | Sam.train.x <- Sam.train[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 117 | Sam.train.y <- Sam.train[, c("IP_YTM"), with = FALSE] 118 | rm(Sam.train) 119 | Sam.test.x <- Sam.test[, !c("StatePatientID", "ED_YTM", "IP_YTM"), with = FALSE] 120 | Sam.test.y <- Sam.test[, c("IP_YTM"), with = FALSE] 121 | Sam.test.ids <- Sam.test[, c("StatePatientID"), with = FALSE] 122 | rm(Sam.test) 123 | print("Finished split into x/y.") 124 | 125 | # Change y to one-hot 126 | Sam.train.y[, zero := ifelse(IP_YTM == 0, 1, 0)] 127 | Sam.train.y[, one := IP_YTM] 128 | Sam.train.y[, IP_YTM := NULL] 129 | Sam.test.y[, zero := ifelse(IP_YTM == 0, 1, 0)] 130 | Sam.test.y[, one := IP_YTM] 131 | Sam.test.y[, IP_YTM := NULL] 132 | 133 | # Write all splits to file 134 | print("Begin write to file.") 135 | base_name <- ifelse(is.na(base_name), "SAMFull", base_name) 136 | fwrite(Sam.train.x, paste0(base_name, "_train_x", ".csv"), col.names = FALSE) 137 | fwrite(Sam.train.y, paste0(base_name, "_train_y", ".csv"), col.names = FALSE) 138 | fwrite(Sam.test.x, paste0(base_name, "_test_x", ".csv"), col.names = FALSE) 139 | fwrite(Sam.test.y, paste0(base_name, "_test_y", ".csv"), col.names = FALSE) 140 | fwrite(Sam.test.ids, paste0(base_name, "_test_ids", ".csv")) 141 | print("Finished write to file.") 142 | 143 | # Remove all columns with all zero entries 144 | # Sam <- Sam[,which(unlist(lapply(Sam, function(x)!all(is.zero(x))))),with=F] 145 | # print(str(Sam)) 146 | -------------------------------------------------------------------------------- /preprocess/reduce_columns.R: -------------------------------------------------------------------------------- 1 | library(data.table) 2 | 3 | # For use with command line Rscript 4 | 5 | args <- commandArgs(trailingOnly = TRUE) 6 | path_sam <- args[1] 7 | path_columns <- args[2] 8 | dest_filename <- args[3] 9 | 10 | print("Reading files") 11 | Sam <- fread(path_sam) 12 | Columns <- fread(path_columns, header = FALSE) 13 | print("Finished reading files") 14 | 15 | Sam.names <- names(read.csv(path_sam, nrows = 1, fileEncoding = "UTF-8-BOM")) 16 | Column.names <- c("features") 17 | 18 | names(Sam) <- Sam.names 19 | names(Columns) <- Column.names 20 | 21 | print("Removed BOM from text") 22 | print(str(Sam)) 23 | print(str(Columns)) 24 | 25 | Columns.vec <- Columns$features # first column 26 | # print("Before") 27 | # print(Columns.vec) 28 | Columns.vec <- Columns.vec[which(Columns.vec %in% colnames(Sam))] 29 | print("Reduced Columns vec") 30 | # print("After") 31 | # print(Columns.vec) 32 | 33 | print("Filtering columns") 34 | Sam <- Sam[, Columns.vec, with=F] 35 | fwrite(Sam, file.path = dest_filename) 36 | print("Done filtering columns") -------------------------------------------------------------------------------- /preprocess/roc.R: -------------------------------------------------------------------------------- 1 | library("ROCR") 2 | 3 | args <- commandArgs(trailingOnly = TRUE) 4 | pred_path <- args[1] 5 | labels_path <- args[2] 6 | 7 | pred <- read.csv(pred_path, header = FALSE)[,2] 8 | labels 
<- read.csv(labels_path, header = FALSE)[,2] 9 | 10 | pred <- prediction(pred, labels) 11 | perf <- performance(pred, measure = "tpr", x.measure = "fpr") # ROC 12 | pdf("ROC.pdf") 13 | plot(perf, col=rainbow(10)) 14 | dev.off() -------------------------------------------------------------------------------- /rf/rf2.py: -------------------------------------------------------------------------------- 1 | """ 2 | a random forest classifier 3 | with muilti-GPU utilization 4 | 5 | Tiffany.Fu 6 | 7 | """ 8 | 9 | 10 | from __future__ import absolute_import 11 | from __future__ import division 12 | from __future__ import print_function 13 | from __future__ import absolute_import 14 | from __future__ import division 15 | from __future__ import print_function 16 | 17 | from sklearn import datasets, metrics, cross_validation 18 | import tensorflow as tf 19 | from tensorflow.contrib import skflow 20 | 21 | 22 | 23 | import tensorflow as tf 24 | 25 | 26 | class TensorForestTrainer (tf.test.TestCase): 27 | 28 | def Classification(self): 29 | """classification using matrix data as input.""" 30 | hparams = tf.contrib.tensor_forest.python.tensor_forest.ForestHParams( 31 | num_trees=300, max_nodes=1000, num_classes=2, num_features=4) 32 | classifier = tf.contrib.learn.TensorForestEstimator(hparams) 33 | 34 | 35 | classifier.fit(x=, y=, steps=100) 36 | classifier.evaluate(x=, y=, steps=10) 37 | 38 | 39 | 40 | if __name__ == '__main__': 41 | tf.test.main() 42 | -------------------------------------------------------------------------------- /rf/rf3.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import time 6 | 7 | import numpy as np 8 | import six 9 | 10 | from tensorflow.contrib import framework as contrib_framework 11 | from tensorflow.contrib.learn.python.learn import monitors as mon 12 | 13 | from tensorflow.contrib.learn.python.learn.estimators import estimator 14 | from tensorflow.contrib.learn.python.learn.estimators import run_config 15 | 16 | from tensorflow.contrib.tensor_forest.client import eval_metrics 17 | from tensorflow.contrib.tensor_forest.data import data_ops 18 | from tensorflow.contrib.tensor_forest.python import tensor_forest 19 | 20 | from tensorflow.python.ops import array_ops 21 | from tensorflow.python.ops import control_flow_ops 22 | from tensorflow.python.ops import math_ops 23 | from tensorflow.python.ops import state_ops 24 | 25 | 26 | class LossMonitor(mon.EveryN): 27 | """Terminates training when training loss stops decreasing.""" 28 | 29 | def __init__(self, 30 | early_stopping_rounds, 31 | every_n_steps): 32 | super(LossMonitor, self).__init__(every_n_steps=every_n_steps) 33 | self.early_stopping_rounds = early_stopping_rounds 34 | self.min_loss = None 35 | self.min_loss_step = 0 36 | 37 | def set_estimator(self, est): 38 | """This function gets called in the same graph as _get_train_ops.""" 39 | super(LossMonitor, self).set_estimator(est) 40 | self._loss_op_name = est.training_loss.name 41 | 42 | def every_n_step_end(self, step, outputs): 43 | super(LossMonitor, self).every_n_step_end(step, outputs) 44 | current_loss = outputs[self._loss_op_name] 45 | if self.min_loss is None or current_loss < self.min_loss: 46 | self.min_loss = current_loss 47 | self.min_loss_step = step 48 | return step - self.min_loss_step >= self.early_stopping_rounds 49 | 50 | 51 | class TensorForestEstimator(estimator.BaseEstimator): 52 | 
"""An estimator that can train and evaluate a random forest.""" 53 | 54 | def __init__(self, params, device_assigner=None, model_dir=None, 55 | graph_builder_class=tensor_forest.RandomForestGraphs, 56 | master='', accuracy_metric=None, 57 | tf_random_seed=None, config=None): 58 | self.params = params.fill() 59 | self.accuracy_metric = (accuracy_metric or 60 | ('r2' if self.params.regression else 'accuracy')) 61 | self.data_feeder = None 62 | self.device_assigner = ( 63 | device_assigner or tensor_forest.RandomForestDeviceAssigner()) 64 | self.graph_builder_class = graph_builder_class 65 | self.training_args = {} 66 | self.construction_args = {} 67 | 68 | super(TensorForestEstimator, self).__init__(model_dir=model_dir, 69 | config=config) 70 | 71 | def predict_proba(self, x=None, input_fn=None, batch_size=None): 72 | """Returns prediction probabilities for given features (classification). 73 | Args: 74 | x: features. 75 | input_fn: Input function. If set, x and y must be None. 76 | batch_size: Override default batch size. 77 | Returns: 78 | Numpy array of predicted probabilities. 79 | Raises: 80 | ValueError: If both or neither of x and input_fn were given. 81 | """ 82 | return super(TensorForestEstimator, self).predict( 83 | x=x, input_fn=input_fn, batch_size=batch_size) 84 | 85 | def predict(self, x=None, input_fn=None, axis=None, batch_size=None): 86 | """Returns predictions for given features. 87 | Args: 88 | x: features. 89 | input_fn: Input function. If set, x must be None. 90 | axis: Axis on which to argmax (for classification). 91 | Last axis is used by default. 92 | batch_size: Override default batch size. 93 | Returns: 94 | Numpy array of predicted classes or regression values. 95 | """ 96 | probabilities = self.predict_proba(x, input_fn, batch_size) 97 | if self.params.regression: 98 | return probabilities 99 | else: 100 | return np.argmax(probabilities, axis=1) 101 | 102 | def _get_train_ops(self, features, targets): 103 | """Method that builds model graph and returns trainer ops. 104 | Args: 105 | features: `Tensor` or `dict` of `Tensor` objects. 106 | targets: `Tensor` or `dict` of `Tensor` objects. 107 | Returns: 108 | Tuple of train `Operation` and loss `Tensor`. 
109 | """ 110 | features, spec = data_ops.ParseDataTensorOrDict(features) 111 | labels = data_ops.ParseLabelTensorOrDict(targets) 112 | 113 | graph_builder = self.graph_builder_class( 114 | self.params, device_assigner=self.device_assigner, 115 | **self.construction_args) 116 | 117 | epoch = None 118 | if self.data_feeder: 119 | epoch = self.data_feeder.make_epoch_variable() 120 | 121 | train = control_flow_ops.group( 122 | graph_builder.training_graph( 123 | features, labels, data_spec=spec, epoch=epoch, 124 | **self.training_args), 125 | state_ops.assign_add(contrib_framework.get_global_step(), 1)) 126 | 127 | self.training_loss = graph_builder.training_loss(features, targets) 128 | 129 | return train, self.training_loss 130 | 131 | def _get_predict_ops(self, features): 132 | graph_builder = self.graph_builder_class( 133 | self.params, device_assigner=self.device_assigner, training=False, 134 | **self.construction_args) 135 | features, spec = data_ops.ParseDataTensorOrDict(features) 136 | return graph_builder.inference_graph(features, data_spec=spec) 137 | 138 | def _get_eval_ops(self, features, targets, metrics): 139 | features, spec = data_ops.ParseDataTensorOrDict(features) 140 | labels = data_ops.ParseLabelTensorOrDict(targets) 141 | 142 | graph_builder = self.graph_builder_class( 143 | self.params, device_assigner=self.device_assigner, training=False, 144 | **self.construction_args) 145 | 146 | probabilities = graph_builder.inference_graph(features, data_spec=spec) 147 | 148 | # One-hot the labels. 149 | if not self.params.regression: 150 | labels = math_ops.to_int64(array_ops.one_hot(math_ops.to_int64( 151 | array_ops.squeeze(labels)), self.params.num_classes, 1, 0)) 152 | 153 | if metrics is None: 154 | metrics = {self.accuracy_metric: 155 | eval_metrics.get_metric(self.accuracy_metric)} 156 | 157 | result = {} 158 | for name, metric in six.iteritems(metrics): 159 | result[name] = metric(probabilities, labels) 160 | 161 | return result 162 | -------------------------------------------------------------------------------- /rf/tensor_forest.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Extremely random forest graph builder. 
go/brain-tree.""" 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | 20 | import math 21 | import random 22 | 23 | from tensorflow.contrib.tensor_forest.python import constants 24 | from tensorflow.contrib.tensor_forest.python.ops import inference_ops 25 | from tensorflow.contrib.tensor_forest.python.ops import training_ops 26 | 27 | from tensorflow.python.framework import constant_op 28 | from tensorflow.python.framework import dtypes 29 | from tensorflow.python.framework import ops 30 | from tensorflow.python.ops import array_ops 31 | from tensorflow.python.ops import control_flow_ops 32 | from tensorflow.python.ops import init_ops 33 | from tensorflow.python.ops import math_ops 34 | from tensorflow.python.ops import random_ops 35 | from tensorflow.python.ops import state_ops 36 | from tensorflow.python.ops import variable_scope 37 | from tensorflow.python.ops import variables as tf_variables 38 | from tensorflow.python.platform import tf_logging as logging 39 | 40 | 41 | # A convenience class for holding random forest hyperparameters. 42 | # 43 | # To just get some good default parameters, use: 44 | # hparams = ForestHParams(num_classes=2, num_features=40).fill() 45 | # 46 | # Note that num_classes can not be inferred and so must always be specified. 47 | # Also, either num_splits_to_consider or num_features should be set. 48 | # 49 | # To override specific values, pass them to the constructor: 50 | # hparams = ForestHParams(num_classes=5, num_trees=10, num_features=5).fill() 51 | # 52 | # TODO(thomaswc): Inherit from tf.HParams when that is publicly available. 53 | class ForestHParams(object): 54 | """A base class for holding hyperparameters and calculating good defaults.""" 55 | 56 | def __init__(self, 57 | num_trees=100, 58 | max_nodes=10000, 59 | bagging_fraction=1.0, 60 | num_splits_to_consider=0, 61 | feature_bagging_fraction=1.0, 62 | max_fertile_nodes=0, 63 | split_after_samples=250, 64 | min_split_samples=5, 65 | valid_leaf_threshold=1, 66 | **kwargs): 67 | self.num_trees = num_trees 68 | self.max_nodes = max_nodes 69 | self.bagging_fraction = bagging_fraction 70 | self.feature_bagging_fraction = feature_bagging_fraction 71 | self.num_splits_to_consider = num_splits_to_consider 72 | self.max_fertile_nodes = max_fertile_nodes 73 | self.split_after_samples = split_after_samples 74 | self.min_split_samples = min_split_samples 75 | self.valid_leaf_threshold = valid_leaf_threshold 76 | 77 | for name, value in kwargs.items(): 78 | setattr(self, name, value) 79 | 80 | def values(self): 81 | return self.__dict__ 82 | 83 | def fill(self): 84 | """Intelligently sets any non-specific parameters.""" 85 | # Fail fast if num_classes or num_features isn't set. 86 | _ = getattr(self, 'num_classes') 87 | _ = getattr(self, 'num_features') 88 | 89 | self.bagged_num_features = int(self.feature_bagging_fraction * 90 | self.num_features) 91 | 92 | self.bagged_features = None 93 | if self.feature_bagging_fraction < 1.0: 94 | self.bagged_features = [random.sample( 95 | range(self.num_features), 96 | self.bagged_num_features) for _ in range(self.num_trees)] 97 | 98 | self.regression = getattr(self, 'regression', False) 99 | 100 | # Num_outputs is the actual number of outputs (a single prediction for 101 | # classification, a N-dimenensional point for regression). 
102 | self.num_outputs = self.num_classes if self.regression else 1 103 | 104 | # Add an extra column to classes for storing counts, which is needed for 105 | # regression and avoids having to recompute sums for classification. 106 | self.num_output_columns = self.num_classes + 1 107 | 108 | # The Random Forest literature recommends sqrt(# features) for 109 | # classification problems, and p/3 for regression problems. 110 | # TODO(thomaswc): Consider capping this for large number of features. 111 | self.num_splits_to_consider = ( 112 | self.num_splits_to_consider or 113 | max(10, int(math.ceil(math.sqrt(self.num_features))))) 114 | 115 | # max_fertile_nodes doesn't effect performance, only training speed. 116 | # We therefore set it primarily based upon space considerations. 117 | # Each fertile node takes up num_splits_to_consider times as much 118 | # as space as a non-fertile node. We want the fertile nodes to in 119 | # total only take up as much space as the non-fertile nodes, so 120 | num_fertile = int(math.ceil(self.max_nodes / self.num_splits_to_consider)) 121 | # But always use at least 1000 accumulate slots. 122 | num_fertile = max(num_fertile, 1000) 123 | self.max_fertile_nodes = self.max_fertile_nodes or num_fertile 124 | # But it also never needs to be larger than the number of leaves, 125 | # which is max_nodes / 2. 126 | self.max_fertile_nodes = min(self.max_fertile_nodes, 127 | int(math.ceil(self.max_nodes / 2.0))) 128 | 129 | # We have num_splits_to_consider slots to fill, and we want to spend 130 | # approximately split_after_samples samples initializing them. 131 | num_split_initializiations_per_input = max(1, int(math.floor( 132 | self.num_splits_to_consider / self.split_after_samples))) 133 | self.split_initializations_per_input = getattr( 134 | self, 'split_initializations_per_input', 135 | num_split_initializiations_per_input) 136 | 137 | # If base_random_seed is 0, the current time will be used to seed the 138 | # random number generators for each tree. If non-zero, the i-th tree 139 | # will be seeded with base_random_seed + i. 140 | self.base_random_seed = getattr(self, 'base_random_seed', 0) 141 | 142 | return self 143 | 144 | 145 | # A simple container to hold the training variables for a single tree. 146 | class TreeTrainingVariables(object): 147 | """Stores tf.Variables for training a single random tree. 148 | 149 | Uses tf.get_variable to get tree-specific names so that this can be used 150 | with a tf.learn-style implementation (one that trains a model, saves it, 151 | then relies on restoring that model to evaluate). 
152 | """ 153 | 154 | def __init__(self, params, tree_num, training): 155 | self.tree = variable_scope.get_variable( 156 | name=self.get_tree_name('tree', tree_num), dtype=dtypes.int32, 157 | shape=[params.max_nodes, 2], 158 | initializer=init_ops.constant_initializer(-2)) 159 | self.tree_thresholds = variable_scope.get_variable( 160 | name=self.get_tree_name('tree_thresholds', tree_num), 161 | shape=[params.max_nodes], 162 | initializer=init_ops.constant_initializer(-1.0)) 163 | self.end_of_tree = variable_scope.get_variable( 164 | name=self.get_tree_name('end_of_tree', tree_num), 165 | dtype=dtypes.int32, 166 | initializer=constant_op.constant([1])) 167 | self.start_epoch = tf_variables.Variable( 168 | [0] * (params.max_nodes), name='start_epoch') 169 | 170 | if training: 171 | self.node_to_accumulator_map = variable_scope.get_variable( 172 | name=self.get_tree_name('node_to_accumulator_map', tree_num), 173 | shape=[params.max_nodes], 174 | dtype=dtypes.int32, 175 | initializer=init_ops.constant_initializer(-1)) 176 | 177 | self.candidate_split_features = variable_scope.get_variable( 178 | name=self.get_tree_name('candidate_split_features', tree_num), 179 | shape=[params.max_fertile_nodes, params.num_splits_to_consider], 180 | dtype=dtypes.int32, 181 | initializer=init_ops.constant_initializer(-1)) 182 | self.candidate_split_thresholds = variable_scope.get_variable( 183 | name=self.get_tree_name('candidate_split_thresholds', tree_num), 184 | shape=[params.max_fertile_nodes, params.num_splits_to_consider], 185 | initializer=init_ops.constant_initializer(0.0)) 186 | 187 | # Statistics shared by classification and regression. 188 | self.node_sums = variable_scope.get_variable( 189 | name=self.get_tree_name('node_sums', tree_num), 190 | shape=[params.max_nodes, params.num_output_columns], 191 | initializer=init_ops.constant_initializer(0.0)) 192 | 193 | if training: 194 | self.candidate_split_sums = variable_scope.get_variable( 195 | name=self.get_tree_name('candidate_split_sums', tree_num), 196 | shape=[params.max_fertile_nodes, params.num_splits_to_consider, 197 | params.num_output_columns], 198 | initializer=init_ops.constant_initializer(0.0)) 199 | self.accumulator_sums = variable_scope.get_variable( 200 | name=self.get_tree_name('accumulator_sums', tree_num), 201 | shape=[params.max_fertile_nodes, params.num_output_columns], 202 | initializer=init_ops.constant_initializer(-1.0)) 203 | 204 | # Regression also tracks second order stats. 
205 | if params.regression: 206 | self.node_squares = variable_scope.get_variable( 207 | name=self.get_tree_name('node_squares', tree_num), 208 | shape=[params.max_nodes, params.num_output_columns], 209 | initializer=init_ops.constant_initializer(0.0)) 210 | 211 | self.candidate_split_squares = variable_scope.get_variable( 212 | name=self.get_tree_name('candidate_split_squares', tree_num), 213 | shape=[params.max_fertile_nodes, params.num_splits_to_consider, 214 | params.num_output_columns], 215 | initializer=init_ops.constant_initializer(0.0)) 216 | 217 | self.accumulator_squares = variable_scope.get_variable( 218 | name=self.get_tree_name('accumulator_squares', tree_num), 219 | shape=[params.max_fertile_nodes, params.num_output_columns], 220 | initializer=init_ops.constant_initializer(-1.0)) 221 | 222 | else: 223 | self.node_squares = constant_op.constant( 224 | 0.0, name=self.get_tree_name('node_squares', tree_num)) 225 | 226 | self.candidate_split_squares = constant_op.constant( 227 | 0.0, name=self.get_tree_name('candidate_split_squares', tree_num)) 228 | 229 | self.accumulator_squares = constant_op.constant( 230 | 0.0, name=self.get_tree_name('accumulator_squares', tree_num)) 231 | 232 | def get_tree_name(self, name, num): 233 | return '{0}-{1}'.format(name, num) 234 | 235 | 236 | class ForestStats(object): 237 | 238 | def __init__(self, tree_stats, params): 239 | """A simple container for stats about a forest.""" 240 | self.tree_stats = tree_stats 241 | self.params = params 242 | 243 | def get_average(self, thing): 244 | val = 0.0 245 | for i in range(self.params.num_trees): 246 | val += getattr(self.tree_stats[i], thing) 247 | 248 | return val / self.params.num_trees 249 | 250 | 251 | class TreeStats(object): 252 | 253 | def __init__(self, num_nodes, num_leaves): 254 | self.num_nodes = num_nodes 255 | self.num_leaves = num_leaves 256 | 257 | 258 | class ForestTrainingVariables(object): 259 | """A container for a forests training data, consisting of multiple trees. 260 | 261 | Instantiates a TreeTrainingVariables object for each tree. We override the 262 | __getitem__ and __setitem__ function so that usage looks like this: 263 | 264 | forest_variables = ForestTrainingVariables(params) 265 | 266 | ... forest_variables.tree ... 267 | """ 268 | 269 | def __init__(self, params, device_assigner, training=True, 270 | tree_variables_class=TreeTrainingVariables): 271 | self.variables = [] 272 | for i in range(params.num_trees): 273 | with ops.device(device_assigner.get_device(i)): 274 | self.variables.append(tree_variables_class(params, i, training)) 275 | 276 | def __setitem__(self, t, val): 277 | self.variables[t] = val 278 | 279 | def __getitem__(self, t): 280 | return self.variables[t] 281 | 282 | 283 | class RandomForestDeviceAssigner(object): 284 | """A device assigner that uses the default device. 285 | 286 | Write subclasses that implement get_device for control over how trees 287 | get assigned to devices. This assumes that whole trees are assigned 288 | to a device. 
289 | """ 290 | 291 | def __init__(self): 292 | self.cached = None 293 | 294 | def get_device(self, unused_tree_num): 295 | if not self.cached: 296 | dummy = constant_op.constant(0) 297 | self.cached = dummy.device 298 | 299 | return self.cached 300 | 301 | 302 | class RandomForestGraphs(object): 303 | """Builds TF graphs for random forest training and inference.""" 304 | 305 | def __init__(self, params, device_assigner=None, 306 | variables=None, tree_variables_class=TreeTrainingVariables, 307 | tree_graphs=None, training=True, 308 | t_ops=training_ops, 309 | i_ops=inference_ops): 310 | self.params = params 311 | self.device_assigner = device_assigner or RandomForestDeviceAssigner() 312 | logging.info('Constructing forest with params = ') 313 | logging.info(self.params.__dict__) 314 | self.variables = variables or ForestTrainingVariables( 315 | self.params, device_assigner=self.device_assigner, training=training, 316 | tree_variables_class=tree_variables_class) 317 | tree_graph_class = tree_graphs or RandomTreeGraphs 318 | self.trees = [ 319 | tree_graph_class( 320 | self.variables[i], self.params, 321 | t_ops.Load(), i_ops.Load(), i) 322 | for i in range(self.params.num_trees)] 323 | 324 | def _bag_features(self, tree_num, input_data): 325 | split_data = array_ops.split(1, self.params.num_features, input_data) 326 | return array_ops.concat( 327 | 1, [split_data[ind] for ind in self.params.bagged_features[tree_num]]) 328 | 329 | def training_graph(self, input_data, input_labels, data_spec=None, 330 | epoch=None, **tree_kwargs): 331 | """Constructs a TF graph for training a random forest. 332 | 333 | Args: 334 | input_data: A tensor or SparseTensor or placeholder for input data. 335 | input_labels: A tensor or placeholder for labels associated with 336 | input_data. 337 | data_spec: A list of tf.dtype values specifying the original types of 338 | each column. 339 | epoch: A tensor or placeholder for the epoch the training data comes from. 340 | **tree_kwargs: Keyword arguments passed to each tree's training_graph. 341 | 342 | Returns: 343 | The last op in the random forest training graph. 344 | """ 345 | data_spec = [constants.DATA_FLOAT] if data_spec is None else data_spec 346 | tree_graphs = [] 347 | for i in range(self.params.num_trees): 348 | with ops.device(self.device_assigner.get_device(i)): 349 | seed = self.params.base_random_seed 350 | if seed != 0: 351 | seed += i 352 | # If using bagging, randomly select some of the input. 353 | tree_data = input_data 354 | tree_labels = input_labels 355 | if self.params.bagging_fraction < 1.0: 356 | # TODO(thomaswc): This does sampling without replacment. Consider 357 | # also allowing sampling with replacement as an option. 358 | batch_size = array_ops.slice(array_ops.shape(input_data), [0], [1]) 359 | r = random_ops.random_uniform(batch_size, seed=seed) 360 | mask = math_ops.less( 361 | r, array_ops.ones_like(r) * self.params.bagging_fraction) 362 | gather_indices = array_ops.squeeze( 363 | array_ops.where(mask), squeeze_dims=[1]) 364 | # TODO(thomaswc): Calculate out-of-bag data and labels, and store 365 | # them for use in calculating statistics later. 
366 | tree_data = array_ops.gather(input_data, gather_indices) 367 | tree_labels = array_ops.gather(input_labels, gather_indices) 368 | if self.params.bagged_features: 369 | tree_data = self._bag_features(i, tree_data) 370 | 371 | initialization = self.trees[i].tree_initialization() 372 | 373 | with ops.control_dependencies([initialization]): 374 | tree_graphs.append( 375 | self.trees[i].training_graph( 376 | tree_data, tree_labels, seed, data_spec=data_spec, 377 | epoch=([0] if epoch is None else epoch), 378 | **tree_kwargs)) 379 | 380 | return control_flow_ops.group(*tree_graphs, name='train') 381 | 382 | def inference_graph(self, input_data, data_spec=None): 383 | """Constructs a TF graph for evaluating a random forest. 384 | 385 | Args: 386 | input_data: A tensor or SparseTensor or placeholder for input data. 387 | data_spec: A list of tf.dtype values specifying the original types of 388 | each column. 389 | 390 | Returns: 391 | The last op in the random forest inference graph. 392 | """ 393 | data_spec = [constants.DATA_FLOAT] if data_spec is None else data_spec 394 | probabilities = [] 395 | for i in range(self.params.num_trees): 396 | with ops.device(self.device_assigner.get_device(i)): 397 | tree_data = input_data 398 | if self.params.bagged_features: 399 | tree_data = self._bag_features(i, input_data) 400 | probabilities.append(self.trees[i].inference_graph(tree_data, 401 | data_spec)) 402 | with ops.device(self.device_assigner.get_device(0)): 403 | all_predict = array_ops.pack(probabilities) 404 | return math_ops.div( 405 | math_ops.reduce_sum(all_predict, 0), self.params.num_trees, 406 | name='probabilities') 407 | 408 | def average_size(self): 409 | """Constructs a TF graph for evaluating the average size of a forest. 410 | 411 | Returns: 412 | The average number of nodes over the trees. 413 | """ 414 | sizes = [] 415 | for i in range(self.params.num_trees): 416 | with ops.device(self.device_assigner.get_device(i)): 417 | sizes.append(self.trees[i].size()) 418 | return math_ops.reduce_mean(array_ops.pack(sizes)) 419 | 420 | # pylint: disable=unused-argument 421 | def training_loss(self, features, labels): 422 | return math_ops.neg(self.average_size()) 423 | 424 | # pylint: disable=unused-argument 425 | def validation_loss(self, features, labels): 426 | return math_ops.neg(self.average_size()) 427 | 428 | def average_impurity(self): 429 | """Constructs a TF graph for evaluating the leaf impurity of a forest. 430 | 431 | Returns: 432 | The last op in the graph. 
433 | """ 434 | impurities = [] 435 | for i in range(self.params.num_trees): 436 | with ops.device(self.device_assigner.get_device(i)): 437 | impurities.append(self.trees[i].average_impurity()) 438 | return math_ops.reduce_mean(array_ops.pack(impurities)) 439 | 440 | def get_stats(self, session): 441 | tree_stats = [] 442 | for i in range(self.params.num_trees): 443 | with ops.device(self.device_assigner.get_device(i)): 444 | tree_stats.append(self.trees[i].get_stats(session)) 445 | return ForestStats(tree_stats, self.params) 446 | 447 | 448 | class RandomTreeGraphs(object): 449 | """Builds TF graphs for random tree training and inference.""" 450 | 451 | def __init__(self, variables, params, t_ops, i_ops, tree_num): 452 | self.training_ops = t_ops 453 | self.inference_ops = i_ops 454 | self.variables = variables 455 | self.params = params 456 | self.tree_num = tree_num 457 | 458 | def tree_initialization(self): 459 | def _init_tree(): 460 | return state_ops.scatter_update(self.variables.tree, [0], [[-1, -1]]).op 461 | 462 | def _nothing(): 463 | return control_flow_ops.no_op() 464 | 465 | return control_flow_ops.cond( 466 | math_ops.equal(array_ops.squeeze(array_ops.slice( 467 | self.variables.tree, [0, 0], [1, 1])), -2), 468 | _init_tree, _nothing) 469 | 470 | def _gini(self, class_counts): 471 | """Calculate the Gini impurity. 472 | 473 | If c(i) denotes the i-th class count and c = sum_i c(i) then 474 | score = 1 - sum_i ( c(i) / c )^2 475 | 476 | Args: 477 | class_counts: A 2-D tensor of per-class counts, usually a slice or 478 | gather from variables.node_sums. 479 | 480 | Returns: 481 | A 1-D tensor of the Gini impurities for each row in the input. 482 | """ 483 | smoothed = 1.0 + array_ops.slice(class_counts, [0, 1], [-1, -1]) 484 | sums = math_ops.reduce_sum(smoothed, 1) 485 | sum_squares = math_ops.reduce_sum(math_ops.square(smoothed), 1) 486 | 487 | return 1.0 - sum_squares / (sums * sums) 488 | 489 | def _weighted_gini(self, class_counts): 490 | """Our split score is the Gini impurity times the number of examples. 491 | 492 | If c(i) denotes the i-th class count and c = sum_i c(i) then 493 | score = c * (1 - sum_i ( c(i) / c )^2 ) 494 | = c - sum_i c(i)^2 / c 495 | Args: 496 | class_counts: A 2-D tensor of per-class counts, usually a slice or 497 | gather from variables.node_sums. 498 | 499 | Returns: 500 | A 1-D tensor of the Gini impurities for each row in the input. 501 | """ 502 | smoothed = 1.0 + array_ops.slice(class_counts, [0, 1], [-1, -1]) 503 | sums = math_ops.reduce_sum(smoothed, 1) 504 | sum_squares = math_ops.reduce_sum(math_ops.square(smoothed), 1) 505 | 506 | return sums - sum_squares / sums 507 | 508 | def _variance(self, sums, squares): 509 | """Calculate the variance for each row of the input tensors. 510 | 511 | Variance is V = E[x^2] - (E[x])^2. 512 | 513 | Args: 514 | sums: A tensor containing output sums, usually a slice from 515 | variables.node_sums. Should contain the number of examples seen 516 | in index 0 so we can calculate expected value. 517 | squares: Same as sums, but sums of squares. 518 | 519 | Returns: 520 | A 1-D tensor of the variances for each row in the input. 521 | """ 522 | total_count = array_ops.slice(sums, [0, 0], [-1, 1]) 523 | e_x = sums / total_count 524 | e_x2 = squares / total_count 525 | 526 | return math_ops.reduce_sum(e_x2 - math_ops.square(e_x), 1) 527 | 528 | def training_graph(self, input_data, input_labels, random_seed, 529 | data_spec, epoch=None): 530 | 531 | """Constructs a TF graph for training a random tree. 
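
    The returned op groups the per-batch steps implemented below: counting node
    and candidate-split statistics, sampling new candidate split features and
    thresholds, finding finished fertile nodes and their best splits, growing
    the tree, and reallocating fertile slots (with their accumulators reset).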
532 | 533 | Args: 534 | input_data: A tensor or SparseTensor or placeholder for input data. 535 | input_labels: A tensor or placeholder for labels associated with 536 | input_data. 537 | random_seed: The random number generator seed to use for this tree. 0 538 | means use the current time as the seed. 539 | data_spec: A list of tf.dtype values specifying the original types of 540 | each column. 541 | epoch: A tensor or placeholder for the epoch the training data comes from. 542 | 543 | Returns: 544 | The last op in the random tree training graph. 545 | """ 546 | epoch = [0] if epoch is None else epoch 547 | 548 | sparse_indices = [] 549 | sparse_values = [] 550 | sparse_shape = [] 551 | if isinstance(input_data, ops.SparseTensor): 552 | sparse_indices = input_data.indices 553 | sparse_values = input_data.values 554 | sparse_shape = input_data.shape 555 | input_data = [] 556 | 557 | # Count extremely random stats. 558 | (node_sums, node_squares, splits_indices, splits_sums, 559 | splits_squares, totals_indices, totals_sums, 560 | totals_squares, input_leaves) = ( 561 | self.training_ops.count_extremely_random_stats( 562 | input_data, sparse_indices, sparse_values, sparse_shape, 563 | data_spec, input_labels, self.variables.tree, 564 | self.variables.tree_thresholds, 565 | self.variables.node_to_accumulator_map, 566 | self.variables.candidate_split_features, 567 | self.variables.candidate_split_thresholds, 568 | self.variables.start_epoch, epoch, 569 | num_classes=self.params.num_output_columns, 570 | regression=self.params.regression)) 571 | node_update_ops = [] 572 | node_update_ops.append( 573 | state_ops.assign_add(self.variables.node_sums, node_sums)) 574 | 575 | splits_update_ops = [] 576 | splits_update_ops.append(self.training_ops.scatter_add_ndim( 577 | self.variables.candidate_split_sums, 578 | splits_indices, splits_sums)) 579 | splits_update_ops.append(self.training_ops.scatter_add_ndim( 580 | self.variables.accumulator_sums, totals_indices, 581 | totals_sums)) 582 | 583 | if self.params.regression: 584 | node_update_ops.append(state_ops.assign_add(self.variables.node_squares, 585 | node_squares)) 586 | splits_update_ops.append(self.training_ops.scatter_add_ndim( 587 | self.variables.candidate_split_squares, 588 | splits_indices, splits_squares)) 589 | splits_update_ops.append(self.training_ops.scatter_add_ndim( 590 | self.variables.accumulator_squares, totals_indices, 591 | totals_squares)) 592 | 593 | # Sample inputs. 594 | update_indices, feature_updates, threshold_updates = ( 595 | self.training_ops.sample_inputs( 596 | input_data, sparse_indices, sparse_values, sparse_shape, 597 | self.variables.node_to_accumulator_map, 598 | input_leaves, self.variables.candidate_split_features, 599 | self.variables.candidate_split_thresholds, 600 | split_initializations_per_input=( 601 | self.params.split_initializations_per_input), 602 | split_sampling_random_seed=random_seed)) 603 | update_features_op = state_ops.scatter_update( 604 | self.variables.candidate_split_features, update_indices, 605 | feature_updates) 606 | update_thresholds_op = state_ops.scatter_update( 607 | self.variables.candidate_split_thresholds, update_indices, 608 | threshold_updates) 609 | 610 | # Calculate finished nodes. 
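    # Side note on the scoring helper used for the leaf scores further below:
    # _weighted_gini reduces to simple row-wise arithmetic on the (+1 smoothed)
    # per-class counts. A NumPy sketch (illustrative only; counts here exclude
    # the leading total-count column):
    #
    #   smoothed = 1.0 + np.asarray(class_counts, dtype=np.float64)
    #   sums = smoothed.sum(axis=1)
    #   score = sums - (smoothed ** 2).sum(axis=1) / sums
    #
    # e.g. a pure node [[10, 0]] scores ~1.83 while a mixed node [[5, 5]]
    # scores 6.0, so lower is better.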
611 | with ops.control_dependencies(splits_update_ops): 612 | children = array_ops.squeeze(array_ops.slice( 613 | self.variables.tree, [0, 0], [-1, 1]), squeeze_dims=[1]) 614 | is_leaf = math_ops.equal(constants.LEAF_NODE, children) 615 | leaves = math_ops.to_int32(array_ops.squeeze(array_ops.where(is_leaf), 616 | squeeze_dims=[1])) 617 | finished, stale = self.training_ops.finished_nodes( 618 | leaves, self.variables.node_to_accumulator_map, 619 | self.variables.candidate_split_sums, 620 | self.variables.candidate_split_squares, 621 | self.variables.accumulator_sums, 622 | self.variables.accumulator_squares, 623 | self.variables.start_epoch, epoch, 624 | num_split_after_samples=self.params.split_after_samples, 625 | min_split_samples=self.params.min_split_samples) 626 | 627 | # Update leaf scores. 628 | non_fertile_leaves = array_ops.boolean_mask( 629 | leaves, math_ops.less(array_ops.gather( 630 | self.variables.node_to_accumulator_map, leaves), 0)) 631 | 632 | # TODO(gilberth): It should be possible to limit the number of non 633 | # fertile leaves we calculate scores for, especially since we can only take 634 | # at most array_ops.shape(finished)[0] of them. 635 | with ops.control_dependencies(node_update_ops): 636 | sums = array_ops.gather(self.variables.node_sums, non_fertile_leaves) 637 | if self.params.regression: 638 | squares = array_ops.gather(self.variables.node_squares, 639 | non_fertile_leaves) 640 | non_fertile_leaf_scores = self._variance(sums, squares) 641 | else: 642 | non_fertile_leaf_scores = self._weighted_gini(sums) 643 | 644 | # Calculate best splits. 645 | with ops.control_dependencies(splits_update_ops): 646 | split_indices = self.training_ops.best_splits( 647 | finished, self.variables.node_to_accumulator_map, 648 | self.variables.candidate_split_sums, 649 | self.variables.candidate_split_squares, 650 | self.variables.accumulator_sums, 651 | self.variables.accumulator_squares, 652 | regression=self.params.regression) 653 | 654 | # Grow tree. 655 | with ops.control_dependencies([update_features_op, update_thresholds_op]): 656 | (tree_update_indices, tree_children_updates, tree_threshold_updates, 657 | new_eot) = (self.training_ops.grow_tree( 658 | self.variables.end_of_tree, self.variables.node_to_accumulator_map, 659 | finished, split_indices, self.variables.candidate_split_features, 660 | self.variables.candidate_split_thresholds)) 661 | tree_update_op = state_ops.scatter_update( 662 | self.variables.tree, tree_update_indices, tree_children_updates) 663 | thresholds_update_op = state_ops.scatter_update( 664 | self.variables.tree_thresholds, tree_update_indices, 665 | tree_threshold_updates) 666 | # TODO(thomaswc): Only update the epoch on the new leaves. 667 | new_epoch_updates = epoch * array_ops.ones_like(tree_threshold_updates, 668 | dtype=dtypes.int32) 669 | epoch_update_op = state_ops.scatter_update( 670 | self.variables.start_epoch, tree_update_indices, 671 | new_epoch_updates) 672 | 673 | # Update fertile slots. 674 | with ops.control_dependencies([tree_update_op]): 675 | (node_map_updates, accumulators_cleared, accumulators_allocated) = ( 676 | self.training_ops.update_fertile_slots( 677 | finished, 678 | non_fertile_leaves, 679 | non_fertile_leaf_scores, 680 | self.variables.end_of_tree, 681 | self.variables.accumulator_sums, 682 | self.variables.node_to_accumulator_map, 683 | stale, 684 | regression=self.params.regression)) 685 | 686 | # Ensure end_of_tree doesn't get updated until UpdateFertileSlots has 687 | # used it to calculate new leaves. 
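    # (control_flow_ops.tuple only returns its tensors after all of its inputs
    # and control_inputs have run, so the assign below is ordered after the
    # node map update.)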
688 | gated_new_eot, = control_flow_ops.tuple([new_eot], 689 | control_inputs=[node_map_updates]) 690 | eot_update_op = state_ops.assign(self.variables.end_of_tree, gated_new_eot) 691 | 692 | updates = [] 693 | updates.append(eot_update_op) 694 | updates.append(tree_update_op) 695 | updates.append(thresholds_update_op) 696 | updates.append(epoch_update_op) 697 | 698 | updates.append(state_ops.scatter_update( 699 | self.variables.node_to_accumulator_map, 700 | array_ops.squeeze(array_ops.slice(node_map_updates, [0, 0], [1, -1]), 701 | squeeze_dims=[0]), 702 | array_ops.squeeze(array_ops.slice(node_map_updates, [1, 0], [1, -1]), 703 | squeeze_dims=[0]))) 704 | 705 | cleared_and_allocated_accumulators = array_ops.concat( 706 | 0, [accumulators_cleared, accumulators_allocated]) 707 | # Calculate values to put into scatter update for candidate counts. 708 | # Candidate split counts are always reset back to 0 for both cleared 709 | # and allocated accumulators. This means some accumulators might be doubly 710 | # reset to 0 if the were released and not allocated, then later allocated. 711 | split_values = array_ops.tile( 712 | array_ops.expand_dims(array_ops.expand_dims( 713 | array_ops.zeros_like(cleared_and_allocated_accumulators, 714 | dtype=dtypes.float32), 1), 2), 715 | [1, self.params.num_splits_to_consider, self.params.num_output_columns]) 716 | updates.append(state_ops.scatter_update( 717 | self.variables.candidate_split_sums, 718 | cleared_and_allocated_accumulators, split_values)) 719 | if self.params.regression: 720 | updates.append(state_ops.scatter_update( 721 | self.variables.candidate_split_squares, 722 | cleared_and_allocated_accumulators, split_values)) 723 | 724 | # Calculate values to put into scatter update for total counts. 725 | total_cleared = array_ops.tile( 726 | array_ops.expand_dims( 727 | math_ops.neg(array_ops.ones_like(accumulators_cleared, 728 | dtype=dtypes.float32)), 1), 729 | [1, self.params.num_output_columns]) 730 | total_reset = array_ops.tile( 731 | array_ops.expand_dims( 732 | array_ops.zeros_like(accumulators_allocated, 733 | dtype=dtypes.float32), 1), 734 | [1, self.params.num_output_columns]) 735 | accumulator_updates = array_ops.concat(0, [total_cleared, total_reset]) 736 | updates.append(state_ops.scatter_update( 737 | self.variables.accumulator_sums, 738 | cleared_and_allocated_accumulators, accumulator_updates)) 739 | if self.params.regression: 740 | updates.append(state_ops.scatter_update( 741 | self.variables.accumulator_squares, 742 | cleared_and_allocated_accumulators, accumulator_updates)) 743 | 744 | # Calculate values to put into scatter update for candidate splits. 745 | split_features_updates = array_ops.tile( 746 | array_ops.expand_dims( 747 | math_ops.neg(array_ops.ones_like( 748 | cleared_and_allocated_accumulators)), 1), 749 | [1, self.params.num_splits_to_consider]) 750 | updates.append(state_ops.scatter_update( 751 | self.variables.candidate_split_features, 752 | cleared_and_allocated_accumulators, split_features_updates)) 753 | 754 | updates += self.finish_iteration() 755 | 756 | return control_flow_ops.group(*updates) 757 | 758 | def finish_iteration(self): 759 | """Perform any operations that should be done at the end of an iteration. 760 | 761 | This is mostly useful for subclasses that need to reset variables after 762 | an iteration, such as ones that are used to finish nodes. 763 | 764 | Returns: 765 | A list of operations. 
766 | """ 767 | return [] 768 | 769 | def inference_graph(self, input_data, data_spec): 770 | """Constructs a TF graph for evaluating a random tree. 771 | 772 | Args: 773 | input_data: A tensor or SparseTensor or placeholder for input data. 774 | data_spec: A list of tf.dtype values specifying the original types of 775 | each column. 776 | 777 | Returns: 778 | The last op in the random tree inference graph. 779 | """ 780 | sparse_indices = [] 781 | sparse_values = [] 782 | sparse_shape = [] 783 | if isinstance(input_data, ops.SparseTensor): 784 | sparse_indices = input_data.indices 785 | sparse_values = input_data.values 786 | sparse_shape = input_data.shape 787 | input_data = [] 788 | return self.inference_ops.tree_predictions( 789 | input_data, sparse_indices, sparse_values, sparse_shape, data_spec, 790 | self.variables.tree, 791 | self.variables.tree_thresholds, 792 | self.variables.node_sums, 793 | valid_leaf_threshold=self.params.valid_leaf_threshold) 794 | 795 | def average_impurity(self): 796 | """Constructs a TF graph for evaluating the average leaf impurity of a tree. 797 | 798 | If in regression mode, this is the leaf variance. If in classification mode, 799 | this is the gini impurity. 800 | 801 | Returns: 802 | The last op in the graph. 803 | """ 804 | children = array_ops.squeeze(array_ops.slice( 805 | self.variables.tree, [0, 0], [-1, 1]), squeeze_dims=[1]) 806 | is_leaf = math_ops.equal(constants.LEAF_NODE, children) 807 | leaves = math_ops.to_int32(array_ops.squeeze(array_ops.where(is_leaf), 808 | squeeze_dims=[1])) 809 | counts = array_ops.gather(self.variables.node_sums, leaves) 810 | gini = self._weighted_gini(counts) 811 | # Guard against step 1, when there often are no leaves yet. 812 | def impurity(): 813 | return gini 814 | # Since average impurity can be used for loss, when there's no data just 815 | # return a big number so that loss always decreases. 816 | def big(): 817 | return array_ops.ones_like(gini, dtype=dtypes.float32) * 10000000. 818 | return control_flow_ops.cond(math_ops.greater( 819 | array_ops.shape(leaves)[0], 0), impurity, big) 820 | 821 | def size(self): 822 | """Constructs a TF graph for evaluating the current number of nodes. 823 | 824 | Returns: 825 | The current number of nodes in the tree. 826 | """ 827 | return self.variables.end_of_tree - 1 828 | 829 | def get_stats(self, session): 830 | num_nodes = self.variables.end_of_tree.eval(session=session) - 1 831 | num_leaves = array_ops.where( 832 | math_ops.equal(array_ops.squeeze(array_ops.slice( 833 | self.variables.tree, [0, 0], [-1, 1])), constants.LEAF_NODE) 834 | ).eval(session=session).shape[0] 835 | return TreeStats(num_nodes, num_leaves) 836 | -------------------------------------------------------------------------------- /rf/tensor_forest_test.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # ============================================================================== 15 | """Tests for tf.contrib.tensor_forest.ops.tensor_forest.""" 16 | from __future__ import absolute_import 17 | from __future__ import division 18 | from __future__ import print_function 19 | 20 | import tensorflow as tf 21 | 22 | from tensorflow.contrib.tensor_forest.python import tensor_forest 23 | 24 | from tensorflow.python.framework import test_util 25 | from tensorflow.python.platform import googletest 26 | 27 | 28 | class TensorForestTest(test_util.TensorFlowTestCase): 29 | 30 | def testForestHParams(self): 31 | hparams = tensor_forest.ForestHParams( 32 | num_classes=2, num_trees=100, max_nodes=1000, 33 | split_after_samples=25, num_features=60).fill() 34 | self.assertEquals(2, hparams.num_classes) 35 | self.assertEquals(3, hparams.num_output_columns) 36 | # sqrt(num_features) < 10, so num_splits_to_consider should be 10. 37 | self.assertEquals(10, hparams.num_splits_to_consider) 38 | # Don't have more fertile nodes than max # leaves, which is 500. 39 | self.assertEquals(500, hparams.max_fertile_nodes) 40 | # Default value of valid_leaf_threshold 41 | self.assertEquals(1, hparams.valid_leaf_threshold) 42 | # split_after_samples is larger than 10 43 | self.assertEquals(1, hparams.split_initializations_per_input) 44 | self.assertEquals(0, hparams.base_random_seed) 45 | 46 | def testForestHParamsBigTree(self): 47 | hparams = tensor_forest.ForestHParams( 48 | num_classes=2, num_trees=100, max_nodes=1000000, 49 | split_after_samples=25, 50 | num_features=1000).fill() 51 | # sqrt(1000) = 31.63... 52 | self.assertEquals(32, hparams.num_splits_to_consider) 53 | # 1000000 / 32 = 31250 54 | self.assertEquals(31250, hparams.max_fertile_nodes) 55 | # floor(31.63 / 25) = 1 56 | self.assertEquals(1, hparams.split_initializations_per_input) 57 | 58 | def testTrainingConstructionClassification(self): 59 | input_data = [[-1., 0.], [-1., 2.], # node 1 60 | [1., 0.], [1., -2.]] # node 2 61 | input_labels = [0, 1, 2, 3] 62 | 63 | params = tensor_forest.ForestHParams( 64 | num_classes=4, num_features=2, num_trees=10, max_nodes=1000, 65 | split_after_samples=25).fill() 66 | 67 | graph_builder = tensor_forest.RandomForestGraphs(params) 68 | graph = graph_builder.training_graph(input_data, input_labels) 69 | self.assertTrue(isinstance(graph, tf.Operation)) 70 | 71 | def testTrainingConstructionRegression(self): 72 | input_data = [[-1., 0.], [-1., 2.], # node 1 73 | [1., 0.], [1., -2.]] # node 2 74 | input_labels = [0, 1, 2, 3] 75 | 76 | params = tensor_forest.ForestHParams( 77 | num_classes=4, num_features=2, num_trees=10, max_nodes=1000, 78 | split_after_samples=25, regression=True).fill() 79 | 80 | graph_builder = tensor_forest.RandomForestGraphs(params) 81 | graph = graph_builder.training_graph(input_data, input_labels) 82 | self.assertTrue(isinstance(graph, tf.Operation)) 83 | 84 | def testInferenceConstruction(self): 85 | input_data = [[-1., 0.], [-1., 2.], # node 1 86 | [1., 0.], [1., -2.]] # node 2 87 | 88 | params = tensor_forest.ForestHParams( 89 | num_classes=4, num_features=2, num_trees=10, max_nodes=1000, 90 | split_after_samples=25).fill() 91 | 92 | graph_builder = tensor_forest.RandomForestGraphs(params) 93 | graph = graph_builder.inference_graph(input_data) 94 | self.assertTrue(isinstance(graph, tf.Tensor)) 95 | 96 | def testImpurityConstruction(self): 97 | params = tensor_forest.ForestHParams( 98 | num_classes=4, num_features=2, num_trees=10, max_nodes=1000, 99 | split_after_samples=25).fill() 100 | 101 
| graph_builder = tensor_forest.RandomForestGraphs(params) 102 | graph = graph_builder.average_impurity() 103 | self.assertTrue(isinstance(graph, tf.Tensor)) 104 | 105 | def testTrainingConstructionClassificationSparse(self): 106 | input_data = tf.SparseTensor( 107 | indices=[[0, 0], [0, 3], 108 | [1, 0], [1, 7], 109 | [2, 1], 110 | [3, 9]], 111 | values=[-1.0, 0.0, 112 | -1., 2., 113 | 1., 114 | -2.0], 115 | shape=[4, 10]) 116 | input_labels = [0, 1, 2, 3] 117 | 118 | params = tensor_forest.ForestHParams( 119 | num_classes=4, num_features=10, num_trees=10, max_nodes=1000, 120 | split_after_samples=25).fill() 121 | 122 | graph_builder = tensor_forest.RandomForestGraphs(params) 123 | graph = graph_builder.training_graph(input_data, input_labels) 124 | self.assertTrue(isinstance(graph, tf.Operation)) 125 | 126 | def testInferenceConstructionSparse(self): 127 | input_data = tf.SparseTensor( 128 | indices=[[0, 0], [0, 3], 129 | [1, 0], [1, 7], 130 | [2, 1], 131 | [3, 9]], 132 | values=[-1.0, 0.0, 133 | -1., 2., 134 | 1., 135 | -2.0], 136 | shape=[4, 10]) 137 | 138 | params = tensor_forest.ForestHParams( 139 | num_classes=4, num_features=10, num_trees=10, max_nodes=1000, 140 | split_after_samples=25).fill() 141 | 142 | graph_builder = tensor_forest.RandomForestGraphs(params) 143 | graph = graph_builder.inference_graph(input_data) 144 | self.assertTrue(isinstance(graph, tf.Tensor)) 145 | 146 | 147 | if __name__ == '__main__': 148 | googletest.main() 149 | -------------------------------------------------------------------------------- /tf/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lbkchen/deep-learning/ee2dee949d545d9b7cc1997998ee49e5d9bb2642/tf/__init__.py -------------------------------------------------------------------------------- /tf/mnist_sda.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example testing SDA model on MNIST digits. 
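
Pretrains a 784-500 autoencoder on MNIST batches, fine-tunes all layer
parameters together with a softmax output layer, writes the encoded test set
and its labels to csv, and then evaluates the tuned softmax parameters with
softmax.test_model.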
3 | """ 4 | 5 | from sdautoencoder import SDAutoencoder 6 | from softmax import test_model 7 | from tensorflow.examples.tutorials.mnist import input_data 8 | import tensorflow as tf 9 | 10 | mnist = input_data.read_data_sets("MNIST_data/", one_hot=True) 11 | 12 | 13 | def get_mnist_batch_generator(is_train, batch_size, batch_limit=100): 14 | if is_train: 15 | for _ in range(batch_limit): 16 | yield mnist.train.next_batch(batch_size) 17 | else: 18 | for _ in range(batch_limit): 19 | yield mnist.test.next_batch(batch_size) 20 | 21 | 22 | def get_mnist_batch_xs_generator(is_train, batch_size, batch_limit=100): 23 | for x, _ in get_mnist_batch_generator(is_train, batch_size, batch_limit): 24 | yield x 25 | 26 | 27 | def main(): 28 | sess = tf.Session() 29 | sda = SDAutoencoder(dims=[784, 500], 30 | activations=["sigmoid"], 31 | sess=sess, 32 | noise=0.40, 33 | loss="cross-entropy") 34 | 35 | mnist_train_gen_f = lambda: get_mnist_batch_xs_generator(True, batch_size=100, batch_limit=12000) 36 | 37 | sda.pretrain_network_gen(mnist_train_gen_f) 38 | trained_parameters = sda.finetune_parameters_gen(get_mnist_batch_generator(True, batch_size=100, batch_limit=18000), 39 | output_dim=10) 40 | transformed_filepath = "../data/mnist_test_transformed.csv" 41 | test_ys_filepath = "../data/mnist_test_ys.csv" 42 | output_filepath = "../data/mnist_pred_ys.csv" 43 | 44 | sda.write_encoded_input_with_ys(transformed_filepath, test_ys_filepath, 45 | get_mnist_batch_generator(False, batch_size=100, batch_limit=100)) 46 | sess.close() 47 | 48 | test_model(parameters_dict=trained_parameters, 49 | input_dim=sda.output_dim, 50 | output_dim=10, 51 | x_test_filepath=transformed_filepath, 52 | y_test_filepath=test_ys_filepath, 53 | output_filepath=output_filepath) 54 | 55 | if __name__ == "__main__": 56 | main() 57 | -------------------------------------------------------------------------------- /tf/sdautoencoder.py: -------------------------------------------------------------------------------- 1 | """Stacked Denoising Autoencoder Implementation""" 2 | 3 | import tensorflow as tf 4 | import numpy as np 5 | from math import sqrt 6 | from utils import * 7 | 8 | __author__ = "Ken Chen" 9 | __copyright__ = "Copyright (C) 2016 Ken Chen, HBI Solutions, Inc." 
10 | __version__ = "1.0" 11 | 12 | 13 | """ 14 | ########################### 15 | ### SETUP AND CONSTANTS ### 16 | ########################### 17 | """ 18 | 19 | 20 | ALLOWED_ACTIVATIONS = ["sigmoid", "tanh", "relu"] 21 | ALLOWED_LOSSES = ["rmse", "cross-entropy"] 22 | 23 | TENSORBOARD_LOGDIR = "../logs/tensorboard" 24 | TENSORBOARD_LOG_STEP = 100 25 | 26 | DEBUG = False 27 | 28 | 29 | """ 30 | ################### 31 | ### TENSORBOARD ### 32 | ################### 33 | """ 34 | 35 | 36 | def attach_variable_summaries(var, name, summ_list): 37 | """Attach statistical summaries to a tensor for tensorboard visualization.""" 38 | with tf.name_scope("summaries"): 39 | mean = tf.reduce_mean(var) 40 | summ_mean = tf.scalar_summary("mean/" + name, mean) 41 | with tf.name_scope('stddev'): 42 | stddev = tf.sqrt(tf.reduce_sum(tf.square(tf.sub(var, mean)))) 43 | summ_std = tf.scalar_summary('stddev/' + name, stddev) 44 | summ_max = tf.scalar_summary('max/' + name, tf.reduce_max(var)) 45 | summ_min = tf.scalar_summary('min/' + name, tf.reduce_min(var)) 46 | summ_hist = tf.histogram_summary(name, var) 47 | summ_list.extend([summ_mean, summ_std, summ_max, summ_min, summ_hist]) 48 | 49 | 50 | def attach_scalar_summary(var, name, summ_list): 51 | """Attach scalar summaries to a scalar.""" 52 | summ = tf.scalar_summary(tags=name, values=var) 53 | summ_list.append(summ) 54 | 55 | 56 | """ 57 | ############################ 58 | ### TENSORFLOW UTILITIES ### 59 | ############################ 60 | """ 61 | 62 | 63 | def weight_variable(input_dim, output_dim, name=None, stretch_factor=1, dtype=tf.float32): 64 | """Creates a weight variable with initial weights as recommended by Bengio. 65 | Reference: http://arxiv.org/pdf/1206.5533v2.pdf. If sigmoid is used as the activation 66 | function, then a stretch_factor of 4 is recommended.""" 67 | limit = sqrt(6 / (input_dim + output_dim)) 68 | initial = tf.random_uniform(shape=[input_dim, output_dim], 69 | minval=-(stretch_factor * limit), 70 | maxval=stretch_factor * limit, 71 | dtype=dtype) 72 | return tf.Variable(initial, name=name) 73 | 74 | 75 | def bias_variable(dim, initial_value=0.0, name=None, dtype=tf.float32): 76 | """Creates a bias variable with an initial constant value.""" 77 | return tf.Variable(tf.constant(value=initial_value, dtype=dtype, shape=[dim]), name=name) 78 | 79 | 80 | def corrupt(tensor, corruption_level=0.05): 81 | """Uses the masking noise algorithm to mask corruption_level proportion 82 | of the input. 83 | 84 | :param tensor: A tensor whose values are to be corrupted. 85 | :param corruption_level: An int [0, 1] specifying the probability to corrupt each value. 86 | :return: The corrupted tensor. 87 | """ 88 | total_samples = tf.reduce_prod(tf.shape(tensor)) 89 | corruption_matrix = tf.multinomial(tf.log([[corruption_level, 1 - corruption_level]]), total_samples) 90 | corruption_matrix = tf.cast(tf.reshape(corruption_matrix, shape=tf.shape(tensor)), dtype=tf.float32) 91 | return tf.mul(tensor, corruption_matrix) 92 | 93 | 94 | """ 95 | ############################ 96 | ### NEURAL NETWORK LAYER ### 97 | ############################ 98 | """ 99 | 100 | 101 | class NNLayer: 102 | """A container class to represent a hidden layer in the autoencoder network.""" 103 | 104 | def __init__(self, input_dim, output_dim, name="hidden_layer", activation=None, weights=None, biases=None): 105 | """Initializes an NNLayer with empty weights/biases (default). Weights/biases 106 | are meant to be updated during pre-training with set_wb. 
Also has methods to 107 | transform an input_tensor to an encoded representation via the weights/biases 108 | of the layer. 109 | 110 | :param input_dim: An int representing the dimension of input to this layer. 111 | :param output_dim: An int representing the dimension of the encoded output. 112 | :param activation: A function to transform the inputs to this layer (sigmoid, etc.). 113 | :param weights: A tensor with shape [input_dim, output_dim] 114 | :param biases: A tensor with shape [output_dim] 115 | """ 116 | self.input_dim = input_dim 117 | self.output_dim = output_dim 118 | self.name = name 119 | self.activation = activation 120 | self.weights = weights # Evaluated numpy array, static 121 | self.biases = biases # Evaluated numpy array, static 122 | self._weights = None # Weights Variable, dynamic 123 | self._biases = None # Biases Variable, dynamic 124 | 125 | @property 126 | def is_pretrained(self): 127 | return self.weights is not None and self.biases is not None 128 | 129 | def set_wb(self, weights, biases): 130 | """Used during pre-training for convenience.""" 131 | self.weights = weights # Evaluated numpy array 132 | self.biases = biases # Evaluated numpy array 133 | 134 | print("Set weights of layer with shape", weights.shape) 135 | print("Set biases of layer with shape", biases.shape) 136 | 137 | def set_wb_variables(self, summ_list): 138 | """This function is called at the beginning of supervised fine tuning to create new 139 | variables with initial values based on their static parameter counterparts. These 140 | variables can then all be adjusted simultaneously during the fine tune optimization.""" 141 | assert self.is_pretrained, "Cannot set Variables when not pretrained." 142 | with tf.name_scope(self.name): 143 | self._weights = tf.Variable(self.weights, dtype=tf.float32, name="weights") 144 | self._biases = tf.Variable(self.biases, dtype=tf.float32, name="biases") 145 | attach_variable_summaries(self._weights, name=self._weights.name, summ_list=summ_list) 146 | attach_variable_summaries(self._biases, name=self._biases.name, summ_list=summ_list) 147 | print("Created new weights and bias variables from current values.") 148 | 149 | def update_wb(self, sess): 150 | """This function is called at the end of supervised fine tuning to update the static 151 | weight and bias values to the newest snapshot of their dynamic variable counterparts.""" 152 | assert self._weights is not None and self._biases is not None, "Weights and biases Variables not set." 153 | self.weights = sess.run(self._weights) 154 | self.biases = sess.run(self._biases) 155 | print("Updated weights and biases with corresponding evaluated variable values.") 156 | 157 | def get_weight_variable(self): 158 | return self._weights 159 | 160 | def get_bias_variable(self): 161 | return self._biases 162 | 163 | def encode(self, input_tensor, use_variables=False): 164 | """Performs this layer's encoding on the input_tensor. use_variables is set to true 165 | during the fine-tuning stage, when all parameters of each layer need to be adjusted.""" 166 | assert self.is_pretrained, "Cannot encode when not pre-trained." 
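        # encode(x) = activation(x @ W + b). During fine-tuning
        # (use_variables=True) the tf.Variable copies are used so gradients can
        # update them; otherwise the static numpy weights/biases from
        # pretraining are used.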
167 | if use_variables: 168 | return self.activate(tf.matmul(input_tensor, self._weights) + self._biases) 169 | else: 170 | return self.activate(tf.matmul(input_tensor, self.weights) + self.biases) 171 | 172 | def activate(self, input_tensor, name=None): 173 | """Applies the activation function for this layer based on self.activation.""" 174 | if self.activation == "sigmoid": 175 | return tf.nn.sigmoid(input_tensor, name=name) 176 | if self.activation == "tanh": 177 | return tf.nn.tanh(input_tensor, name=name) 178 | if self.activation == "relu": 179 | return tf.nn.relu(input_tensor, name=name) 180 | else: 181 | print("Activation function not valid. Using the identity.") 182 | return input_tensor 183 | 184 | 185 | """ 186 | ##################################### 187 | ### STACKED DENOISING AUTOENCODER ### 188 | ##################################### 189 | """ 190 | 191 | 192 | class SDAutoencoder: 193 | """A stacked denoising autoencoder.""" 194 | 195 | def check_assertions(self): 196 | assert 0 <= self.noise <= 1, "Invalid noise value given: %s" % self.noise 197 | assert self.loss in ALLOWED_LOSSES 198 | 199 | def __init__(self, dims, activations, sess, noise=0.0, loss="cross-entropy", 200 | pretrain_lr=0.001, finetune_lr=0.001, batch_size=100, print_step=100): 201 | """Initializes a Stacked Denoising Autoencoder based on the dimension of each 202 | layer in the neural network and the activation function of each layer. SDA only 203 | undergoes parameter setup at initialization. Main functions to utilize the SDA are: 204 | 205 | pretrain_network: (unsupervised) Greedily pre-trains every layer of the neural network, 206 | beginning with feeding the raw data input to the first layer, and getting an encoded 207 | version from the output of the first layer. Adjusts parameters of the network (weights and 208 | biases of each layer) during training, via a stochastic Adam optimization method. 209 | 210 | finetune_parameters: (supervised) Adds a layer of fine-tuning to the network, adjusting 211 | the weights and biases of all layers simultaneously via a softmax classifier with test 212 | y-values. Also prints batch accuracy during each print step. 213 | 214 | write_encoded_input: Reads the x-values from a test data source and transforms them 215 | accordingly through the network (which has all parameters optimized from pre-training). 216 | Writes the newly represented features to a specified file. 217 | 218 | (Example usage) 219 | sda = SDAutoencoder([784, 400, 200, 10], ["relu", "relu", "relu"], noise=0.05) 220 | sda.pretrain_network(X_TRAIN_PATH) 221 | sda.finetune_parameters(X_TRAIN_PATH, Y_TRAIN_PATH) 222 | sda.write_encoded_input(your_filename, X_TEST_PATH) 223 | 224 | :param dims: A list of ints containing the dimensions of the x-values at each step of 225 | the network. The first entry is the overall input_dim, and the last entry is the 226 | overall output_dim from the network. 227 | :param activations: A list of activation functions for each layer in the network. 228 | :param sess: A tf.Session to be used by the autoencoder 229 | :param noise: A double from 0 to 1 representing the amount of masking on the input (noise). 230 | :param loss: A string representing the loss function used. 231 | :param pretrain_lr: A double representing the learning rate of the pretrain op. 232 | :param finetune_lr: A double representing the learning rate of the finetune op. 233 | :param batch_size: The number of cases fed to the network in each batch from file. 
234 | :param print_step: The number of batches processed before each print progress step. 235 | """ 236 | self.input_dim = dims[0] # The dimension of the raw input 237 | self.output_dim = dims[-1] # The output dimension of the last layer: fully encoded input 238 | self.hidden_layers = self.create_new_layers(dims, activations) 239 | self.sess = sess 240 | 241 | self.noise = noise 242 | self.loss = loss 243 | self.pretrain_lr = pretrain_lr 244 | self.finetune_lr = finetune_lr 245 | self.batch_size = batch_size 246 | self.print_step = print_step 247 | 248 | self.check_assertions() 249 | print("Initialized SDA network with dims %s, activations %s, noise %s, " 250 | "loss %s, pretraining learning rate %s, finetuning learning rate %s, and batch size %s." 251 | % (dims, activations, self.noise, self.loss, self.pretrain_lr, self.finetune_lr, self.batch_size)) 252 | 253 | @property 254 | def is_pretrained(self): 255 | """Returns whether the whole autoencoder network (all layers) is pre-trained.""" 256 | return all([layer.is_pretrained for layer in self.hidden_layers]) 257 | 258 | ########################## 259 | # VARIABLE CONFIGURATION # 260 | ########################## 261 | 262 | def get_all_variables(self, additional_vars=None): 263 | """Returns all trainable variables of the neural network.""" 264 | all_vars = [] 265 | for layer in self.hidden_layers: 266 | all_vars.extend([layer.get_weight_variable(), layer.get_bias_variable()]) 267 | if additional_vars: 268 | all_vars.extend(additional_vars) 269 | return all_vars 270 | 271 | def setup_all_variables(self, summ_list): 272 | """See NNLayer.set_wb_variables. Performs layer method on all hidden layers.""" 273 | for layer in self.hidden_layers: 274 | layer.set_wb_variables(summ_list) 275 | 276 | def finalize_all_variables(self): 277 | """See NNLayer.finalize_all_variables. Performs layer method on all hidden layers.""" 278 | for layer in self.hidden_layers: 279 | layer.update_wb(self.sess) 280 | 281 | def save_variables(self, filepath): 282 | """Saves all Tensorflow variables in the desired filepath.""" 283 | saver = tf.train.Saver() 284 | save_path = saver.save(self.sess, filepath) 285 | print("Model saved in file: %s" % save_path) 286 | 287 | ################ 288 | # WRITING DATA # 289 | ################ 290 | 291 | @staticmethod 292 | def write_data(data, filename): 293 | """Writes data in data_tensor and appends to the end of filename in csv format. 294 | 295 | :param data: A 2-dimensional numpy array. 296 | :param filename: A string representing the save filepath. 297 | :return: None 298 | """ 299 | with open(filename, "ab") as file: 300 | np.savetxt(file, data, delimiter=",") 301 | 302 | @stopwatch 303 | def write_encoded_input(self, filepath, x_test_path): 304 | """Reads from x_test_path and encodes the input through the entire model. Then 305 | writes the encoded result to filepath. Call this function after pretraining and 306 | fine-tuning to get the newly learned features. 307 | """ 308 | x_test = get_batch_generator(x_test_path, self.batch_size) 309 | self.write_encoded_input_gen(filepath, x_test_gen=x_test) 310 | 311 | @stopwatch 312 | def write_encoded_input_gen(self, filepath, x_test_gen): 313 | """Get encoded feature representation and writes to filepath. 314 | 315 | :param filepath: A string specifying the file path/name to write the encoded input to. 316 | :param x_test_gen: A generator that iterates through the x-test values. 
317 | :return: None 318 | """ 319 | sess = self.sess 320 | x_input = tf.placeholder(tf.float32, shape=[None, self.input_dim]) 321 | x_encoded = self.get_encoded_input(x_input, depth=-1, use_variables=False) 322 | 323 | print("Beginning to write to file.") 324 | for x_batch in x_test_gen: 325 | self.write_data(sess.run(x_encoded, feed_dict={x_input: x_batch}), filepath) 326 | print("Written encoded input to file %s" % filepath) 327 | 328 | def write_encoded_input_with_ys(self, filepath_x, filepath_y, xy_test_gen): 329 | """For use in testing MNIST. Writes the encoded x values along with their corresponding 330 | y values to file. 331 | 332 | :param filepath_x: A string, the filepath to store the encoded x values. 333 | :param filepath_y: A string, the filepath to store the y values. 334 | :param xy_test_gen: A generator that yields tuples of x and y test values. 335 | :return: None 336 | """ 337 | sess = self.sess 338 | x_input = tf.placeholder(tf.float32, shape=[None, self.input_dim]) 339 | x_encoded = self.get_encoded_input(x_input, depth=-1, use_variables=False) 340 | 341 | print("Beginning to write to file encoded x with ys.") 342 | for x_batch, y_batch in xy_test_gen: 343 | self.write_data(sess.run(x_encoded, feed_dict={x_input: x_batch}), filepath_x) 344 | self.write_data(y_batch, filepath_y) 345 | print("Written encoded input to file %s and test ys to %s" % (filepath_x, filepath_y)) 346 | 347 | ################### 348 | # GENERAL UTILITY # 349 | ################### 350 | 351 | def get_encoded_input(self, input_tensor, depth, use_variables=False): 352 | """Performs an encoding on input_tensor through the neural network depending on depth. 353 | If depth is 0, then input_tensor is simply returned. If depth is 3, then input_tensor 354 | will be encoded through the first three layers of the network. If depth is -1, then 355 | input_tensor will be encoded through the entire network. 356 | 357 | :param input_tensor: A tensor to encode. 358 | :param depth: The number of layers through which input_tensor will be encoded. If -1, 359 | then the full network encoding will be used. 360 | :param use_variables: A boolean representing whether to use tf.Variable representations 361 | of layer parameters. This is set to True only during the fine-tuning stage. 362 | :return: The encoded input_tensor. 363 | """ 364 | depth = len(self.hidden_layers) if depth == -1 else depth 365 | for i in range(depth): 366 | input_tensor = self.hidden_layers[i].encode(input_tensor, use_variables=use_variables) 367 | return input_tensor 368 | 369 | def get_loss(self, labels, values, epsilon=1e-10): 370 | """Returns the loss value between labels and values based on the method, either rmse 371 | or cross-entropy. 372 | 373 | Note: cross-entropy should only be used when the values are between 0 and 1.""" 374 | if self.loss == "rmse": 375 | return tf.sqrt(tf.reduce_mean(tf.square(tf.sub(labels, values)))) 376 | elif self.loss == "cross-entropy": 377 | return tf.reduce_mean(-tf.reduce_sum( 378 | labels * tf.log(values + epsilon) + (1 - labels) * tf.log(1 - values + epsilon), reduction_indices=[1] 379 | )) 380 | 381 | @staticmethod 382 | def create_new_layers(dims, activations): 383 | """Creates and sets up template layers (un-pretrained) for the network based on dimensions 384 | and activation functions. 385 | 386 | :param dims: Ex. [784, 200, 10] 387 | :param activations: Ex. 
['relu', 'relu'] 388 | :return: [NNLayer(input_dim=784, output_dim=200), NNLayer(input_dim=200, output_dim=10)] 389 | """ 390 | assert len(dims) >= 2 and len(activations) >= 1, "Invalid number of layers given by `dims` and `activations`." 391 | assert set(activations + ALLOWED_ACTIVATIONS) == set(ALLOWED_ACTIVATIONS), "Incorrect activation(s) given." 392 | assert len(dims) == len(activations) + 1, "Incorrect number of layers/activations." 393 | return [NNLayer(dims[i], dims[i + 1], "hidden_layer_" + str(i), activations[i]) 394 | for i in range(len(activations))] 395 | 396 | ############### 397 | # PRETRAINING # 398 | ############### 399 | 400 | @stopwatch 401 | def pretrain_network(self, x_train_path, epochs=1, batch_method="random"): 402 | """Pretrains the network using x-train values from a csv file. 403 | 404 | :param x_train_path: A string: the filepath to the train data. 405 | :param epochs: The number of epochs to iterate through the train data. 406 | :param batch_method: A string, either "random" or "sequential", indicating the method to 407 | use for batch generation (get_random_batch_generator vs. get_batch_generator). 408 | """ 409 | print("Starting to pretrain autoencoder network.") 410 | for i in range(len(self.hidden_layers)): 411 | if batch_method == "random": 412 | x_train = get_random_batch_generator(self.batch_size, x_train_path, repeat=epochs - 1) 413 | else: 414 | x_train = get_batch_generator(x_train_path, self.batch_size, repeat=epochs-1) 415 | self.pretrain_layer(i, x_train) 416 | print("Finished pretraining of autoencoder network.") 417 | 418 | @stopwatch 419 | def pretrain_network_gen(self, x_train_gen_f): 420 | """Pretrains the network with a generator supplying input. Use for testing MNIST. 421 | 422 | :param x_train_gen_f: A function that when called with no arguments returns a generator 423 | that iterates through the entire train dataset. 424 | :return: None 425 | """ 426 | print("Starting to pretrain autoencoder network.") 427 | for i in range(len(self.hidden_layers)): 428 | x_train_gen = x_train_gen_f() 429 | self.pretrain_layer(i, x_train_gen) 430 | print("Finished pretraining of autoencoder network.") 431 | 432 | def pretrain_layer(self, depth, batch_generator): 433 | """Pretrains the layer at depth `depth` feeding data from batch_generator. Do not call 434 | this method externally unless specific pretraining of a particular layer is required. 435 | Use `pretrain_network` instead.""" 436 | sess = self.sess 437 | 438 | print("Starting to pretrain layer %d." 
% depth) 439 | hidden_layer = self.hidden_layers[depth] 440 | summary_list = [] 441 | 442 | with tf.name_scope(hidden_layer.name): 443 | input_dim, output_dim = hidden_layer.input_dim, hidden_layer.output_dim 444 | 445 | with tf.name_scope("x_values"): 446 | x_original = tf.placeholder(tf.float32, shape=[None, self.input_dim]) 447 | x_latent = self.get_encoded_input(x_original, depth, use_variables=False) 448 | x_corrupt = corrupt(x_latent, corruption_level=self.noise) 449 | 450 | with tf.name_scope("encoding_vars"): 451 | stretch_factor = 4 if self.loss == "sigmoid" else 1 452 | encode = { 453 | "weights": weight_variable(input_dim, output_dim, name="weights", stretch_factor=stretch_factor), 454 | "biases": bias_variable(output_dim, initial_value=0, name="biases") 455 | } 456 | attach_variable_summaries(encode["weights"], encode["weights"].name, summ_list=summary_list) 457 | attach_variable_summaries(encode["biases"], encode["biases"].name, summ_list=summary_list) 458 | 459 | with tf.name_scope("decoding_vars"): 460 | decode = { 461 | "weights": tf.transpose(encode["weights"], name="transposed_weights"), # Tied weights 462 | "biases": bias_variable(input_dim, initial_value=0, name="decode_biases") 463 | } 464 | attach_variable_summaries(decode["weights"], decode["weights"].name, summ_list=summary_list) 465 | attach_variable_summaries(decode["biases"], decode["biases"].name, summ_list=summary_list) 466 | 467 | with tf.name_scope("encoded_and_decoded"): 468 | encoded = hidden_layer.activate(tf.matmul(x_corrupt, encode["weights"]) + encode["biases"]) 469 | decoded = hidden_layer.activate(tf.matmul(encoded, decode["weights"]) + decode["biases"]) 470 | attach_variable_summaries(encoded, "encoded", summ_list=summary_list) 471 | attach_variable_summaries(decoded, "decoded", summ_list=summary_list) 472 | 473 | # Reconstruction loss 474 | with tf.name_scope("reconstruction_loss"): 475 | loss = self.get_loss(x_latent, decoded) 476 | attach_scalar_summary(loss, "%s_loss" % self.loss, summ_list=summary_list) 477 | 478 | trainable_vars = [encode["weights"], encode["biases"], decode["biases"]] 479 | # Only optimize variables for this layer ("greedy") 480 | with tf.name_scope("train_step"): 481 | train_op = tf.train.AdamOptimizer(learning_rate=self.pretrain_lr).minimize( 482 | loss, var_list=trainable_vars) 483 | sess.run(tf.initialize_all_variables()) 484 | 485 | # Merge summaries and get a summary writer 486 | merged = tf.merge_summary(summary_list) 487 | pretrain_writer = tf.train.SummaryWriter(TENSORBOARD_LOGDIR + "/train/" + hidden_layer.name, sess.graph) 488 | 489 | step = 0 490 | for batch_x_original in batch_generator: 491 | sess.run(train_op, feed_dict={x_original: batch_x_original}) 492 | 493 | if step % self.print_step == 0: 494 | loss_value = sess.run(loss, feed_dict={x_original: batch_x_original}) 495 | print("Step %s, batch %s loss = %s" % (step, self.loss, loss_value)) 496 | 497 | if step % TENSORBOARD_LOG_STEP == 0: 498 | summary = sess.run(merged, feed_dict={x_original: batch_x_original}) 499 | pretrain_writer.add_summary(summary, global_step=step) 500 | 501 | # Break for debugging purposes 502 | if DEBUG and step > 5: 503 | break 504 | 505 | step += 1 506 | 507 | # Set the weights and biases of pretrained hidden layer 508 | hidden_layer.set_wb(weights=sess.run(encode["weights"]), biases=sess.run(encode["biases"])) 509 | print("Finished pretraining of layer %d. Updated layer weights and biases." 
% depth) 510 | 511 | ############## 512 | # FINETUNING # 513 | ############## 514 | 515 | @stopwatch 516 | def finetune_parameters(self, x_train_path, y_train_path, output_dim, epochs=1, batch_method="random"): 517 | """Performs fine tuning on all parameters of the neural network plus two additional softmax 518 | variables. Call this method after `pretrain_network` is complete. Y values should be represented 519 | in one-hot format. 520 | 521 | :param x_train_path: A string, the path to the x train values. 522 | :param y_train_path: A string, the path to the y train values. 523 | :param output_dim: An int, the number of classes in the target classification problem. Ex: 10 for MNIST. 524 | :param epochs: An int, the number of iterations to tune through the entire dataset. 525 | :param batch_method: A string, either 'random' or 'sequential', to indicate how batches are retrieved. 526 | :return: The tuned softmax parameters (weights and biases) of the classification layer. 527 | """ 528 | if batch_method == "random": 529 | xy_train = get_random_batch_generator(self.batch_size, x_train_path, y_train_path, repeat=epochs - 1) 530 | else: 531 | x_train = get_batch_generator(x_train_path, self.batch_size, repeat=epochs - 1) 532 | y_train = get_batch_generator(y_train_path, self.batch_size, repeat=epochs - 1) 533 | xy_train = merge_generators(x_train, y_train) 534 | return self.finetune_parameters_gen(xy_train_gen=xy_train, output_dim=output_dim) 535 | 536 | @stopwatch 537 | def finetune_parameters_gen(self, xy_train_gen, output_dim): 538 | """An implementation of finetuning to support data feeding from generators.""" 539 | sess = self.sess 540 | summary_list = [] 541 | 542 | print("Starting to fine tune parameters of network.") 543 | with tf.name_scope("finetuning"): 544 | self.setup_all_variables(summary_list) 545 | 546 | with tf.name_scope("inputs"): 547 | x = tf.placeholder(tf.float32, shape=[None, self.input_dim], name="raw_input") 548 | with tf.name_scope("fully_encoded"): 549 | x_encoded = self.get_encoded_input(x, depth=-1, use_variables=True) # Full depth encoding 550 | 551 | """Note on W below: The difference between self.output_dim and output_dim is that the former 552 | is the output dimension of the autoencoder stack, which is the dimension of the new feature 553 | space. The latter is the dimension of the y value space for classification. 
Ex: If the output 554 | should be binary, then the output_dim = 2.""" 555 | with tf.name_scope("softmax_variables"): 556 | W = weight_variable(self.output_dim, output_dim, name="weights") 557 | b = bias_variable(output_dim, initial_value=0, name="biases") 558 | attach_variable_summaries(W, W.name, summ_list=summary_list) 559 | attach_variable_summaries(b, b.name, summ_list=summary_list) 560 | 561 | with tf.name_scope("outputs"): 562 | y_logits = tf.matmul(x_encoded, W) + b 563 | with tf.name_scope("predicted"): 564 | y_pred = tf.nn.softmax(y_logits, name="y_pred") 565 | attach_variable_summaries(y_pred, y_pred.name, summ_list=summary_list) 566 | with tf.name_scope("actual"): 567 | y_actual = tf.placeholder(tf.float32, shape=[None, output_dim], name="y_actual") 568 | attach_variable_summaries(y_actual, y_actual.name, summ_list=summary_list) 569 | 570 | with tf.name_scope("cross_entropy"): 571 | cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_logits, y_actual)) 572 | attach_scalar_summary(cross_entropy, "cross_entropy", summ_list=summary_list) 573 | 574 | trainable_vars = self.get_all_variables(additional_vars=[W, b]) 575 | with tf.name_scope("train_step"): 576 | train_step = tf.train.AdamOptimizer(learning_rate=self.finetune_lr).minimize( 577 | cross_entropy, var_list=trainable_vars) 578 | 579 | with tf.name_scope("evaluation"): 580 | correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_actual, 1)) 581 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 582 | attach_scalar_summary(accuracy, "finetune_accuracy", summ_list=summary_list) 583 | 584 | sess.run(tf.initialize_all_variables()) 585 | 586 | # Merge summaries and get a summary writer 587 | merged = tf.merge_summary(summary_list) 588 | train_writer = tf.train.SummaryWriter(TENSORBOARD_LOGDIR + "/train/finetune", sess.graph) 589 | 590 | step = 0 591 | for batch_xs, batch_ys in xy_train_gen: 592 | if step % self.print_step == 0: 593 | print("Step %s, batch accuracy: " % step, 594 | sess.run(accuracy, feed_dict={x: batch_xs, y_actual: batch_ys})) 595 | 596 | # For debugging predicted y values 597 | if step % (self.print_step * 10) == 0: 598 | print("Predicted y-value:", sess.run(y_pred, feed_dict={x: batch_xs})[0]) 599 | print("Actual y-value:", batch_ys[0]) 600 | 601 | if step % TENSORBOARD_LOG_STEP == 0: 602 | summary = sess.run(merged, feed_dict={x: batch_xs, y_actual: batch_ys}) 603 | train_writer.add_summary(summary, global_step=step) 604 | 605 | # For debugging, break early. 
606 | if DEBUG and step > 5: 607 | break 608 | 609 | sess.run(train_step, feed_dict={x: batch_xs, y_actual: batch_ys}) 610 | step += 1 611 | 612 | self.finalize_all_variables() 613 | print("Completed fine-tuning of parameters.") 614 | tuned_params = {"weights": sess.run(W), "biases": sess.run(b)} 615 | 616 | return tuned_params 617 | -------------------------------------------------------------------------------- /tf/softmax.py: -------------------------------------------------------------------------------- 1 | from sdautoencoder import SDAutoencoder, get_batch_generator, merge_generators, stopwatch, DEBUG 2 | import tensorflow as tf 3 | import numpy as np 4 | 5 | 6 | # X_TRAIN_PATH = "../data/x_train_transformed_SAM_2.csv" 7 | # Y_TRAIN_PATH = "../data/splits/OPYTrainSAM.csv" 8 | # X_TEST_PATH = "../data/x_test_transformed_SAM_2.csv" 9 | # Y_TEST_PATH = "../data/splits/OPYTestSAM.csv" 10 | 11 | # NEED TO RENAME FOR EVERY TRIAL 12 | OUTPUT_PATH = "../data/ami/smote4k/outputs/pred_ys_8_10.csv" 13 | TRANSFORMED_PATH = "../data/ami/smote4k/outputs/x_test_transformed_8_10.csv" 14 | 15 | X_TRAIN_PATH = "../data/ami/smote4k/AMI_SAM_train_x.csv" 16 | Y_TRAIN_PATH = "../data/ami/smote4k/AMI_SAM_train_y.csv" 17 | X_TEST_PATH = "../data/ami/smote4k/AMI_SAM_test_x.csv" 18 | Y_TEST_PATH = "../data/ami/smote4k/AMI_SAM_test_y.csv" 19 | 20 | VARIABLE_SAVE_PATH = "../data/ami/smote4k/vars/last_vars.ckpt" 21 | 22 | 23 | def average(lst): 24 | return sum(lst) / len(lst) 25 | 26 | 27 | def append_with_limit(lst, val, limit=10): 28 | """Non-destructive function that returns a copy of the original list with the appended value and limit.""" 29 | lst_copy = lst[:] 30 | lst_copy.append(val) 31 | return lst_copy[-limit:] 32 | 33 | 34 | def write_data(data, filename): # FIXME: Copied from sda, should refactor to static 35 | """Writes data in data_tensor and appends to the end of filename in csv format. 36 | 37 | :param data: A 2-dimensional numpy array. 38 | :param filename: A string representing the save filepath. 
39 |     :return: None
40 |     """
41 |     with open(filename, "ab") as file:
42 |         np.savetxt(file, data, delimiter=",")
43 |
44 |
45 | # @stopwatch
46 | # def train_softmax(input_dim, output_dim, x_train_filepath, y_train_filepath, lr=0.001, batch_size=100,
47 | #                   print_step=50, epochs=1):
48 | #     """Trains a softmax model for prediction."""
49 | #     # Model input and parameters
50 | #     x = tf.placeholder(tf.float32, [None, input_dim])
51 | #     weights = tf.Variable(tf.truncated_normal(shape=[input_dim, output_dim], stddev=0.1))
52 | #     biases = tf.Variable(tf.constant(0.1, shape=[output_dim]))
53 | #
54 | #     # Outputs and true y-values
55 | #     y_logits = tf.matmul(x, weights) + biases
56 | #     y_pred = tf.nn.softmax(y_logits)
57 | #     y_actual = tf.placeholder(tf.float32, [None, output_dim])
58 | #
59 | #     # Cross entropy and training step
60 | #     cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_logits, labels=y_actual))
61 | #     train_step = tf.train.AdamOptimizer(learning_rate=lr).minimize(cross_entropy)
62 | #
63 | #     # Start session and run batches based on number of epochs
64 | #     sess = tf.Session()
65 | #     sess.run(tf.initialize_all_variables())
66 | #     x_train = get_batch_generator(filename=x_train_filepath, batch_size=batch_size,
67 | #                                   repeat=epochs - 1)
68 | #     y_train = get_batch_generator(filename=y_train_filepath, batch_size=batch_size,
69 | #                                   repeat=epochs - 1)
70 | #     step = 0
71 | #     accuracy_history = []
72 | #     for batch_xs, batch_ys in zip(x_train, y_train):
73 | #         sess.run(train_step, feed_dict={x: batch_xs, y_actual: batch_ys})
74 | #
75 | #         # Debug
76 | #         # if step == 100:
77 | #         #     break
78 | #
79 | #         # Assess training accuracy for current batch
80 | #         if step % print_step == 0:
81 | #             correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_actual, 1))
82 | #             accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
83 | #             accuracy_val = sess.run(accuracy, feed_dict={x: batch_xs, y_actual: batch_ys})
84 | #             print("Step %s, current batch training accuracy: %s" % (step, accuracy_val))
85 | #             accuracy_history = append_with_limit(accuracy_history, accuracy_val)
86 | #
87 | #         # Assess training accuracy for last 10 batches
88 | #         if step > 0 and step % (print_step * 10) == 0:
89 | #             print("Predicted y-values:\n", sess.run(y_pred, feed_dict={x: batch_xs}))
90 | #             print("Overall batch training accuracy for steps %s to %s: %s" % (step - 10 * print_step,
91 | #                                                                               step,
92 | #                                                                               average(accuracy_history)))
93 | #
94 | #         step += 1
95 | #
96 | #     parameters_dict = {
97 | #         "weights": sess.run(weights),
98 | #         "biases": sess.run(biases)
99 | #     }
100 | #     sess.close()
101 | #     return parameters_dict
102 |
103 |
104 | @stopwatch
105 | def test_model(parameters_dict, input_dim, output_dim, x_test_filepath, y_test_filepath, output_filepath,
106 |                batch_size=100, print_step=100):
107 |     x_test = get_batch_generator(filename=x_test_filepath, batch_size=batch_size)
108 |     y_test = get_batch_generator(filename=y_test_filepath, batch_size=batch_size)  # FIXME: Check if headers
109 |     xy_test_gen = merge_generators(x_test, y_test)
110 |     test_model_gen(parameters_dict, input_dim, output_dim, xy_test_gen, output_filepath, print_step)
111 |
112 |
113 | @stopwatch
114 | def test_model_gen(parameters_dict, input_dim, output_dim, xy_test_gen, output_filepath, print_step=100):
115 |     """Evaluates the softmax model defined by `parameters_dict` on batches of test data drawn
116 |     from `xy_test_gen`, appending the predicted y-values to `output_filepath` in csv format
117 |     and printing the running batch accuracy.
118 |
119 |     :param parameters_dict: Must contain keys 'weights' and 'biases' with their respective values
120 |     :param input_dim: An int, the dimension of each input x row; must match the shape of 'weights'.
121 |     :param output_dim: An int, the number of output classes.
122 |     :param xy_test_gen: A generator that yields (x_batch, y_batch) tuples of test data.
123 |     :param output_filepath: A string, the csv filepath to which predicted y-values are appended.
124 |     :param print_step: An int, the interval (in batches) at which batch accuracy is printed.
125 |     :return: None
126 |     """
127 |     # Model input and parameters
128 |     x = tf.placeholder(tf.float32, [None, input_dim])
129 |     weights = parameters_dict["weights"]
130 |     biases = parameters_dict["biases"]
131 |
132 |     # Outputs and true y-values
133 |     y_pred = tf.nn.softmax(tf.matmul(x, weights) + biases)
134 |     y_actual = tf.placeholder(tf.float32, [None, output_dim])
135 |
136 |     # Evaluate testing accuracy
137 |     correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_actual, 1))
138 |     accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
139 |     sess = tf.Session()
140 |
141 |     step = 0
142 |     accuracy_history = []
143 |     for batch_xs, batch_ys in xy_test_gen:
144 |         write_data(data=sess.run(y_pred, feed_dict={x: batch_xs}), filename=output_filepath)
145 |
146 |         # Break early if debug
147 |         if DEBUG and step == 10:
148 |             break
149 |
150 |         accuracy_val = sess.run(accuracy, feed_dict={x: batch_xs, y_actual: batch_ys})
151 |         accuracy_history.append(accuracy_val)
152 |
153 |         if step % print_step == 0:
154 |             print("Step %s, current batch testing accuracy: %s" % (step, accuracy_val))
155 |             # print("Predicted y-values:\n", sess.run(y_pred, feed_dict={x: batch_xs}))
156 |
157 |         step += 1
158 |     sess.close()
159 |     print("Testing complete and written to %s, overall accuracy: %s" % (output_filepath, average(accuracy_history)))
160 |
161 |
162 | @stopwatch
163 | def unsupervised():
164 |     sess = tf.Session()
165 |     sda = SDAutoencoder(dims=[4000, 1000, 500, 200],
166 |                         activations=["sigmoid", "sigmoid", "sigmoid"],
167 |                         sess=sess,
168 |                         noise=0.05,
169 |                         loss="rmse",
170 |                         batch_size=100,
171 |                         print_step=50)
172 |
173 |     layer_1_weights_path = "../data/outputs/last_weights"
174 |     layer_1_biases_path = "../data/outputs/last_biases"
175 |
176 |     sda.pretrain_network(X_TRAIN_PATH, epochs=8)
177 |     sda.write_data(sda.hidden_layers[1].weights, layer_1_weights_path)
178 |     sda.write_data(sda.hidden_layers[1].biases, layer_1_biases_path)
179 |     sda.write_encoded_input(TRANSFORMED_PATH, X_TEST_PATH)
180 |     sda.save_variables(VARIABLE_SAVE_PATH)
181 |     sess.close()
182 |
183 |
184 | @stopwatch
185 | def full_test():
186 |     sess = tf.Session()
187 |     sda = SDAutoencoder(dims=[4000, 400, 400, 400],
188 |                         activations=["sigmoid", "sigmoid", "sigmoid"],
189 |                         sess=sess,
190 |                         noise=0.20,
191 |                         loss="cross-entropy",
192 |                         pretrain_lr=1e-6,
193 |                         finetune_lr=1e-5,
194 |                         batch_size=50,
195 |                         print_step=500)
196 |
197 |     sda.pretrain_network(X_TRAIN_PATH, epochs=50)
198 |     trained_parameters = sda.finetune_parameters(X_TRAIN_PATH, Y_TRAIN_PATH, output_dim=2, epochs=80)
199 |     sda.write_encoded_input(TRANSFORMED_PATH, X_TEST_PATH)
200 |     sda.save_variables(VARIABLE_SAVE_PATH)
201 |     sess.close()
202 |
203 |     test_model(parameters_dict=trained_parameters,
204 |                input_dim=sda.output_dim,
205 |                output_dim=2,
206 |                x_test_filepath=TRANSFORMED_PATH,
207 |                y_test_filepath=Y_TEST_PATH,
208 |                output_filepath=OUTPUT_PATH)
209 |
210 |
211 | @stopwatch
212 | def main():
213 |     full_test()
214 |
215 |
216 | if __name__ == "__main__":
217 |     main()
218 |
--------------------------------------------------------------------------------
/tf/utils.py:
--------------------------------------------------------------------------------
1 | """
2 | Utility functions for SDA
3 |
4 | Includes batch generation methods, and generator repeating/merging.
5 |
6 | Ken Chen
7 | """
8 |
9 | import random
10 | import csv
11 | import time
12 | from math import ceil
13 | from functools import wraps
14 |
15 |
16 | def stopwatch(f):
17 |     """Simple decorator that prints the execution time of a function."""
18 |
19 |     @wraps(f)
20 |     def wrapped(*args, **kwargs):
21 |         start_time = time.time()
22 |         result = f(*args, **kwargs)
23 |         elapsed_time = time.time() - start_time
24 |         print("Total seconds elapsed for execution of %s:" % f.__name__, elapsed_time)
25 |         return result
26 |
27 |     return wrapped
28 |
29 |
30 | def file_len(filename):
31 |     """Returns the number of lines in a file."""
32 |     i = 0
33 |     with open(filename) as f:
34 |         for i, line in enumerate(f):
35 |             pass
36 |     return i + 1
37 |
38 |
39 | def get_batch_generator(filename, batch_size, repeat=0):
40 |     """Generator that sequentially gets batches of batch_size x or y values
41 |     from the given file.
42 |
43 |     :param filename: A string, the path of the csv file to read batches from.
44 |     :param batch_size: An int, the number of lines to include in each batch.
45 |     :param repeat: An int specifying the number of times to repeat going through
46 |         the file. Repeat of 2 will return a generator that iterates through the
47 |         full file three times before stopping iteration.
48 |     :return: A generator.
49 |     """
50 |     assert repeat < 1000, "Recursion depth will be exceeded."
51 |     with open(filename, "rt") as file:
52 |         reader = csv.reader(file)
53 |
54 |         index = 0
55 |         this_batch = []
56 |         for row in reader:
57 |             this_batch.append(row)
58 |             index += 1
59 |
60 |             if index % batch_size == 0:
61 |                 yield this_batch
62 |                 this_batch = []
63 |
64 |         # Catch any remainders in current data set
65 |         if this_batch:
66 |             yield this_batch
67 |
68 |     print("Finished a batch iteration through %s" % filename)
69 |     if repeat > 0:
70 |         for item in get_batch_generator(filename, batch_size, repeat - 1):
71 |             yield item
72 |
73 |
74 | def get_random_batch_generator(batch_size, filename, paired_filename=None, repeat=0):
75 |     """Given a csv file `filename` and a specified batch_size, returns a generator that randomly
76 |     yields `batch_size` cases from the file at a time and repeats its entire set of rows for
77 |     `repeat` number of times.
78 |
79 |     Note: use only for smaller files, as this process will consume significant memory.
80 |
81 |     :param batch_size: An int, the number of lines to include in each batch.
82 |     :param filename: A string, the path to the file to be batched.
83 |     :param paired_filename: A string (optional), the path to another file to be batched together
84 |         with `filename`.
85 |     :param repeat: An int, the number of times to repeat batching of the entire dataset.
86 |     :return: If `paired_filename` is not None, returns a generator that yields corresponding tuples
87 |         of batches from both datasets. If `paired_filename` is None, returns a generator that yields
88 |         just batches from `filename`.
89 |     """
90 |     def batch_list(lst):
91 |         return [lst[j*batch_size:(j+1)*batch_size] for j in range(int(ceil(len(lst) / batch_size)))]
92 |
93 |     for _ in range(repeat + 1):
94 |         with open(filename, "rt") as file:
95 |             if paired_filename:
96 |                 with open(paired_filename, "rt") as paired:
97 |                     paired = list(zip(list(csv.reader(file)), list(csv.reader(paired))))
98 |                     random.shuffle(paired)
99 |                     lines_0, lines_1 = list(zip(*paired))
100 |                     lines_0, lines_1 = batch_list(lines_0), batch_list(lines_1)
101 |                     for batch_0, batch_1 in zip(lines_0, lines_1):
102 |                         yield batch_0, batch_1
103 |             else:
104 |                 lines = list(csv.reader(file))
105 |                 random.shuffle(lines)
106 |                 lines = batch_list(lines)
107 |                 for batch in lines:
108 |                     yield batch
109 |
110 |
111 | def repeat_generator(f_gen, multiple=2):
112 |     """Repeats a generator.
113 |
114 |     :param f_gen: A function that when called with no arguments returns a generator
115 |         to be repeated.
116 |     :param multiple: The number of times the generator should be iterated through.
117 |     :return: A generator that iterates through the original generator `multiple`
118 |         number of times.
119 |     """
120 |     for _ in range(multiple):
121 |         gen = f_gen()
122 |         for item in gen:
123 |             yield item
124 |
125 |
126 | def merge_generators(gen_1, gen_2):
127 |     """Returns a generator that yields combined tuples of the results of `gen_1` and `gen_2`."""
128 |     for x, y in zip(gen_1, gen_2):
129 |         yield x, y
130 |
--------------------------------------------------------------------------------
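A usage sketch for `get_random_batch_generator` in `tf/utils.py` (the csv paths below are hypothetical placeholders, not files shipped with this repository): when `paired_filename` is supplied, the x and y rows are shuffled together, so each yielded pair of batches stays row-aligned.

```python
# Sketch only: the example_train_*.csv paths are stand-ins for real data files.
from utils import get_random_batch_generator

xy_train_gen = get_random_batch_generator(batch_size=100,
                                          filename="../data/example_train_x.csv",
                                          paired_filename="../data/example_train_y.csv",
                                          repeat=1)  # shuffles and batches the full files twice

for batch_xs, batch_ys in xy_train_gen:
    # Each batch is a list of csv rows (lists of strings); x and y rows remain aligned.
    assert len(batch_xs) == len(batch_ys)
```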
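Likewise, a minimal sketch of how `get_batch_generator`, `repeat_generator`, and `merge_generators` compose into the kind of repeated (x, y) stream the fine-tuning and testing routines consume; note that `repeat_generator` expects a zero-argument function that builds a fresh generator, not a generator object. The file paths are again placeholders.

```python
from utils import get_batch_generator, merge_generators, repeat_generator


def fresh_x_gen():  # placeholder path
    return get_batch_generator("../data/example_train_x.csv", batch_size=50)


def fresh_y_gen():  # placeholder path
    return get_batch_generator("../data/example_train_y.csv", batch_size=50)


# Three sequential passes over each file, zipped into (x_batch, y_batch) tuples.
xy_gen = merge_generators(repeat_generator(fresh_x_gen, multiple=3),
                          repeat_generator(fresh_y_gen, multiple=3))
```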