├── LICENCE ├── README.md ├── libsvm_to_tfrecord.py ├── resources │   └── dataflow.png ├── spark_to_libsvm.scala ├── test.py ├── train.py └── utils.py /LICENCE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Thorough Images 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 2 | 3 | # EasySparse 4 | 5 | ## Motivation 6 | In production environments, we find that TensorFlow deals poorly with sparse learning scenarios. Even after one reads records out of a TFRecord file, feeding them into a deep learning model can still be a pain. Thus, we have open-sourced this project to help researchers and engineers build their own models on sparse data while hiding the difficulties. 7 | 8 | This project naturally fits scenarios where one uses TensorFlow to build deep learning models on data acquired from Spark; it generalizes easily to other settings. 9 | 10 | ## Data Flow 11 |
<img src="resources/dataflow.png" alt="DataFlow"></div>
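The sketch below walks one sample through this flow; the LibSVM line and feature indices are made up for illustration, and the record schema mirrors `libsvm_to_tfrecord.py`:

```python
import tensorflow as tf

# One labelled sample as written by spark_to_libsvm.scala (illustrative values).
libsvm_line = "1.0 3:0.5 7:1.0"
parts = libsvm_line.split(" ")
label = float(parts[0])
ids = [int(p.split(":")[0]) for p in parts[1:]]       # [3, 7]
values = [float(p.split(":")[1]) for p in parts[1:]]  # [0.5, 1.0]

# libsvm_to_tfrecord.py serializes exactly these three features; train.py later
# reassembles (ids, values) into a tf.SparseTensor of shape
# [batch_size, feature_num] and feeds it to the first fully connected layer.
example = tf.train.Example(features=tf.train.Features(feature={
    "label": tf.train.Feature(float_list=tf.train.FloatList(value=[label])),
    "ids": tf.train.Feature(int64_list=tf.train.Int64List(value=ids)),
    "values": tf.train.Feature(float_list=tf.train.FloatList(value=values)),
}))
```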
12 | 13 | ## Programs 14 | `spark_to_libsvm.scala` Reads data from Spark into a LibSVM file, one-hot encoding features on demand. 15 | 16 | `libsvm_to_tfrecord.py` Converts a LibSVM file into a TFRecord file. 17 | 18 | `train.py` Training code for a fully connected NN with multi-GPU support. 19 | 20 | `test.py` Tests the performance of the trained model. 21 | 22 | `utils.py` Contains all the helper functions used in training and testing. 23 | 24 | ## Usage 25 | 1. Read data from Spark, one-hot encode some features, and write them into a LibSVM file. Be sure to manually split the data into three LibSVM files, one each for training, validation, and testing (`spark_to_libsvm.scala`). 26 | 2. Transform the training LibSVM file into a TFRecord file (`libsvm_to_tfrecord.py`). 27 | 3. Run the training program (`train.py`). 28 | 4. Test the trained model (`test.py`). 29 | 30 | ## Environment 31 | 1. Spark v2.0 32 | 2. TensorFlow >= v0.12.1 33 | 34 | ## Python Package Requirements 35 | 1. NumPy (required) 36 | 2. scikit-learn (only required in `test.py`) 37 | 3. TensorFlow (required, >= v0.12.1) 38 | 39 | ## Implementation Notes 40 | 1. During training, `train.py` reads all the validation data from the LibSVM file into memory and harvests shuffled training batches from the TFRecord file. Likewise, during testing, all the test data is read from the LibSVM file. Therefore, one does not need to convert the validation and test LibSVM files to TFRecords. This implementation may not work when the validation and test sets are too large to fit into memory, although that rarely happens since they are usually much smaller than the training set. If it does happen, one needs to write TFRecord file queues for the validation and test sets as well. 41 | 2. All the parameters required in the training process are defined at the top of `train.py`. Here we use a two-layer fully connected NN to model our data. Note that we have adopted AdamOptimizer and an exponentially decayed learning rate. 42 | 3. One may experiment with various types of deep learning models by modifying only the model definition in `train.py`. 43 | 44 | ## Contribution 45 | Contributions and comments are welcome! 46 | 47 | ## Licence 48 | MIT 49 | -------------------------------------------------------------------------------- /libsvm_to_tfrecord.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import tensorflow as tf 4 | 5 | 6 | def convert_tfrecords(input_filename, output_filename): 7 | """Convert the contents of a LibSVM file to a TFRecord file. 8 | Args: 9 | input_filename: LibSVM filename. 10 | output_filename: Desired TFRecord filename.
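    Example (illustrative paths):
        convert_tfrecords("data/train.libsvm", "data/train.tfrecords")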
11 | """ 12 | print("Starting to convert {} to {}...".format(input_filename, output_filename)) 13 | writer = tf.python_io.TFRecordWriter(output_filename) 14 | 15 | for line in open(input_filename, "r"): 16 | data = line.split(" ") 17 | label = float(data[0]) 18 | ids = [] 19 | values = [] 20 | for fea in data[1:]: 21 | id, value = fea.split(":") 22 | ids.append(int(id)) 23 | values.append(float(value)) 24 | # Write samples one by one 25 | example = tf.train.Example(features=tf.train.Features(feature={ 26 | "label": 27 | tf.train.Feature(float_list=tf.train.FloatList(value=[label])), 28 | "ids": 29 | tf.train.Feature(int64_list=tf.train.Int64List(value=ids)), 30 | "values": 31 | tf.train.Feature(float_list=tf.train.FloatList(value=values)) 32 | })) 33 | writer.write(example.SerializeToString()) 34 | writer.close() 35 | print("Successfully converted {} to {}!".format(input_filename, output_filename)) 36 | 37 | 38 | sess = tf.InteractiveSession() 39 | convert_tfrecords("/path/to/libsvm/file/train.libsvm", "/path/to/tfrecord/file/train.tfrecords") 40 | sess.close() 41 | -------------------------------------------------------------------------------- /resources/dataflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ThoroughImages/EasySparse/ebed5ee3c104523d35ccb62795b3b78b8df51d65/resources/dataflow.png -------------------------------------------------------------------------------- /spark_to_libsvm.scala: -------------------------------------------------------------------------------- 1 | /* 2 | * Program for: 3 | * 4 | * 1. Data Preprocessing 5 | * one-hot encoding for discrete features 6 | * 2. File Format Transformation 7 | * converting table in HDFS to 8 | * LibSVM format 9 | */ 10 | import org.apache.spark.rdd.RDD 11 | 12 | /** 13 | * Converting String to Double, 14 | * otherwise return 0.0. 15 | */ 16 | def parseDouble(s: String) = try { s.toDouble } catch { case _ => 0.0 } 17 | 18 | /** 19 | * Load RDD from HDFS and split each row 20 | * using sep as the delimiter string. 
21 | * 22 | * @param path the path of the source file in HDFS to be converted 23 | * @param sep the delimiter string, "\001" by default 24 | * 25 | * @return rdd of type RDD[Array[String]] 26 | */ 27 | def loadRddFromHdfs(path:String, sep:String="\001"): RDD[Array[String]] = { 28 | val rdd = sc.textFile(path) 29 | rdd.map(_.split(sep)) 30 | } 31 | 32 | /** 33 | * Converting a row of data to LibSVM format 34 | * 35 | */ 36 | def rowToLibSVM(row:Seq[Double]):String = { 37 | val label = row.head 38 | val featureIndex = row.tail.zipWithIndex 39 | val features = featureIndex.foldLeft(Seq.empty[String])((acc, e) => 40 | if(e._1 != 0) acc :+ s"${e._2 + 1}:${e._1}" else acc 41 | ) 42 | s"$label ${features mkString " "}" 43 | } 44 | 45 | /** 46 | * Converting an RDD to LibSVM format 47 | * 48 | * @param rdd the source RDD of split rows 49 | * @param columns the sequence of column names 50 | * @param columnTypeConfig configuration about the type of columns, including label, continuous 51 | * features, discrete features, and columns to be omitted 52 | * @param oneHotConfig names of columns and the corresponding sequences of enumerations 53 | * 54 | * @return rdd of rows in LibSVM format 55 | */ 56 | 57 | def rddToLibSVM(rdd:RDD[Array[String]], columns:Seq[String], 58 | columnTypeConfig:Map[String,Seq[String]], oneHotConfig:Map[String,Seq[String]]):RDD[String] = { 59 | val columnIndexMap = columns.zipWithIndex.toMap 60 | val oneHotMap:Map[Int, Map[String, Array[Double]]] = oneHotConfig.map{case (colName, enumSeq) => { 61 | val index = columnIndexMap(colName) 62 | val map = enumSeq.zipWithIndex.map(e => (e._1, Array.fill[Double](enumSeq.length)(0.0).updated(e._2, 1.0))).toMap 63 | (index -> map) 64 | }} 65 | val columnTypeIndex = columnTypeConfig.map(e => { 66 | (e._1, e._2.map(c => columnIndexMap(c))) 67 | }) 68 | rdd.map(row => { 69 | val rowMap = row.zipWithIndex.map(e => (e._2, e._1)).toMap 70 | val label = columnTypeIndex("label").map(e => parseDouble(rowMap(e))) 71 | val continuous = columnTypeIndex("continuousColumns").map(e => parseDouble(rowMap(e))) 72 | val discreteIndex = columnTypeIndex("discreteColumns").map(e => (e, rowMap(e))) // (index, value) 73 | val discrete = discreteIndex.map(e => oneHotMap(e._1)(e._2)).flatten // one-hot encoded discrete features 74 | rowToLibSVM(label ++ discrete ++ continuous) 75 | }) 76 | } 77 | 78 | 79 | /** 80 | * Simple Demo 81 | */ 82 | 83 | val columns = Seq("key", "col_1", "col_2", "col_3", "col_4", "col_5", "y") 84 | 85 | // Specify the type of columns 86 | val columnTypeConfig = Map( 87 | ("label", Seq("y")), 88 | ("continuousColumns", Seq("col_1", "col_2")), 89 | ("discreteColumns", Seq("col_3", "col_4")), 90 | ("omittedColumns", Seq("key")) 91 | ) 92 | 93 | // Specify columns of discrete features and the corresponding enumerations 94 | val oneHotConfig = Map ( 95 | "col_3" -> Seq("col_3_a", "col_3_b", "col_3_c"), 96 | "col_4" -> Seq("col_4_a", "col_4_b"), 97 | "col_5" -> Seq("col_5_a", "col_5_b", "col_5_c", "col_5_d") 98 | ) 99 | 100 | val filepath = "hdfs:///SOURCE_FILE_PATH" 101 | val savepath = "hdfs:///RESULT_PATH" 102 | val rdd = loadRddFromHdfs(filepath) 103 | val LibSVMRDD = rddToLibSVM(rdd, columns, columnTypeConfig, oneHotConfig) 104 | LibSVMRDD.saveAsTextFile(savepath) 105 | 106 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import sklearn.metrics as sk 4 | import sklearn.preprocessing as
pp 5 | import utils 6 | 7 | 8 | # TF config, allowing the GPU to be shared by multiple programs 9 | config = tf.ConfigProto() 10 | config.gpu_options.per_process_gpu_memory_fraction = 0.5 11 | config.gpu_options.allow_growth = True 12 | config.allow_soft_placement = True 13 | 14 | test_file = 'path/to/test/set/test.libsvm' 15 | model_file = 'path/to/model/file/model.ckpt' 16 | filter_number = 64 17 | feature_num = 20000 18 | gpu_num = 2 19 | batch_size = 2048  # Must match the batch_size used in training (see train.py) 20 | 21 | graph = tf.Graph() 22 | 23 | # Create the model: 2 layer FC NN, same as training 24 | with graph.as_default(): 25 | sp_indice = tf.placeholder(tf.int64) 26 | sp_value = tf.placeholder(tf.float32) 27 | x = tf.SparseTensor(sp_indice, sp_value, [batch_size, feature_num]) 28 | y_ = tf.placeholder(tf.float32, [None, 2]) 29 | keep_prob = tf.placeholder("float32") 30 | W_fc1 = utils.weight_variable([feature_num, filter_number]) 31 | b_fc1 = utils.bias_variable([filter_number]) 32 | W_fc2 = utils.weight_variable([filter_number, filter_number]) 33 | b_fc2 = utils.bias_variable([filter_number]) 34 | W_out = utils.weight_variable([filter_number, 2]) 35 | b_out = utils.bias_variable([2]) 36 | tower_pred = [[] for _ in range(gpu_num)] 37 | tower_loss = [0.0 for _ in range(gpu_num)] 38 | for i in range(gpu_num): 39 | with tf.device("/gpu:%d" % i): 40 | # We split the data into $gpu_num parts. 41 | next_batch = tf.sparse_split(split_dim=0, num_split=gpu_num, sp_input=x)[i] 42 | next_label = y_[i * batch_size / gpu_num: (i+1) * batch_size / gpu_num, :] 43 | hidden_1 = tf.nn.relu(tf.sparse_tensor_dense_matmul(next_batch, W_fc1) + b_fc1) 44 | hidden_2 = tf.nn.dropout(tf.nn.relu(tf.matmul(hidden_1, W_fc2) + b_fc2), keep_prob) 45 | tower_pred[i] = tf.nn.softmax(tf.matmul(hidden_2, W_out) + b_out) 46 | tower_loss[i] = -tf.reduce_mean(tf.reduce_sum(tf.cast(next_label, "float") * tf.log(tf.clip_by_value(tf.cast(tower_pred[i], "float"), 1e-10, 1.0)), reduction_indices=[1])) 47 | params = tf.trainable_variables() 48 | tf.get_variable_scope().reuse_variables() 49 | pred = tf.concat(0, [tower_pred[_] for _ in range(gpu_num)]) 50 | loss = tower_loss[0] 51 | for _ in range(1, gpu_num): 52 | loss = tf.add(loss, tower_loss[_]) 53 | loss = loss / (gpu_num + 0.0) 54 | 55 | 56 | with graph.as_default(): 57 | with tf.Session(config=config, graph=graph) as sess: 58 | saver = tf.train.Saver() 59 | sess.run(tf.initialize_all_variables()) 60 | saver.restore(sess, model_file) 61 | label_v, ids_v, values_v = utils.libsvm_data_read(test_file) 62 | test_num = int(label_v.shape[0] / batch_size) 63 | iter = 1 64 | test_ids = ids_v[0: batch_size] 65 | test_values = values_v[0: batch_size] 66 | test_labels = label_v[0: batch_size] 67 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids, test_values) 68 | y_prediction = sess.run(pred, feed_dict={sp_indice: ids_flatten_v , sp_value: value_flatten_v, y_: test_labels, keep_prob: 1.0}) 69 | test_y_real = test_labels 70 | for index in range(1, test_num): 71 | print 'Testing Stage ' + str(index) + ' / ' + str(test_num) 72 | test_ids = ids_v[index * batch_size: (index + 1) * batch_size] 73 | test_values = values_v[index * batch_size: (index + 1) * batch_size] 74 | test_labels = label_v[index * batch_size:(index + 1) * batch_size] 75 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids, test_values) 76 | test_y_real = np.concatenate((test_y_real, test_labels), 0) 77 | y_prediction = np.concatenate((y_prediction, sess.run(pred, feed_dict={sp_indice: ids_flatten_v, sp_value: value_flatten_v, y_: test_labels,
keep_prob: 1.0})), 0) 78 | test_y_real = np.argmax(test_y_real, 1) 79 | for threshold in range(0, 1001, 1): 80 | test_prediction = np.array(pp.binarize(np.array([y_prediction[:, 1]]), threshold / 1000.0)[0]) 81 | cm = sk.confusion_matrix(test_y_real, test_prediction) 82 | print cm 83 | # The former two values are used for ROC, the later two are used for P-R 84 | print str(cm[0][1] / (cm[0][1] + cm[0][0] + 0.0)) + '\t' + str(cm[1][1] / (cm[1][1] + cm[1][0] + 0.0)) + \ 85 | '\t' + str(cm[1][1] / (cm[1][1] + cm[1][0] + 0.0)) + '\t' + str(cm[1][1] / (cm[1][1] + cm[0][1] + 0.0)) 86 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import os 2 | import tensorflow as tf 3 | import numpy as np 4 | import utils 5 | 6 | 7 | # TF config, allowing to use GPU by multiple programs 8 | config = tf.ConfigProto() 9 | config.gpu_options.per_process_gpu_memory_fraction = 0.5 10 | config.gpu_options.allow_growth = True 11 | config.allow_soft_placement = True 12 | 13 | # Parameter Definition 14 | training_sample = os.path.join('/path/to/tfrecord', 'train.tfrecords') 15 | validation_sample = '/path/to/validation/set/validation.libsvm' 16 | model_path = './models/fcnn' 17 | filter_number = 64 18 | training_iteration = 10000 19 | batch_size = 2048 20 | display_step = 100 21 | test_step = 1000 22 | save_step = 1000 23 | shuffle_index = 0 24 | starter_learning_rate = 1e-3 25 | feature_num = 20000 # Should be the same as the sparse input shape 26 | epoch_num = None 27 | shuffle_or_not = True 28 | gpu_num = 2 29 | max_model_num_to_keep = 1000 30 | capacity = 40960 31 | min_after_dequeue = 10240 32 | learning_rate_decay_step = 1000 33 | learning_rate_decay_rate = 0.5 34 | 35 | # Define TF graph 36 | graph = tf.Graph() 37 | 38 | # Training sample queue 39 | with graph.as_default(): 40 | filename_queue = tf.train.string_input_producer([training_sample], num_epochs=epoch_num, shuffle=shuffle_or_not) 41 | label_batch, ids_batch, values_batch = utils.read_and_decode_batch(filename_queue, batch_size, capacity, min_after_dequeue) 42 | dense_values = tf.sparse_tensor_to_dense(values_batch, -1) # For further process 43 | dense_ids = tf.sparse_tensor_to_dense(ids_batch, -1) # For further process 44 | 45 | # Create the model: 2 layer FC NN 46 | with graph.as_default(): 47 | global_step = tf.Variable(0, trainable=False) 48 | # Here we use the indices and values to reproduce the input SparseTensor 49 | sp_indice = tf.placeholder(tf.int64) 50 | sp_value = tf.placeholder(tf.float32) 51 | x = tf.SparseTensor(sp_indice, sp_value, [batch_size, feature_num]) 52 | y_ = tf.placeholder(tf.float32, [None, 2]) 53 | keep_prob = tf.placeholder("float32") 54 | W_fc1 = utils.weight_variable([feature_num, filter_number]) 55 | b_fc1 = utils.bias_variable([filter_number]) 56 | W_fc2 = utils.weight_variable([filter_number, filter_number]) 57 | b_fc2 = utils.bias_variable([filter_number]) 58 | W_out = utils.weight_variable([filter_number, 2]) 59 | b_out = utils.bias_variable([2]) 60 | learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, learning_rate_decay_step, learning_rate_decay_rate, staircase=True) 61 | opt_= tf.train.AdamOptimizer(learning_rate) 62 | tower_grads = [] 63 | tower_pred = [[] for _ in range(gpu_num)] 64 | tower_loss = [0.0 for _ in range(gpu_num)] 65 | for i in range(gpu_num): 66 | with tf.device("/gpu:%d" % i): 67 | # We split the data into $gpu_num parts. 
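            # Each tower takes the i-th slice of the sparse batch and of the labels,
            # runs the same two fully connected layers (weights are shared across GPUs),
            # and computes its own cross-entropy loss; the per-tower gradients gathered
            # in tower_grads are averaged by utils.average_gradients() and applied once
            # per step further below.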
68 | next_batch = tf.sparse_split(split_dim=0, num_split=gpu_num, sp_input=x)[i] 69 | next_label = y_[i * batch_size / gpu_num: (i+1) * batch_size / gpu_num, :] 70 | hidden_1 = tf.nn.relu(tf.sparse_tensor_dense_matmul(next_batch, W_fc1) + b_fc1) 71 | hidden_2 = tf.nn.dropout(tf.nn.relu(tf.matmul(hidden_1, W_fc2) + b_fc2), keep_prob) 72 | tower_pred[i] = tf.nn.softmax(tf.matmul(hidden_2, W_out) + b_out) 73 | tower_loss[i] = -tf.reduce_mean(tf.reduce_sum(tf.cast(next_label, "float") * tf.log(tf.clip_by_value(tf.cast(tower_pred[i], "float"), 1e-10, 1.0)), reduction_indices=[1])) 74 | params = tf.trainable_variables() 75 | tf.get_variable_scope().reuse_variables() 76 | grads = opt_.compute_gradients(tower_loss[i], var_list = params) 77 | tower_grads.append(grads) 78 | pred = tf.concat(0, [tower_pred[_] for _ in range(gpu_num)]) 79 | loss = tower_loss[0] 80 | for _ in range(1, gpu_num): 81 | loss = tf.add(loss, tower_loss[_]) 82 | loss = loss / (gpu_num + 0.0) 83 | grads_ave = utils.average_gradients(tower_grads) 84 | train_step = opt_.apply_gradients(grads_ave, global_step=global_step) 85 | 86 | with graph.as_default(): 87 | with tf.Session(config=config, graph=graph) as sess: 88 | saver = tf.train.Saver(max_to_keep=max_model_num_to_keep) 89 | sess.run(tf.initialize_all_variables()) 90 | coord = tf.train.Coordinator() 91 | threads = tf.train.start_queue_runners(coord=coord, sess=sess) 92 | print 'Reading From LibSVM file...' 93 | label_v, ids_v, values_v = utils.libsvm_data_read(validation_sample) 94 | test_num = int(label_v.shape[0] / batch_size) 95 | iter = 1 96 | print 'Starting Training Process...' 97 | try: 98 | while not coord.should_stop(): 99 | label, ids, values = sess.run([label_batch, dense_ids, dense_values]) 100 | label, ids_flatten, value_flatten = utils.sparse_tensor_to_train_batch(label, ids, values) 101 | sess.run(train_step, feed_dict={sp_indice: ids_flatten, sp_value: value_flatten, y_: label, keep_prob: 0.5}) 102 | if iter % display_step == 0: 103 | print "Iteration:", '%04d' % (iter), ", Training Sample Loss: ", "{:.9f}".format( 104 | sess.run(loss, feed_dict={sp_indice: ids_flatten, sp_value: value_flatten, y_: label, keep_prob: 1.0})) 105 | if iter % test_step == 0: 106 | test_loss = 0.0 107 | for index in range(0, test_num): 108 | test_ids = ids_v[index * batch_size: (index + 1) * batch_size] 109 | test_values = values_v[index * batch_size: (index + 1) * batch_size] 110 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids,test_values) 111 | test_labels = label_v[index * batch_size:(index + 1) * batch_size] 112 | test_loss += sess.run(loss, feed_dict={sp_indice: ids_flatten_v , sp_value: value_flatten_v ,y_: test_labels, keep_prob: 1.0}) 113 | test_loss /= (test_num + 0.0) 114 | print 'Validation Loss: %s' % str(test_loss) 115 | if iter % save_step == 0: 116 | save_path = saver.save(sess, model_path + '-' + str(iter) + '.ckpt') 117 | iter += 1 118 | except tf.errors.OutOfRangeError: 119 | print("Training Done!") 120 | finally: 121 | coord.request_stop() 122 | coord.join(threads) 123 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | 5 | def read_and_decode_batch(filename_queue, batch_size, capacity, min_after_dequeue): 6 | """Dequeue a batch of data from the TFRecord. 7 | Args: 8 | filename_queue: Filename Queue of the TFRecord. 
9 | batch_size: How many records dequeued each time. 10 | capacity: The capacity of the queue. 11 | min_after_dequeue: Ensures a minimum amount of shuffling of examples. 12 | Returns: 13 | List of the dequeued (batch_label, batch_ids, batch_values). 14 | """ 15 | reader = tf.TFRecordReader() 16 | _, serialized_example = reader.read(filename_queue) 17 | batch_serialized_example = tf.train.shuffle_batch([serialized_example], 18 | batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue) 19 | # The feature definition here should BE consistent with LibSVM TO TFRecord process. 20 | features = tf.parse_example(batch_serialized_example, 21 | features={ 22 | "label": tf.FixedLenFeature([], tf.float32), 23 | "ids": tf.VarLenFeature(tf.int64), 24 | "values": tf.VarLenFeature(tf.float32) 25 | }) 26 | batch_label = features["label"] 27 | batch_ids = features["ids"] 28 | batch_values = features["values"] 29 | return batch_label, batch_ids, batch_values 30 | 31 | 32 | def sparse_tensor_to_train_batch(dense_label, dense_ids, dense_values): 33 | """Transform the dence ids and values to TF understandable inputs. Meanwhile, one-hot encode the labels. 34 | For instance, for dense_ids in the form of 35 | [[1, 4, 6, -1], 36 | [2, 3, -1, -1], 37 | [3, 4, 5, 6], ... 38 | ] 39 | should be transformed into 40 | [[0, 1], [0, 4], [0, 6], 41 | [1, 2], [1, 3], 42 | [2, 3], [2, 4], [2, 5], [2, 6], ... 43 | ] 44 | For dense_values in the form of: 45 | [[0.01, 0.23, 0.45, -1], 46 | [0.34, 0.25, -1, -1], 47 | [0.23, 0.78, 0.12, 0.56], ... 48 | ] 49 | should be transformed into 50 | [0.01, 0.23, 0.45, 0.34, 0.25, 0.23, 0.78, 0.12, 0.56, ...] 51 | Args: 52 | dense_label: Labels. 53 | dense_ids: Sparse indices. 54 | dense_values: Sparse values. 55 | Returns: 56 | List of the processed (label, ids, values) ready for training inputs. 57 | """ 58 | indice_flatten = [] 59 | values_flatten = [] 60 | label_flatten = [] 61 | index = 0 62 | for i in range(0, dense_label.shape[0]): 63 | if int(float(dense_label[i])) == 0: 64 | label_flatten.append([1.0, 0.0]) 65 | else: 66 | label_flatten.append([0.0, 1.0]) 67 | values = list(dense_values) 68 | indice = list(dense_ids) 69 | for j in range(0,len(indice[i])): 70 | if not indice[i][j] == -1: 71 | indice_flatten.append([index,indice[i][j]]) 72 | values_flatten.append(values[i][j]) 73 | else: 74 | break 75 | index += 1 76 | return np.array(label_flatten), indice_flatten, values_flatten 77 | 78 | 79 | def libsvm_data_read(input_filename): 80 | """Read all the data from the LibSVM file. 81 | Args: 82 | input_filename: Filename of the LibSVM. 83 | Returns: 84 | List of the acquired (label, ids, values). 85 | """ 86 | labels = [] 87 | ids_all = [] 88 | values_all = [] 89 | for line in open(input_filename, 'r'): 90 | data = line.split(' ') 91 | if int(float(data[0])) == 0: 92 | labels.append([1.0, 0.0]) 93 | else: 94 | labels.append([0.0, 1.0]) 95 | ids = [] 96 | values = [] 97 | for fea in data[1:]: 98 | id, value = fea.split(':') 99 | ids.append(int(id)) 100 | values.append(float(value)) 101 | ids_all.append(ids) 102 | values_all.append(values) 103 | return np.array(labels), np.array(ids_all), np.array(values_all) 104 | 105 | 106 | def libsvm_convert_sparse_tensor(array_ids, array_values): 107 | """Transform the contents into TF understandable formats, which is similar to 108 | sparse_tensor_to_train_batch(). 109 | Args: 110 | array_ids: Sparse indices. 111 | array_values: Sparse values. 112 | Returns: 113 | List of the transformed (ids, values). 
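    Example:
        array_ids [[1, 4], [2]] with array_values [[0.5, 0.3], [0.7]] become
        indices [[0, 1], [0, 4], [1, 2]] and values [0.5, 0.3, 0.7].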
114 | """ 115 | indice_flatten_v = [] 116 | values_flatten_v = [] 117 | index = 0 118 | for i in range(0, array_ids.shape[0]): 119 | for j in range(0, len(array_ids[i])): 120 | indice_flatten_v.append([index, array_ids[i][j]]) 121 | values_flatten_v.append(array_values[i][j]) 122 | index += 1 123 | return indice_flatten_v, values_flatten_v 124 | 125 | 126 | def weight_variable(shape): 127 | initial = tf.truncated_normal(shape, stddev=0.1) 128 | return tf.Variable(initial) 129 | 130 | 131 | def bias_variable(shape): 132 | initial = tf.constant(0.1, shape=shape) 133 | return tf.Variable(initial) 134 | 135 | def average_gradients(tower_grads): 136 | """Calculate the average gradient for each shared variable across all towers. 137 | Note that this function provides a synchronization point across all towers. 138 | Args: 139 | tower_grads: List of lists of (gradient, variable) tuples. The outer list 140 | is over individual gradients. The inner list is over the gradient 141 | calculation for each tower. 142 | Returns: 143 | List of pairs of (gradient, variable) where the gradient has been averaged 144 | across all towers. 145 | """ 146 | average_grads = [] 147 | for grad_and_vars in zip(*tower_grads): 148 | # Note that each grad_and_vars looks like the following: 149 | # ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN)) 150 | grads = [] 151 | for g, _ in grad_and_vars: 152 | # Add 0 dimension to the gradients to represent the tower. 153 | expanded_g = tf.expand_dims(g, 0) 154 | 155 | # Append on a 'tower' dimension which we will average over below. 156 | grads.append(expanded_g) 157 | 158 | # Average over the 'tower' dimension. 159 | grad = tf.concat(0, grads) 160 | grad = tf.reduce_mean(grad, 0) 161 | 162 | # Keep in mind that the Variables are redundant because they are shared 163 | # across towers. So .. we will just return the first tower's pointer to 164 | # the Variable. 165 | v = grad_and_vars[0][1] 166 | grad_and_var = (grad, v) 167 | average_grads.append(grad_and_var) 168 | return average_grads 169 | --------------------------------------------------------------------------------