├── LICENCE ├── README.md ├── libsvm_to_tfrecord.py ├── resources │   └── dataflow.png ├── spark_to_libsvm.scala ├── test.py ├── train.py └── utils.py /LICENCE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Thorough Images 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 2 | 3 | # EasySparse 4 | 5 | ## Motivation 6 | In production environments, we find that TensorFlow deals poorly with sparse learning scenarios. Even after one reads records out of a TFRecord file, feeding them into a deep learning model can still be a pain. Thus, we have open-sourced this project to help researchers and engineers build their own models on sparse data while hiding the difficulties. 7 | 8 | This project naturally fits scenarios where one uses TensorFlow to build deep learning models on data acquired from Spark; it generalizes easily to other settings. 9 | 10 | ## Data Flow 11 |
<img src="resources/dataflow.png" alt="DataFlow"></div>
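The sketch below walks one sample through this flow; the LibSVM line and feature indices are made up for illustration, and the record schema mirrors `libsvm_to_tfrecord.py`:

```python
import tensorflow as tf

# One labelled sample as written by spark_to_libsvm.scala (illustrative values).
libsvm_line = "1.0 3:0.5 7:1.0"
parts = libsvm_line.split(" ")
label = float(parts[0])
ids = [int(p.split(":")[0]) for p in parts[1:]]       # [3, 7]
values = [float(p.split(":")[1]) for p in parts[1:]]  # [0.5, 1.0]

# libsvm_to_tfrecord.py serializes exactly these three features; train.py later
# reassembles (ids, values) into a tf.SparseTensor of shape
# [batch_size, feature_num] and feeds it to the first fully connected layer.
example = tf.train.Example(features=tf.train.Features(feature={
    "label": tf.train.Feature(float_list=tf.train.FloatList(value=[label])),
    "ids": tf.train.Feature(int64_list=tf.train.Int64List(value=ids)),
    "values": tf.train.Feature(float_list=tf.train.FloatList(value=values)),
}))
```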
12 | 13 | ## Programs 14 | `spark_to_libsvm.scala` Reads data from Spark into a LibSVM file, one-hot encoding features on demand. 15 | 16 | `libsvm_to_tfrecord.py` Converts a LibSVM file into a TFRecord file. 17 | 18 | `train.py` Training code for a fully connected NN with multi-GPU support. 19 | 20 | `test.py` Tests the performance of the trained model. 21 | 22 | `utils.py` Contains all the helper functions used in training and testing. 23 | 24 | ## Usage 25 | 1. Read data from Spark, one-hot encode some features, and write them into a LibSVM file. Be sure to manually split the data into three LibSVM files, one each for training, validation, and testing (`spark_to_libsvm.scala`). 26 | 2. Transform the training LibSVM file into a TFRecord file (`libsvm_to_tfrecord.py`). 27 | 3. Run the training program (`train.py`). 28 | 4. Test the trained model (`test.py`). 29 | 30 | ## Environment 31 | 1. Spark v2.0 32 | 2. TensorFlow >= v0.12.1 33 | 34 | ## Python Package Requirements 35 | 1. NumPy (required) 36 | 2. scikit-learn (only required in `test.py`) 37 | 3. TensorFlow (required, >= v0.12.1) 38 | 39 | ## Implementation Notes 40 | 1. During training, `train.py` reads all the validation data from the LibSVM file into memory and harvests shuffled training batches from the TFRecord file. Likewise, during testing, all the test data is read from the LibSVM file. Therefore, one does not need to convert the validation and test LibSVM files to TFRecords. This implementation may not work when the validation and test sets are too large to fit into memory, although that rarely happens since they are usually much smaller than the training set. If it does happen, one needs to write TFRecord file queues for the validation and test sets as well. 41 | 2. All the parameters required in the training process are defined at the top of `train.py`. Here we use a two-layer fully connected NN to model our data. Note that we have adopted AdamOptimizer and an exponentially decayed learning rate. 42 | 3. One may experiment with various types of deep learning models by modifying only the model definition in `train.py`. 43 | 44 | ## Contribution 45 | Contributions and comments are welcome! 46 | 47 | ## Licence 48 | MIT 49 | -------------------------------------------------------------------------------- /libsvm_to_tfrecord.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import tensorflow as tf 4 | 5 | 6 | def convert_tfrecords(input_filename, output_filename): 7 | """Convert the contents of a LibSVM file to a TFRecord file. 8 | Args: 9 | input_filename: LibSVM filename. 10 | output_filename: Desired TFRecord filename.
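    Example (illustrative paths):
        convert_tfrecords("data/train.libsvm", "data/train.tfrecords")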
11 | """ 12 | print("Starting to convert {} to {}...".format(input_filename, output_filename)) 13 | writer = tf.python_io.TFRecordWriter(output_filename) 14 | 15 | for line in open(input_filename, "r"): 16 | data = line.split(" ") 17 | label = float(data[0]) 18 | ids = [] 19 | values = [] 20 | for fea in data[1:]: 21 | id, value = fea.split(":") 22 | ids.append(int(id)) 23 | values.append(float(value)) 24 | # Write samples one by one 25 | example = tf.train.Example(features=tf.train.Features(feature={ 26 | "label": 27 | tf.train.Feature(float_list=tf.train.FloatList(value=[label])), 28 | "ids": 29 | tf.train.Feature(int64_list=tf.train.Int64List(value=ids)), 30 | "values": 31 | tf.train.Feature(float_list=tf.train.FloatList(value=values)) 32 | })) 33 | writer.write(example.SerializeToString()) 34 | writer.close() 35 | print("Successfully converted {} to {}!".format(input_filename, output_filename)) 36 | 37 | 38 | sess = tf.InteractiveSession() 39 | convert_tfrecords("/path/to/libsvm/file/train.libsvm", "/path/to/tfrecord/file/train.tfrecords") 40 | sess.close() 41 | -------------------------------------------------------------------------------- /resources/dataflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ThoroughImages/EasySparse/ebed5ee3c104523d35ccb62795b3b78b8df51d65/resources/dataflow.png -------------------------------------------------------------------------------- /spark_to_libsvm.scala: -------------------------------------------------------------------------------- 1 | /* 2 | * Program for: 3 | * 4 | * 1. Data Preprocessing 5 | * one-hot encoding for discrete features 6 | * 2. File Format Transformation 7 | * converting table in HDFS to 8 | * LibSVM format 9 | */ 10 | import org.apache.spark.rdd.RDD 11 | 12 | /** 13 | * Converting String to Double, 14 | * otherwise return 0.0. 15 | */ 16 | def parseDouble(s: String) = try { s.toDouble } catch { case _ => 0.0 } 17 | 18 | /** 19 | * Load RDD from HDFS and split each row 20 | * using sep as the delimiter string. 
21 | * 22 | * @param path the path of the source file in HDFS to be converted 23 | * @param sep the delimiter string, "\001" by default 24 | * 25 | * @return rdd of type RDD[Array[String]] 26 | */ 27 | def loadRddFromHdfs(path:String, sep:String="\001"): RDD[Array[String]] = { 28 | val rdd = sc.textFile(path) 29 | rdd.map(_.split(sep)) 30 | } 31 | 32 | /** 33 | * Converting a row of data to LibSVM format 34 | * 35 | */ 36 | def rowToLibSVM(row:Seq[Double]):String = { 37 | val label = row.head 38 | val featureIndex = row.tail.zipWithIndex 39 | val features = featureIndex.foldLeft(Seq.empty[String])((acc, e) => 40 | if(e._1 != 0) acc :+ s"${e._2 + 1}:${e._1}" else acc 41 | ) 42 | s"$label ${features mkString " "}" 43 | } 44 | 45 | /** 46 | * Converting an RDD to LibSVM format 47 | * 48 | * @param rdd the source RDD of split rows 49 | * @param columns the sequence of column names 50 | * @param columnTypeConfig configuration about the type of columns, including label, continuous 51 | * features, discrete features, and columns to be omitted 52 | * @param oneHotConfig names of columns and the corresponding sequences of enumerations 53 | * 54 | * @return rdd of rows in LibSVM format 55 | */ 56 | 57 | def rddToLibSVM(rdd:RDD[Array[String]], columns:Seq[String], 58 | columnTypeConfig:Map[String,Seq[String]], oneHotConfig:Map[String,Seq[String]]):RDD[String] = { 59 | val columnIndexMap = columns.zipWithIndex.toMap 60 | val oneHotMap:Map[Int, Map[String, Array[Double]]] = oneHotConfig.map{case (colName, enumSeq) => { 61 | val index = columnIndexMap(colName) 62 | val map = enumSeq.zipWithIndex.map(e => (e._1, Array.fill[Double](enumSeq.length)(0.0).updated(e._2, 1.0))).toMap 63 | (index -> map) 64 | }} 65 | val columnTypeIndex = columnTypeConfig.map(e => { 66 | (e._1, e._2.map(c => columnIndexMap(c))) 67 | }) 68 | rdd.map(row => { 69 | val rowMap = row.zipWithIndex.map(e => (e._2, e._1)).toMap 70 | val label = columnTypeIndex("label").map(e => parseDouble(rowMap(e))) 71 | val continuous = columnTypeIndex("continuousColumns").map(e => parseDouble(rowMap(e))) 72 | val discreteIndex = columnTypeIndex("discreteColumns").map(e => (e, rowMap(e))) // (index, value) 73 | val discrete = discreteIndex.map(e => oneHotMap(e._1)(e._2)).flatten // one-hot encoded discrete features 74 | rowToLibSVM(label ++ discrete ++ continuous) 75 | }) 76 | } 77 | 78 | 79 | /** 80 | * Simple Demo 81 | */ 82 | 83 | val columns = Seq("key", "col_1", "col_2", "col_3", "col_4", "col_5", "y") 84 | 85 | // Specify the type of columns 86 | val columnTypeConfig = Map( 87 | ("label", Seq("y")), 88 | ("continuousColumns", Seq("col_1", "col_2")), 89 | ("discreteColumns", Seq("col_3", "col_4")), 90 | ("omittedColumns", Seq("key")) 91 | ) 92 | 93 | // Specify columns of discrete features and the corresponding enumerations 94 | val oneHotConfig = Map ( 95 | "col_3" -> Seq("col_3_a", "col_3_b", "col_3_c"), 96 | "col_4" -> Seq("col_4_a", "col_4_b"), 97 | "col_5" -> Seq("col_5_a", "col_5_b", "col_5_c", "col_5_d") 98 | ) 99 | 100 | val filepath = "hdfs:///SOURCE_FILE_PATH" 101 | val savepath = "hdfs:///RESULT_PATH" 102 | val rdd = loadRddFromHdfs(filepath) 103 | val LibSVMRDD = rddToLibSVM(rdd, columns, columnTypeConfig, oneHotConfig) 104 | LibSVMRDD.saveAsTextFile(savepath) 105 | 106 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import sklearn.metrics as sk 4 | import sklearn.preprocessing as
pp 5 | import utils 6 | 7 | 8 | # TF config, allowing the GPU to be shared by multiple programs 9 | config = tf.ConfigProto() 10 | config.gpu_options.per_process_gpu_memory_fraction = 0.5 11 | config.gpu_options.allow_growth = True 12 | config.allow_soft_placement = True 13 | 14 | test_file = 'path/to/test/set/test.libsvm' 15 | model_file = 'path/to/model/file/model.ckpt' 16 | filter_number = 64 17 | feature_num = 20000 18 | gpu_num = 2 19 | batch_size = 2048  # Must match the batch_size used in training (see train.py) 20 | 21 | graph = tf.Graph() 22 | 23 | # Create the model: 2 layer FC NN, same as training 24 | with graph.as_default(): 25 | sp_indice = tf.placeholder(tf.int64) 26 | sp_value = tf.placeholder(tf.float32) 27 | x = tf.SparseTensor(sp_indice, sp_value, [batch_size, feature_num]) 28 | y_ = tf.placeholder(tf.float32, [None, 2]) 29 | keep_prob = tf.placeholder("float32") 30 | W_fc1 = utils.weight_variable([feature_num, filter_number]) 31 | b_fc1 = utils.bias_variable([filter_number]) 32 | W_fc2 = utils.weight_variable([filter_number, filter_number]) 33 | b_fc2 = utils.bias_variable([filter_number]) 34 | W_out = utils.weight_variable([filter_number, 2]) 35 | b_out = utils.bias_variable([2]) 36 | tower_pred = [[] for _ in range(gpu_num)] 37 | tower_loss = [0.0 for _ in range(gpu_num)] 38 | for i in range(gpu_num): 39 | with tf.device("/gpu:%d" % i): 40 | # We split the data into $gpu_num parts. 41 | next_batch = tf.sparse_split(split_dim=0, num_split=gpu_num, sp_input=x)[i] 42 | next_label = y_[i * batch_size / gpu_num: (i+1) * batch_size / gpu_num, :] 43 | hidden_1 = tf.nn.relu(tf.sparse_tensor_dense_matmul(next_batch, W_fc1) + b_fc1) 44 | hidden_2 = tf.nn.dropout(tf.nn.relu(tf.matmul(hidden_1, W_fc2) + b_fc2), keep_prob) 45 | tower_pred[i] = tf.nn.softmax(tf.matmul(hidden_2, W_out) + b_out) 46 | tower_loss[i] = -tf.reduce_mean(tf.reduce_sum(tf.cast(next_label, "float") * tf.log(tf.clip_by_value(tf.cast(tower_pred[i], "float"), 1e-10, 1.0)), reduction_indices=[1])) 47 | params = tf.trainable_variables() 48 | tf.get_variable_scope().reuse_variables() 49 | pred = tf.concat(0, [tower_pred[_] for _ in range(gpu_num)]) 50 | loss = tower_loss[0] 51 | for _ in range(1, gpu_num): 52 | loss = tf.add(loss, tower_loss[_]) 53 | loss = loss / (gpu_num + 0.0) 54 | 55 | 56 | with graph.as_default(): 57 | with tf.Session(config=config, graph=graph) as sess: 58 | saver = tf.train.Saver() 59 | sess.run(tf.initialize_all_variables()) 60 | saver.restore(sess, model_file) 61 | label_v, ids_v, values_v = utils.libsvm_data_read(test_file) 62 | test_num = int(label_v.shape[0] / batch_size) 63 | iter = 1 64 | test_ids = ids_v[0: batch_size] 65 | test_values = values_v[0: batch_size] 66 | test_labels = label_v[0: batch_size] 67 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids, test_values) 68 | y_prediction = sess.run(pred, feed_dict={sp_indice: ids_flatten_v , sp_value: value_flatten_v, y_: test_labels, keep_prob: 1.0}) 69 | test_y_real = test_labels 70 | for index in range(1, test_num): 71 | print 'Testing Stage ' + str(index) + ' / ' + str(test_num) 72 | test_ids = ids_v[index * batch_size: (index + 1) * batch_size] 73 | test_values = values_v[index * batch_size: (index + 1) * batch_size] 74 | test_labels = label_v[index * batch_size:(index + 1) * batch_size] 75 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids, test_values) 76 | test_y_real = np.concatenate((test_y_real, test_labels), 0) 77 | y_prediction = np.concatenate((y_prediction, sess.run(pred, feed_dict={sp_indice: ids_flatten_v, sp_value: value_flatten_v, y_: test_labels,
keep_prob: 1.0})), 0) 78 | test_y_real = np.argmax(test_y_real, 1) 79 | for threshold in range(0, 1001, 1): 80 | test_prediction = np.array(pp.binarize(np.array([y_prediction[:, 1]]), threshold / 1000.0)[0]) 81 | cm = sk.confusion_matrix(test_y_real, test_prediction) 82 | print cm 83 | # The former two values are used for ROC, the later two are used for P-R 84 | print str(cm[0][1] / (cm[0][1] + cm[0][0] + 0.0)) + '\t' + str(cm[1][1] / (cm[1][1] + cm[1][0] + 0.0)) + \ 85 | '\t' + str(cm[1][1] / (cm[1][1] + cm[1][0] + 0.0)) + '\t' + str(cm[1][1] / (cm[1][1] + cm[0][1] + 0.0)) 86 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import os 2 | import tensorflow as tf 3 | import numpy as np 4 | import utils 5 | 6 | 7 | # TF config, allowing to use GPU by multiple programs 8 | config = tf.ConfigProto() 9 | config.gpu_options.per_process_gpu_memory_fraction = 0.5 10 | config.gpu_options.allow_growth = True 11 | config.allow_soft_placement = True 12 | 13 | # Parameter Definition 14 | training_sample = os.path.join('/path/to/tfrecord', 'train.tfrecords') 15 | validation_sample = '/path/to/validation/set/validation.libsvm' 16 | model_path = './models/fcnn' 17 | filter_number = 64 18 | training_iteration = 10000 19 | batch_size = 2048 20 | display_step = 100 21 | test_step = 1000 22 | save_step = 1000 23 | shuffle_index = 0 24 | starter_learning_rate = 1e-3 25 | feature_num = 20000 # Should be the same as the sparse input shape 26 | epoch_num = None 27 | shuffle_or_not = True 28 | gpu_num = 2 29 | max_model_num_to_keep = 1000 30 | capacity = 40960 31 | min_after_dequeue = 10240 32 | learning_rate_decay_step = 1000 33 | learning_rate_decay_rate = 0.5 34 | 35 | # Define TF graph 36 | graph = tf.Graph() 37 | 38 | # Training sample queue 39 | with graph.as_default(): 40 | filename_queue = tf.train.string_input_producer([training_sample], num_epochs=epoch_num, shuffle=shuffle_or_not) 41 | label_batch, ids_batch, values_batch = utils.read_and_decode_batch(filename_queue, batch_size, capacity, min_after_dequeue) 42 | dense_values = tf.sparse_tensor_to_dense(values_batch, -1) # For further process 43 | dense_ids = tf.sparse_tensor_to_dense(ids_batch, -1) # For further process 44 | 45 | # Create the model: 2 layer FC NN 46 | with graph.as_default(): 47 | global_step = tf.Variable(0, trainable=False) 48 | # Here we use the indices and values to reproduce the input SparseTensor 49 | sp_indice = tf.placeholder(tf.int64) 50 | sp_value = tf.placeholder(tf.float32) 51 | x = tf.SparseTensor(sp_indice, sp_value, [batch_size, feature_num]) 52 | y_ = tf.placeholder(tf.float32, [None, 2]) 53 | keep_prob = tf.placeholder("float32") 54 | W_fc1 = utils.weight_variable([feature_num, filter_number]) 55 | b_fc1 = utils.bias_variable([filter_number]) 56 | W_fc2 = utils.weight_variable([filter_number, filter_number]) 57 | b_fc2 = utils.bias_variable([filter_number]) 58 | W_out = utils.weight_variable([filter_number, 2]) 59 | b_out = utils.bias_variable([2]) 60 | learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, learning_rate_decay_step, learning_rate_decay_rate, staircase=True) 61 | opt_= tf.train.AdamOptimizer(learning_rate) 62 | tower_grads = [] 63 | tower_pred = [[] for _ in range(gpu_num)] 64 | tower_loss = [0.0 for _ in range(gpu_num)] 65 | for i in range(gpu_num): 66 | with tf.device("/gpu:%d" % i): 67 | # We split the data into $gpu_num parts. 
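            # Each tower takes the i-th slice of the sparse batch and of the labels,
            # runs the same two fully connected layers (weights are shared across GPUs),
            # and computes its own cross-entropy loss; the per-tower gradients gathered
            # in tower_grads are averaged by utils.average_gradients() and applied once
            # per step further below.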
68 | next_batch = tf.sparse_split(split_dim=0, num_split=gpu_num, sp_input=x)[i] 69 | next_label = y_[i * batch_size / gpu_num: (i+1) * batch_size / gpu_num, :] 70 | hidden_1 = tf.nn.relu(tf.sparse_tensor_dense_matmul(next_batch, W_fc1) + b_fc1) 71 | hidden_2 = tf.nn.dropout(tf.nn.relu(tf.matmul(hidden_1, W_fc2) + b_fc2), keep_prob) 72 | tower_pred[i] = tf.nn.softmax(tf.matmul(hidden_2, W_out) + b_out) 73 | tower_loss[i] = -tf.reduce_mean(tf.reduce_sum(tf.cast(next_label, "float") * tf.log(tf.clip_by_value(tf.cast(tower_pred[i], "float"), 1e-10, 1.0)), reduction_indices=[1])) 74 | params = tf.trainable_variables() 75 | tf.get_variable_scope().reuse_variables() 76 | grads = opt_.compute_gradients(tower_loss[i], var_list = params) 77 | tower_grads.append(grads) 78 | pred = tf.concat(0, [tower_pred[_] for _ in range(gpu_num)]) 79 | loss = tower_loss[0] 80 | for _ in range(1, gpu_num): 81 | loss = tf.add(loss, tower_loss[_]) 82 | loss = loss / (gpu_num + 0.0) 83 | grads_ave = utils.average_gradients(tower_grads) 84 | train_step = opt_.apply_gradients(grads_ave, global_step=global_step) 85 | 86 | with graph.as_default(): 87 | with tf.Session(config=config, graph=graph) as sess: 88 | saver = tf.train.Saver(max_to_keep=max_model_num_to_keep) 89 | sess.run(tf.initialize_all_variables()) 90 | coord = tf.train.Coordinator() 91 | threads = tf.train.start_queue_runners(coord=coord, sess=sess) 92 | print 'Reading From LibSVM file...' 93 | label_v, ids_v, values_v = utils.libsvm_data_read(validation_sample) 94 | test_num = int(label_v.shape[0] / batch_size) 95 | iter = 1 96 | print 'Starting Training Process...' 97 | try: 98 | while not coord.should_stop(): 99 | label, ids, values = sess.run([label_batch, dense_ids, dense_values]) 100 | label, ids_flatten, value_flatten = utils.sparse_tensor_to_train_batch(label, ids, values) 101 | sess.run(train_step, feed_dict={sp_indice: ids_flatten, sp_value: value_flatten, y_: label, keep_prob: 0.5}) 102 | if iter % display_step == 0: 103 | print "Iteration:", '%04d' % (iter), ", Training Sample Loss: ", "{:.9f}".format( 104 | sess.run(loss, feed_dict={sp_indice: ids_flatten, sp_value: value_flatten, y_: label, keep_prob: 1.0})) 105 | if iter % test_step == 0: 106 | test_loss = 0.0 107 | for index in range(0, test_num): 108 | test_ids = ids_v[index * batch_size: (index + 1) * batch_size] 109 | test_values = values_v[index * batch_size: (index + 1) * batch_size] 110 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids,test_values) 111 | test_labels = label_v[index * batch_size:(index + 1) * batch_size] 112 | test_loss += sess.run(loss, feed_dict={sp_indice: ids_flatten_v , sp_value: value_flatten_v ,y_: test_labels, keep_prob: 1.0}) 113 | test_loss /= (test_num + 0.0) 114 | print 'Validation Loss: %s' % str(test_loss) 115 | if iter % save_step == 0: 116 | save_path = saver.save(sess, model_path + '-' + str(iter) + '.ckpt') 117 | iter += 1 118 | except tf.errors.OutOfRangeError: 119 | print("Training Done!") 120 | finally: 121 | coord.request_stop() 122 | coord.join(threads) 123 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | 4 | 5 | def read_and_decode_batch(filename_queue, batch_size, capacity, min_after_dequeue): 6 | """Dequeue a batch of data from the TFRecord. 7 | Args: 8 | filename_queue: Filename Queue of the TFRecord. 
9 | batch_size: How many records dequeued each time. 10 | capacity: The capacity of the queue. 11 | min_after_dequeue: Ensures a minimum amount of shuffling of examples. 12 | Returns: 13 | List of the dequeued (batch_label, batch_ids, batch_values). 14 | """ 15 | reader = tf.TFRecordReader() 16 | _, serialized_example = reader.read(filename_queue) 17 | batch_serialized_example = tf.train.shuffle_batch([serialized_example], 18 | batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue) 19 | # The feature definition here should BE consistent with LibSVM TO TFRecord process. 20 | features = tf.parse_example(batch_serialized_example, 21 | features={ 22 | "label": tf.FixedLenFeature([], tf.float32), 23 | "ids": tf.VarLenFeature(tf.int64), 24 | "values": tf.VarLenFeature(tf.float32) 25 | }) 26 | batch_label = features["label"] 27 | batch_ids = features["ids"] 28 | batch_values = features["values"] 29 | return batch_label, batch_ids, batch_values 30 | 31 | 32 | def sparse_tensor_to_train_batch(dense_label, dense_ids, dense_values): 33 | """Transform the dence ids and values to TF understandable inputs. Meanwhile, one-hot encode the labels. 34 | For instance, for dense_ids in the form of 35 | [[1, 4, 6, -1], 36 | [2, 3, -1, -1], 37 | [3, 4, 5, 6], ... 38 | ] 39 | should be transformed into 40 | [[0, 1], [0, 4], [0, 6], 41 | [1, 2], [1, 3], 42 | [2, 3], [2, 4], [2, 5], [2, 6], ... 43 | ] 44 | For dense_values in the form of: 45 | [[0.01, 0.23, 0.45, -1], 46 | [0.34, 0.25, -1, -1], 47 | [0.23, 0.78, 0.12, 0.56], ... 48 | ] 49 | should be transformed into 50 | [0.01, 0.23, 0.45, 0.34, 0.25, 0.23, 0.78, 0.12, 0.56, ...] 51 | Args: 52 | dense_label: Labels. 53 | dense_ids: Sparse indices. 54 | dense_values: Sparse values. 55 | Returns: 56 | List of the processed (label, ids, values) ready for training inputs. 57 | """ 58 | indice_flatten = [] 59 | values_flatten = [] 60 | label_flatten = [] 61 | index = 0 62 | for i in range(0, dense_label.shape[0]): 63 | if int(float(dense_label[i])) == 0: 64 | label_flatten.append([1.0, 0.0]) 65 | else: 66 | label_flatten.append([0.0, 1.0]) 67 | values = list(dense_values) 68 | indice = list(dense_ids) 69 | for j in range(0,len(indice[i])): 70 | if not indice[i][j] == -1: 71 | indice_flatten.append([index,indice[i][j]]) 72 | values_flatten.append(values[i][j]) 73 | else: 74 | break 75 | index += 1 76 | return np.array(label_flatten), indice_flatten, values_flatten 77 | 78 | 79 | def libsvm_data_read(input_filename): 80 | """Read all the data from the LibSVM file. 81 | Args: 82 | input_filename: Filename of the LibSVM. 83 | Returns: 84 | List of the acquired (label, ids, values). 85 | """ 86 | labels = [] 87 | ids_all = [] 88 | values_all = [] 89 | for line in open(input_filename, 'r'): 90 | data = line.split(' ') 91 | if int(float(data[0])) == 0: 92 | labels.append([1.0, 0.0]) 93 | else: 94 | labels.append([0.0, 1.0]) 95 | ids = [] 96 | values = [] 97 | for fea in data[1:]: 98 | id, value = fea.split(':') 99 | ids.append(int(id)) 100 | values.append(float(value)) 101 | ids_all.append(ids) 102 | values_all.append(values) 103 | return np.array(labels), np.array(ids_all), np.array(values_all) 104 | 105 | 106 | def libsvm_convert_sparse_tensor(array_ids, array_values): 107 | """Transform the contents into TF understandable formats, which is similar to 108 | sparse_tensor_to_train_batch(). 109 | Args: 110 | array_ids: Sparse indices. 111 | array_values: Sparse values. 112 | Returns: 113 | List of the transformed (ids, values). 
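    Example:
        array_ids [[1, 4], [2]] with array_values [[0.5, 0.3], [0.7]] become
        indices [[0, 1], [0, 4], [1, 2]] and values [0.5, 0.3, 0.7].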
114 | """ 115 | indice_flatten_v = [] 116 | values_flatten_v = [] 117 | index = 0 118 | for i in range(0, array_ids.shape[0]): 119 | for j in range(0, len(array_ids[i])): 120 | indice_flatten_v.append([index, array_ids[i][j]]) 121 | values_flatten_v.append(array_values[i][j]) 122 | index += 1 123 | return indice_flatten_v, values_flatten_v 124 | 125 | 126 | def weight_variable(shape): 127 | initial = tf.truncated_normal(shape, stddev=0.1) 128 | return tf.Variable(initial) 129 | 130 | 131 | def bias_variable(shape): 132 | initial = tf.constant(0.1, shape=shape) 133 | return tf.Variable(initial) 134 | 135 | def average_gradients(tower_grads): 136 | """Calculate the average gradient for each shared variable across all towers. 137 | Note that this function provides a synchronization point across all towers. 138 | Args: 139 | tower_grads: List of lists of (gradient, variable) tuples. The outer list 140 | is over individual gradients. The inner list is over the gradient 141 | calculation for each tower. 142 | Returns: 143 | List of pairs of (gradient, variable) where the gradient has been averaged 144 | across all towers. 145 | """ 146 | average_grads = [] 147 | for grad_and_vars in zip(*tower_grads): 148 | # Note that each grad_and_vars looks like the following: 149 | # ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN)) 150 | grads = [] 151 | for g, _ in grad_and_vars: 152 | # Add 0 dimension to the gradients to represent the tower. 153 | expanded_g = tf.expand_dims(g, 0) 154 | 155 | # Append on a 'tower' dimension which we will average over below. 156 | grads.append(expanded_g) 157 | 158 | # Average over the 'tower' dimension. 159 | grad = tf.concat(0, grads) 160 | grad = tf.reduce_mean(grad, 0) 161 | 162 | # Keep in mind that the Variables are redundant because they are shared 163 | # across towers. So .. we will just return the first tower's pointer to 164 | # the Variable. 165 | v = grad_and_vars[0][1] 166 | grad_and_var = (grad, v) 167 | average_grads.append(grad_and_var) 168 | return average_grads 169 | --------------------------------------------------------------------------------