├── LICENCE
├── README.md
├── libsvm_to_tfrecord.py
├── resources
└── dataflow.png
├── spark_to_libsvm.scala
├── test.py
├── train.py
└── utils.py
/LICENCE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Thorough Images
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [](https://opensource.org/licenses/MIT)
2 |
3 | # EasySparse
4 |
5 | ## Motivation
6 | In production environments, we find that TensorFlow handles sparse learning scenarios poorly. Even after reading records out of a TFRecord file, feeding them into a deep learning model can still be difficult. Thus, we have open-sourced this project to help researchers and engineers build their own models on sparse data while hiding these difficulties.
7 |
8 | This project naturally fits scenarios in which one uses TensorFlow to build deep learning models on data acquired from Spark. Other scenarios generalize easily.
9 |
10 | ## Data Flow
11 | ![Data Flow](resources/dataflow.png)
12 |
13 | ## Programs
14 | `spark_to_libsvm.scala` Read data from Spark into a LibSVM file, one-hot encoding features on demand.
15 |
16 | `libsvm_to_tfrecord.py` Convert a LibSVM file into a TFRecord.
17 |
18 | `train.py` Training code for a fully connected NN with multi-GPU support.
19 |
20 | `test.py` Test the performance of the trained model.
21 |
22 | `utils.py` Contains all the functions used in training and testing.
23 |
24 | ## Usage
25 | 1. Read data from Spark, one-hot encode selected features, and write them into a LibSVM file. Be sure to manually split the data into three LibSVM files, one each for training, validation, and testing (`spark_to_libsvm.scala`); see the splitting sketch after this list.
26 | 2. Transform the training LibSVM file into a TFRecord file (`libsvm_to_tfrecord.py`).
27 | 3. Run the training program (`train.py`).
28 | 4. Test the trained model (`test.py`).
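
A minimal sketch of the manual split mentioned in step 1, assuming a single source LibSVM file and an 8/1/1 random split (the file names and ratios here are placeholders, not part of this repository):

```python
import random

# Hypothetical paths and ratios; adjust to your own data.
random.seed(42)
with open("all.libsvm") as src, \
     open("train.libsvm", "w") as train_f, \
     open("validation.libsvm", "w") as val_f, \
     open("test.libsvm", "w") as test_f:
    for line in src:
        r = random.random()
        if r < 0.8:
            train_f.write(line)   # ~80% for training
        elif r < 0.9:
            val_f.write(line)     # ~10% for validation
        else:
            test_f.write(line)    # ~10% for testing
```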
29 |
30 | ## Environment
31 | 1. Spark v2.0
32 | 2. TensorFlow >= v0.12.1
33 |
34 | ## Python Package Requirements
35 | 1. NumPy (required)
36 | 2. scikit-learn (only required in `test.py`)
37 | 3. TensorFlow (required, >= v0.12.1)
38 |
39 | ## Implementation Notes
40 | 1. During training, `train.py` reads all the validation data from its LibSVM file into memory and draws shuffled training batches from the TFRecord file. Likewise, during testing, all the test data is read from its LibSVM file. Therefore, one does not need to convert the validation and test LibSVM files to TFRecords. This implementation may not work when the validation and test sets are too large to fit into memory, although that rarely happens since they are usually much smaller than the training set. If it does happen, one needs to write TFRecord file queues for the validation and test sets.
41 | 2. All the parameters required in the training process are defined at the top of `train.py`. Here we use a two-layer fully connected NN to model our data. Note that we use `AdamOptimizer` with an exponentially decayed learning rate (see the sketch after this list).
42 | 3. One may experiment with various types of deep learning models by modifying only the model definition in `train.py`.
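
For reference, the optimizer setup described in note 2 looks roughly like the following sketch (TensorFlow 0.12-era API; the constants mirror the defaults at the top of `train.py`):

```python
import tensorflow as tf

# Simplified sketch of the decayed learning rate and optimizer used in train.py.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    1e-3,              # starter_learning_rate
    global_step,
    1000,              # learning_rate_decay_step
    0.5,               # learning_rate_decay_rate
    staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)
# In train.py, gradients are computed per GPU tower, averaged with
# utils.average_gradients(), and applied via apply_gradients(..., global_step=global_step).
```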
43 |
44 | ## Contribution
45 | Contributions and comments are welcome!
46 |
47 | ## Licence
48 | MIT
49 |
--------------------------------------------------------------------------------
/libsvm_to_tfrecord.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | import tensorflow as tf
4 |
5 |
6 | def convert_tfrecords(input_filename, output_filename):
7 | """Concert the LibSVM contents to TFRecord.
8 | Args:
9 | input_filename: LibSVM filename.
10 | output_filename: Desired TFRecord filename.
11 | """
12 | print("Starting to convert {} to {}...".format(input_filename, output_filename))
13 | writer = tf.python_io.TFRecordWriter(output_filename)
14 |
15 | for line in open(input_filename, "r"):
16 | data = line.split(" ")
17 | label = float(data[0])
18 | ids = []
19 | values = []
20 | for fea in data[1:]:
21 | id, value = fea.split(":")
22 | ids.append(int(id))
23 | values.append(float(value))
24 | # Write samples one by one
25 | example = tf.train.Example(features=tf.train.Features(feature={
26 | "label":
27 | tf.train.Feature(float_list=tf.train.FloatList(value=[label])),
28 | "ids":
29 | tf.train.Feature(int64_list=tf.train.Int64List(value=ids)),
30 | "values":
31 | tf.train.Feature(float_list=tf.train.FloatList(value=values))
32 | }))
33 | writer.write(example.SerializeToString())
34 | writer.close()
35 | print("Successfully converted {} to {}!".format(input_filename, output_filename))
36 |
37 |
38 | sess = tf.InteractiveSession()
39 | convert_tfrecords("/path/to/libsvm/file/train.libsvm", "/path/to/tfrecord/file/train.tfrecords")
40 | sess.close()
41 |
--------------------------------------------------------------------------------
/resources/dataflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ThoroughImages/EasySparse/ebed5ee3c104523d35ccb62795b3b78b8df51d65/resources/dataflow.png
--------------------------------------------------------------------------------
/spark_to_libsvm.scala:
--------------------------------------------------------------------------------
1 | /*
2 | * Program for:
3 | *
4 | * 1. Data Preprocessing
5 | * one-hot encoding for discrete features
6 | * 2. File Format Transformation
7 | * converting table in HDFS to
8 | * LibSVM format
9 | */
10 | import org.apache.spark.rdd.RDD
11 |
12 | /**
13 |  * Convert a String to a Double;
14 |  * return 0.0 on failure.
15 |  */
16 | def parseDouble(s: String) = try { s.toDouble } catch { case _: Throwable => 0.0 }
17 |
18 | /**
19 | * Load RDD from HDFS and split each row
20 | * using sep as the delimiter string.
21 | *
22 |  * @param path the path of the source file in HDFS to be converted
23 |  * @param sep the delimiter string, "\001" by default
24 | *
25 | * @return rdd of type RDD[Array[String]]
26 | */
27 | def loadRddFromHdfs(path:String, sep:String="\001"): RDD[Array[String]] = {
28 | val rdd = sc.textFile(path)
29 | rdd.map(_.split(sep))
30 | }
31 |
32 | /**
33 | * Converting a row of data to LibSVM format
34 | *
35 | */
36 | def rowToLibSVM(row:Seq[Double]):String = {
37 | val label = row.head
38 | val featureIndex = row.tail.zipWithIndex
39 | val features = featureIndex.foldLeft(Seq.empty[String])((acc, e) =>
40 | if(e._1 != 0) acc :+ s"${e._2 + 1}:${e._1}" else acc
41 | )
42 | s"$label ${features mkString " "}"
43 | }
44 |
45 | /**
46 | * Converting rdd to LibSVM format
47 | *
48 |  * @param rdd the source RDD of rows split into columns
49 |  * @param columns the sequence of column names
50 |  * @param columnTypeConfig configuration of column types, including the label, continuous
51 |  *              features, discrete features, and columns to be omitted
52 |  * @param oneHotConfig names of the discrete columns and the corresponding sequences of enumerations
53 | *
54 | * @return rdd rows of LibSVM format
55 | */
56 |
57 | def rddToLibSVM(rdd:RDD[Array[String]], columns:Seq[String],
58 | columnTypeConfig:Map[String,Seq[String]], oneHotConfig:Map[String,Seq[String]]):RDD[String] = {
59 | val columnIndexMap = columns.zipWithIndex.toMap
60 | val oneHotMap:Map[Int, Map[String, Array[Double]]] = oneHotConfig.map{case (colName, enumSeq) => {
61 | val index = columnIndexMap(colName)
62 | val map = enumSeq.zipWithIndex.map(e => (e._1, Array.fill[Double](enumSeq.length)(0.0).updated(e._2, 1.0))).toMap
63 | (index -> map)
64 | }}
65 | val columnTypeIndex = columnTypeConfig.map(e => {
66 | (e._1, e._2.map(c => columnIndexMap(c)))
67 | })
68 | rdd.map(row => {
69 | val rowMap = row.zipWithIndex.map(e => (e._2, e._1)).toMap
70 | val label = columnTypeIndex("label").map(e => parseDouble(rowMap(e)))
71 | val continuous = columnTypeIndex("continuousColumns").map(e => parseDouble(rowMap(e)))
72 | val discreteIndex = columnTypeIndex("discreteColumns").map(e => (e, rowMap(e))) // (index, value)
73 |     val discrete = discreteIndex.map(e => oneHotMap(e._1)(e._2)).flatten // flattened one-hot vectors
74 | rowToLibSVM(label ++ discrete ++ continuous)
75 | })
76 | }
77 |
78 |
79 | /**
80 | * Simple Demo
81 | */
82 |
83 | val columns = Seq("key", "col_1", "col_2", "col_3", "col_4", "col_5", "y")
84 |
85 | // Specify the type of columns
86 | val columnTypeConfig = Map(
87 | ("label", Seq("y")),
88 | ("continuousColumns", Seq("col_1", "col_2")),
89 | ("discreteColumns", Seq("col_3", "col_4")),
90 | ("omittedColumns", Seq("key"))
91 | )
92 |
93 | // Specify column of discrete feature and the corresponding enumerations
94 | val oneHotConfig = Map (
95 | "col_3" -> Seq("col_3_a", "col_3_b", "col_3_c"),
96 | "col_4" -> Seq("col_4_a", "col_4_b"),
97 | "col_5" -> Seq("col_5_a", "col_5_b", "col_5_c", "col_5_d")
98 | )
99 |
100 | val filepath = "hdfs:///SOURCE_FILE_PATH"
101 | val savepath = "hdfs:///RESULT_PATH"
102 | val rdd = loadRddFromHdfs(filepath)
103 | val LibSVMRDD = rddToLibSVM(rdd, columns, columnTypeConfig, oneHotConfig)
104 | LibSVMRDD.saveAsTextFile(savepath)
105 |
106 |
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | import sklearn.metrics as sk
4 | import sklearn.preprocessing as pp
5 | import utils
6 |
7 |
8 | # TF config, allowing the GPU to be shared by multiple programs
9 | config = tf.ConfigProto()
10 | config.gpu_options.per_process_gpu_memory_fraction = 0.5
11 | config.gpu_options.allow_growth = True
12 | config.allow_soft_placement = True
13 |
14 | test_file = 'path/to/test/set/test.libsvm'
15 | model_file = 'path/to/model/file/model.ckpt'
16 | filter_number = 64
17 | feature_num = 20000
18 | batch_size = 2048  # Batch size used to slice the test set; matches train.py
19 | gpu_num = 2
20 |
21 | graph = tf.Graph()
22 |
23 | # Create the model: 2 layer FC NN, same as training
24 | with graph.as_default():
25 | sp_indice = tf.placeholder(tf.int64)
26 | sp_value = tf.placeholder(tf.float32)
27 | x = tf.SparseTensor(sp_indice, sp_value, [batch_size, feature_num])
28 | y_ = tf.placeholder(tf.float32, [None, 2])
29 | keep_prob = tf.placeholder("float32")
30 | W_fc1 = utils.weight_variable([feature_num, filter_number])
31 | b_fc1 = utils.bias_variable([filter_number])
32 | W_fc2 = utils.weight_variable([filter_number, filter_number])
33 | b_fc2 = utils.bias_variable([filter_number])
34 | W_out = utils.weight_variable([filter_number, 2])
35 | b_out = utils.bias_variable([2])
36 | tower_pred = [[] for _ in range(gpu_num)]
37 | tower_loss = [0.0 for _ in range(gpu_num)]
38 | for i in range(gpu_num):
39 | with tf.device("/gpu:%d" % i):
40 | # We split the data into $gpu_num parts.
41 | next_batch = tf.sparse_split(split_dim=0, num_split=gpu_num, sp_input=x)[i]
42 |             next_label = y_[i * batch_size // gpu_num: (i + 1) * batch_size // gpu_num, :]
43 | hidden_1 = tf.nn.relu(tf.sparse_tensor_dense_matmul(next_batch, W_fc1) + b_fc1)
44 | hidden_2 = tf.nn.dropout(tf.nn.relu(tf.matmul(hidden_1, W_fc2) + b_fc2), keep_prob)
45 | tower_pred[i] = tf.nn.softmax(tf.matmul(hidden_2, W_out) + b_out)
46 | tower_loss[i] = -tf.reduce_mean(tf.reduce_sum(tf.cast(next_label, "float") * tf.log(tf.clip_by_value(tf.cast(tower_pred[i], "float"), 1e-10, 1.0)), reduction_indices=[1]))
47 | params = tf.trainable_variables()
48 | tf.get_variable_scope().reuse_variables()
49 | pred = tf.concat(0, [tower_pred[_] for _ in range(gpu_num)])
50 | loss = tower_loss[0]
51 | for _ in range(1, gpu_num):
52 | loss = tf.add(loss, tower_loss[_])
53 | loss = loss / (gpu_num + 0.0)
54 |
55 |
56 | with graph.as_default():
57 | with tf.Session(config=config, graph=graph) as sess:
58 | saver = tf.train.Saver()
59 | sess.run(tf.initialize_all_variables())
60 | saver.restore(sess, model_file)
61 | label_v, ids_v, values_v = utils.libsvm_data_read(test_file)
62 | test_num = int(label_v.shape[0] / batch_size)
63 | iter = 1
64 | test_ids = ids_v[0: batch_size]
65 | test_values = values_v[0: batch_size]
66 | test_labels = label_v[0: batch_size]
67 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids, test_values)
68 | y_prediction = sess.run(pred, feed_dict={sp_indice: ids_flatten_v , sp_value: value_flatten_v, y_: test_labels, keep_prob: 1.0})
69 | test_y_real = test_labels
70 | for index in range(1, test_num):
71 | print 'Testing Stage ' + str(index) + ' / ' + str(test_num)
72 | test_ids = ids_v[index * batch_size: (index + 1) * batch_size]
73 | test_values = values_v[index * batch_size: (index + 1) * batch_size]
74 | test_labels = label_v[index * batch_size:(index + 1) * batch_size]
75 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids, test_values)
76 | test_y_real = np.concatenate((test_y_real, test_labels), 0)
77 | y_prediction = np.concatenate((y_prediction, sess.run(pred, feed_dict={sp_indice: ids_flatten_v, sp_value: value_flatten_v, y_: test_labels, keep_prob: 1.0})), 0)
78 | test_y_real = np.argmax(test_y_real, 1)
79 | for threshold in range(0, 1001, 1):
80 | test_prediction = np.array(pp.binarize(np.array([y_prediction[:, 1]]), threshold / 1000.0)[0])
81 | cm = sk.confusion_matrix(test_y_real, test_prediction)
82 | print cm
83 |             # The former two values are used for ROC, the latter two are used for P-R
84 | print str(cm[0][1] / (cm[0][1] + cm[0][0] + 0.0)) + '\t' + str(cm[1][1] / (cm[1][1] + cm[1][0] + 0.0)) + \
85 | '\t' + str(cm[1][1] / (cm[1][1] + cm[1][0] + 0.0)) + '\t' + str(cm[1][1] / (cm[1][1] + cm[0][1] + 0.0))
86 |
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
1 | import os
2 | import tensorflow as tf
3 | import numpy as np
4 | import utils
5 |
6 |
7 | # TF config, allowing the GPU to be shared by multiple programs
8 | config = tf.ConfigProto()
9 | config.gpu_options.per_process_gpu_memory_fraction = 0.5
10 | config.gpu_options.allow_growth = True
11 | config.allow_soft_placement = True
12 |
13 | # Parameter Definition
14 | training_sample = os.path.join('/path/to/tfrecord', 'train.tfrecords')
15 | validation_sample = '/path/to/validation/set/validation.libsvm'
16 | model_path = './models/fcnn'
17 | filter_number = 64
18 | training_iteration = 10000
19 | batch_size = 2048
20 | display_step = 100
21 | test_step = 1000
22 | save_step = 1000
23 | shuffle_index = 0
24 | starter_learning_rate = 1e-3
25 | feature_num = 20000 # Should be the same as the sparse input shape
26 | epoch_num = None
27 | shuffle_or_not = True
28 | gpu_num = 2
29 | max_model_num_to_keep = 1000
30 | capacity = 40960
31 | min_after_dequeue = 10240
32 | learning_rate_decay_step = 1000
33 | learning_rate_decay_rate = 0.5
34 |
35 | # Define TF graph
36 | graph = tf.Graph()
37 |
38 | # Training sample queue
39 | with graph.as_default():
40 | filename_queue = tf.train.string_input_producer([training_sample], num_epochs=epoch_num, shuffle=shuffle_or_not)
41 | label_batch, ids_batch, values_batch = utils.read_and_decode_batch(filename_queue, batch_size, capacity, min_after_dequeue)
42 |     dense_values = tf.sparse_tensor_to_dense(values_batch, -1)  # Densify with -1 padding for further processing
43 |     dense_ids = tf.sparse_tensor_to_dense(ids_batch, -1)  # Densify with -1 padding for further processing
44 |
45 | # Create the model: 2 layer FC NN
46 | with graph.as_default():
47 | global_step = tf.Variable(0, trainable=False)
48 | # Here we use the indices and values to reproduce the input SparseTensor
49 | sp_indice = tf.placeholder(tf.int64)
50 | sp_value = tf.placeholder(tf.float32)
51 | x = tf.SparseTensor(sp_indice, sp_value, [batch_size, feature_num])
52 | y_ = tf.placeholder(tf.float32, [None, 2])
53 | keep_prob = tf.placeholder("float32")
54 | W_fc1 = utils.weight_variable([feature_num, filter_number])
55 | b_fc1 = utils.bias_variable([filter_number])
56 | W_fc2 = utils.weight_variable([filter_number, filter_number])
57 | b_fc2 = utils.bias_variable([filter_number])
58 | W_out = utils.weight_variable([filter_number, 2])
59 | b_out = utils.bias_variable([2])
60 | learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, learning_rate_decay_step, learning_rate_decay_rate, staircase=True)
61 |     opt_ = tf.train.AdamOptimizer(learning_rate)
62 | tower_grads = []
63 | tower_pred = [[] for _ in range(gpu_num)]
64 | tower_loss = [0.0 for _ in range(gpu_num)]
65 | for i in range(gpu_num):
66 | with tf.device("/gpu:%d" % i):
67 | # We split the data into $gpu_num parts.
68 | next_batch = tf.sparse_split(split_dim=0, num_split=gpu_num, sp_input=x)[i]
69 |             next_label = y_[i * batch_size // gpu_num: (i + 1) * batch_size // gpu_num, :]
70 | hidden_1 = tf.nn.relu(tf.sparse_tensor_dense_matmul(next_batch, W_fc1) + b_fc1)
71 | hidden_2 = tf.nn.dropout(tf.nn.relu(tf.matmul(hidden_1, W_fc2) + b_fc2), keep_prob)
72 | tower_pred[i] = tf.nn.softmax(tf.matmul(hidden_2, W_out) + b_out)
73 | tower_loss[i] = -tf.reduce_mean(tf.reduce_sum(tf.cast(next_label, "float") * tf.log(tf.clip_by_value(tf.cast(tower_pred[i], "float"), 1e-10, 1.0)), reduction_indices=[1]))
74 | params = tf.trainable_variables()
75 | tf.get_variable_scope().reuse_variables()
76 | grads = opt_.compute_gradients(tower_loss[i], var_list = params)
77 | tower_grads.append(grads)
78 | pred = tf.concat(0, [tower_pred[_] for _ in range(gpu_num)])
79 | loss = tower_loss[0]
80 | for _ in range(1, gpu_num):
81 | loss = tf.add(loss, tower_loss[_])
82 | loss = loss / (gpu_num + 0.0)
83 | grads_ave = utils.average_gradients(tower_grads)
84 | train_step = opt_.apply_gradients(grads_ave, global_step=global_step)
85 |
86 | with graph.as_default():
87 | with tf.Session(config=config, graph=graph) as sess:
88 | saver = tf.train.Saver(max_to_keep=max_model_num_to_keep)
89 | sess.run(tf.initialize_all_variables())
90 | coord = tf.train.Coordinator()
91 | threads = tf.train.start_queue_runners(coord=coord, sess=sess)
92 | print 'Reading From LibSVM file...'
93 | label_v, ids_v, values_v = utils.libsvm_data_read(validation_sample)
94 | test_num = int(label_v.shape[0] / batch_size)
95 | iter = 1
96 | print 'Starting Training Process...'
97 | try:
98 | while not coord.should_stop():
99 | label, ids, values = sess.run([label_batch, dense_ids, dense_values])
100 | label, ids_flatten, value_flatten = utils.sparse_tensor_to_train_batch(label, ids, values)
101 | sess.run(train_step, feed_dict={sp_indice: ids_flatten, sp_value: value_flatten, y_: label, keep_prob: 0.5})
102 | if iter % display_step == 0:
103 | print "Iteration:", '%04d' % (iter), ", Training Sample Loss: ", "{:.9f}".format(
104 | sess.run(loss, feed_dict={sp_indice: ids_flatten, sp_value: value_flatten, y_: label, keep_prob: 1.0}))
105 | if iter % test_step == 0:
106 | test_loss = 0.0
107 | for index in range(0, test_num):
108 | test_ids = ids_v[index * batch_size: (index + 1) * batch_size]
109 | test_values = values_v[index * batch_size: (index + 1) * batch_size]
110 | ids_flatten_v, value_flatten_v = utils.libsvm_convert_sparse_tensor(test_ids,test_values)
111 | test_labels = label_v[index * batch_size:(index + 1) * batch_size]
112 | test_loss += sess.run(loss, feed_dict={sp_indice: ids_flatten_v , sp_value: value_flatten_v ,y_: test_labels, keep_prob: 1.0})
113 | test_loss /= (test_num + 0.0)
114 | print 'Validation Loss: %s' % str(test_loss)
115 | if iter % save_step == 0:
116 | save_path = saver.save(sess, model_path + '-' + str(iter) + '.ckpt')
117 | iter += 1
118 | except tf.errors.OutOfRangeError:
119 | print("Training Done!")
120 | finally:
121 | coord.request_stop()
122 | coord.join(threads)
123 |
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 |
4 |
5 | def read_and_decode_batch(filename_queue, batch_size, capacity, min_after_dequeue):
6 | """Dequeue a batch of data from the TFRecord.
7 | Args:
8 | filename_queue: Filename Queue of the TFRecord.
9 | batch_size: How many records dequeued each time.
10 | capacity: The capacity of the queue.
11 | min_after_dequeue: Ensures a minimum amount of shuffling of examples.
12 | Returns:
13 | List of the dequeued (batch_label, batch_ids, batch_values).
14 | """
15 | reader = tf.TFRecordReader()
16 | _, serialized_example = reader.read(filename_queue)
17 | batch_serialized_example = tf.train.shuffle_batch([serialized_example],
18 | batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue)
19 |     # The feature definition here should be consistent with the LibSVM-to-TFRecord conversion.
20 | features = tf.parse_example(batch_serialized_example,
21 | features={
22 | "label": tf.FixedLenFeature([], tf.float32),
23 | "ids": tf.VarLenFeature(tf.int64),
24 | "values": tf.VarLenFeature(tf.float32)
25 | })
26 | batch_label = features["label"]
27 | batch_ids = features["ids"]
28 | batch_values = features["values"]
29 | return batch_label, batch_ids, batch_values
30 |
31 |
32 | def sparse_tensor_to_train_batch(dense_label, dense_ids, dense_values):
33 |     """Transform the dense ids and values into TF-understandable inputs and one-hot encode the labels.
34 |     For instance, dense_ids in the form of
35 | [[1, 4, 6, -1],
36 | [2, 3, -1, -1],
37 | [3, 4, 5, 6], ...
38 | ]
39 | should be transformed into
40 | [[0, 1], [0, 4], [0, 6],
41 | [1, 2], [1, 3],
42 | [2, 3], [2, 4], [2, 5], [2, 6], ...
43 | ]
44 |     Similarly, dense_values in the form of
45 | [[0.01, 0.23, 0.45, -1],
46 | [0.34, 0.25, -1, -1],
47 | [0.23, 0.78, 0.12, 0.56], ...
48 | ]
49 | should be transformed into
50 | [0.01, 0.23, 0.45, 0.34, 0.25, 0.23, 0.78, 0.12, 0.56, ...]
51 | Args:
52 | dense_label: Labels.
53 | dense_ids: Sparse indices.
54 | dense_values: Sparse values.
55 | Returns:
56 | List of the processed (label, ids, values) ready for training inputs.
57 | """
58 | indice_flatten = []
59 | values_flatten = []
60 | label_flatten = []
61 | index = 0
62 | for i in range(0, dense_label.shape[0]):
63 | if int(float(dense_label[i])) == 0:
64 | label_flatten.append([1.0, 0.0])
65 | else:
66 | label_flatten.append([0.0, 1.0])
67 | values = list(dense_values)
68 | indice = list(dense_ids)
69 | for j in range(0,len(indice[i])):
70 | if not indice[i][j] == -1:
71 | indice_flatten.append([index,indice[i][j]])
72 | values_flatten.append(values[i][j])
73 | else:
74 | break
75 | index += 1
76 | return np.array(label_flatten), indice_flatten, values_flatten
77 |
78 |
79 | def libsvm_data_read(input_filename):
80 | """Read all the data from the LibSVM file.
81 | Args:
82 | input_filename: Filename of the LibSVM.
83 | Returns:
84 | List of the acquired (label, ids, values).
85 | """
86 | labels = []
87 | ids_all = []
88 | values_all = []
89 | for line in open(input_filename, 'r'):
90 | data = line.split(' ')
91 | if int(float(data[0])) == 0:
92 | labels.append([1.0, 0.0])
93 | else:
94 | labels.append([0.0, 1.0])
95 | ids = []
96 | values = []
97 | for fea in data[1:]:
98 | id, value = fea.split(':')
99 | ids.append(int(id))
100 | values.append(float(value))
101 | ids_all.append(ids)
102 | values_all.append(values)
103 | return np.array(labels), np.array(ids_all), np.array(values_all)
104 |
105 |
106 | def libsvm_convert_sparse_tensor(array_ids, array_values):
107 |     """Transform the contents into TF-understandable formats, similar to
108 |     sparse_tensor_to_train_batch().
109 | Args:
110 | array_ids: Sparse indices.
111 | array_values: Sparse values.
112 | Returns:
113 | List of the transformed (ids, values).
114 | """
115 | indice_flatten_v = []
116 | values_flatten_v = []
117 | index = 0
118 | for i in range(0, array_ids.shape[0]):
119 | for j in range(0, len(array_ids[i])):
120 | indice_flatten_v.append([index, array_ids[i][j]])
121 | values_flatten_v.append(array_values[i][j])
122 | index += 1
123 | return indice_flatten_v, values_flatten_v
124 |
125 |
126 | def weight_variable(shape):
127 | initial = tf.truncated_normal(shape, stddev=0.1)
128 | return tf.Variable(initial)
129 |
130 |
131 | def bias_variable(shape):
132 | initial = tf.constant(0.1, shape=shape)
133 | return tf.Variable(initial)
134 |
135 | def average_gradients(tower_grads):
136 | """Calculate the average gradient for each shared variable across all towers.
137 | Note that this function provides a synchronization point across all towers.
138 | Args:
139 | tower_grads: List of lists of (gradient, variable) tuples. The outer list
140 | is over individual gradients. The inner list is over the gradient
141 | calculation for each tower.
142 | Returns:
143 | List of pairs of (gradient, variable) where the gradient has been averaged
144 | across all towers.
145 | """
146 | average_grads = []
147 | for grad_and_vars in zip(*tower_grads):
148 | # Note that each grad_and_vars looks like the following:
149 | # ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
150 | grads = []
151 | for g, _ in grad_and_vars:
152 | # Add 0 dimension to the gradients to represent the tower.
153 | expanded_g = tf.expand_dims(g, 0)
154 |
155 | # Append on a 'tower' dimension which we will average over below.
156 | grads.append(expanded_g)
157 |
158 | # Average over the 'tower' dimension.
159 | grad = tf.concat(0, grads)
160 | grad = tf.reduce_mean(grad, 0)
161 |
162 | # Keep in mind that the Variables are redundant because they are shared
163 | # across towers. So .. we will just return the first tower's pointer to
164 | # the Variable.
165 | v = grad_and_vars[0][1]
166 | grad_and_var = (grad, v)
167 | average_grads.append(grad_and_var)
168 | return average_grads
169 |
--------------------------------------------------------------------------------