├── .gitignore ├── README.md ├── pom.xml ├── props ├── screenshots ├── 1_create_hbase_table.png ├── 2_hbase_scan.png ├── 3_hbase_filtered_output_in_spark.png ├── Screen Shot 2016-09-27 at 10.58.13 AM.png ├── hbase_records.png ├── hbase_spark_output.png └── hbase_spark_output_raw.png ├── src └── main │ └── scala │ └── com │ └── github │ └── zaratsian │ └── SparkHBase │ ├── SimulateAndBulkLoadHBaseData.scala │ └── SparkReadHBaseSnapshot.scala └── write_to_hbase.py /.gitignore: -------------------------------------------------------------------------------- 1 | target/ 2 | screenshots/ 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

HBase Snapshot to Spark Example

2 | 3 | This project shows how to analyze an HBase Snapshot using Spark. 4 |
5 |
6 | Why do this? 7 |
8 | The main motivation for writing this code is to reduce the load on the HBase Region Servers while analyzing HBase records. By creating a snapshot of the HBase table, we can run Spark jobs against the snapshot instead of the live table, eliminating the impact on the region servers and reducing the risk to operational systems. 9 |
10 |
At a high level, here's what the code does: 11 | 1. Reads an HBase Snapshot into a Spark RDD 12 | 2. Parses the HBase KeyValues into a Spark DataFrame 13 | 3. Applies arbitrary data processing (timestamp and rowkey filtering) 14 | 4. Saves the results back to HDFS in HBase HFile/KeyValue format, using HFileOutputFormat. 15 | - The output maintains the original rowkey, timestamp, column family, qualifier, and value structure. 16 | 5. From here, you can bulkload the HDFS files into HBase. 17 | 18 |
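For orientation, here is a minimal sketch of the snapshot read (condensed from SparkReadHBaseSnapshot.scala in this repo; the snapshot name and paths are the example values from the README/props file and should be adjusted for your cluster):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSnapshotInputFormat}
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.Base64
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object SnapshotReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SnapshotReadSketch"))

    // Point the job at the HBase root dir and serialize a Scan into the config,
    // as the full source does (it also sets hbase.zookeeper.quorum from the props file)
    val hConf = HBaseConfiguration.create()
    hConf.set("hbase.rootdir", "/apps/hbase/data")
    hConf.set(TableInputFormat.SCAN, Base64.encodeBytes(ProtobufUtil.toScan(new Scan()).toByteArray))

    // Restore the snapshot metadata into a working directory and configure the input format
    val job = Job.getInstance(hConf)
    TableSnapshotInputFormat.setInput(job, "hbase_simulated_1m_ss", new Path("/user/hbase"))

    // Each record is a (rowkey, Result) pair; the Result holds the row's cells (KeyValues)
    val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println("Snapshot record count: " + hBaseRDD.count())
    sc.stop()
  }
}
```

From that RDD, the full job builds a DataFrame from the cells, filters it, and writes HFiles with saveAsNewAPIHadoopFile.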
19 | Here's more detail on how to run this project: 20 |
21 | 1. Create an HBase table and populate it with data (or use an existing table). This repo includes two ways to simulate an HBase table for testing: the SimulateAndBulkLoadHBaseData.scala code (preferred) or write_to_hbase.py (much slower than the Scala code). 22 |
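For reference, the simulation job is submitted the same way as the read job below; this command is taken verbatim from the usage note inside SimulateAndBulkLoadHBaseData.scala (the simulated table name and record count are read from the props file): spark-submit --class com.github.zaratsian.SparkHBase.SimulateAndBulkLoadHBaseData --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props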
23 |
24 | 2. Take an HBase Snapshot: snapshot 'hbase_simulated_1m', 'hbase_simulated_1m_ss' 25 |
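You can verify that the snapshot was created from the HBase shell with: list_snapshots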
26 |
27 | 3. (Optional) The HBase snapshot will already be in HDFS (under /apps/hbase/data), but you can use ExportSnapshot if you want to copy the snapshot to an HDFS location of your choice: 28 |
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot hbase_simulated_1m_ss -copy-to /tmp/ -mappers 2 29 |
30 |
31 | 4. Run the included Spark (Scala) code against the HBase snapshot. The job reads the snapshot, filters records by rowkey range (80001 to 90000) and by a timestamp threshold (set in the props file), then writes the results back to HDFS in HBase HFile/KeyValue format (the core filter is shown in the excerpt after these steps). 32 |
33 |
34 | a.) Build project: mvn clean package 35 |
36 |
37 | b.) Run Spark job: spark-submit --class com.github.zaratsian.SparkHBase.SparkReadHBaseSnapshot --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props 38 |
39 |
40 | c.) NOTE: Adjust the properties within the props file (if needed) to match your configuration. 41 | 42 |
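The core of the filtering in step 4 is a plain Spark DataFrame filter. This excerpt is taken from SparkReadHBaseSnapshot.scala (here df is the DataFrame parsed from the snapshot's KeyValues and props is the parsed props file); note that the rowkey range is currently hard-coded, while the timestamp threshold comes from the datetime_threshold property:

```scala
// Parse the datetime_threshold property into epoch milliseconds (requires java.text.SimpleDateFormat)
val datetime_threshold      = props.getOrElse("datetime_threshold", "2016-08-25 14:27:02:001")
val datetime_threshold_long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").parse(datetime_threshold).getTime()

// Keep cells newer than the threshold whose rowkey falls in [80001, 90000]
val df_filtered = df.filter($"colDatetime" >= datetime_threshold_long && $"rowkey".between(80001, 90000))
```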
43 |
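Step 5 of the high-level list (bulkloading the generated HFiles into HBase) is not scripted for the filtered output, but a sketch of that follow-up step, mirroring the LoadIncrementalHFiles call in SimulateAndBulkLoadHBaseData.scala, could look like this. The table name below assumes the hbase_simulated_1m_ss snapshot (the read job writes to /tmp/<snapshot name>_filtered), and the target table with a matching column family must already exist:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

object BulkLoadFilteredOutput {
  def main(args: Array[String]): Unit = {
    val hConf = HBaseConfiguration.create()
    hConf.set("zookeeper.znode.parent", "/hbase-unsecure")

    // The read job writes its filtered HFiles to /tmp/<snapshot name>_filtered
    val hTableName = "hbase_simulated_1m_ss_filtered"   // assumed name for this example
    val table = new HTable(hConf, hTableName)

    // Move the HFiles into the regions of the (pre-created) target table
    new LoadIncrementalHFiles(hConf).doBulkLoad(new Path("/tmp/" + hTableName), table)
  }
}
```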
Preliminary Performance Metrics: 44 |
| Number of Records | Spark Runtime (without write to HDFS) | Spark Runtime (with write to HDFS) | HBase Shell Scan (without write to HDFS) |
|---|---|---|---|
| 1,000,000 | 27.07 seconds | 36.05 seconds | 6.8600 seconds |
| 50,000,000 | 417.38 seconds | 764.801 seconds | 7.5970 seconds |
| 100,000,000 | 741.829 seconds | 1413.001 seconds | 8.1380 seconds |

NOTE: Here is the HBase shell scan that was used: scan 'hbase_simulated_100m', {STARTROW => "\x00\x01\x38\x81", ENDROW => "\x00\x01\x5F\x90", TIMERANGE => [1474571655001,9999999999999]}. This scan filters the 100 million record HBase table by the rowkey range 80001-90000 (\x00\x01\x38\x81 - \x00\x01\x5F\x90) and by an arbitrary time range, specified as unix timestamps in milliseconds. 72 |
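As a sanity check on those STARTROW/ENDROW byte strings: the simulated rowkeys are written as 4-byte big-endian integers (Bytes.toBytes(x: Int) in SimulateAndBulkLoadHBaseData.scala), so 80001 and 90000 map to exactly the hex shown above. A small hypothetical helper to verify:

```scala
import org.apache.hadoop.hbase.util.Bytes

object RowkeyHex {
  // Render a byte array in the \xNN style used in the HBase shell scan above
  def hex(bytes: Array[Byte]): String = bytes.map("\\x%02X".format(_)).mkString

  def main(args: Array[String]): Unit = {
    println(hex(Bytes.toBytes(80001)))  // \x00\x01\x38\x81
    println(hex(Bytes.toBytes(90000)))  // \x00\x01\x5F\x90
  }
}
```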
73 |
74 | Sample output of HBase simulated data structure (using SimulateAndBulkLoadHBaseData.scala): 75 | 76 |
77 |
78 | Sample output of HBase simulated data structure (using write_to_hbase.py): 79 | 80 |
81 |
82 |
Versions: 83 |
This code was tested using Hortonworks HDP 2.5.0.0 84 |
HBase version 1.1.2 85 |
Spark version 1.6.2 86 |
Scala version 2.10.5 87 |
88 |
89 |
References: 90 |
HBaseConfiguration Class 91 |
HBase TableSnapshotInputFormat Class 92 |
HBase KeyValue Class 93 |
HBase Bytes Class 94 |
HBase CellUtil Class 95 | -------------------------------------------------------------------------------- /pom.xml: -------------------------------------------------------------------------------- 1 | 2 | 4.0.0 3 | SparkHBaseExample 4 | jar 5 | 6 | com.github.zaratsian 7 | SparkHBaseExample 8 | 0.0.1-SNAPSHOT 9 | 10 | 11 | 2.5.0.0-1245 12 | 2.7.3 13 | 1.6.2 14 | 2.10 15 | 1.1.2 16 | 17 | 18 | 19 | 20 | hortonworks 21 | http://repo.hortonworks.com/content/repositories/releases/ 22 | 23 | 24 | repo.hortonworks.com-jetty 25 | Hortonworks Jetty Maven Repository 26 | http://repo.hortonworks.com/content/repositories/jetty-hadoop/ 27 | 28 | 29 | central2 30 | http://central.maven.org/maven2/ 31 | 32 | 33 | scala-tools.org 34 | Scala-tools Maven2 Repository 35 | http://scala-tools.org/repo-releases 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | com.typesafe 46 | config 47 | 1.2.1 48 | 49 | 50 | 51 | 52 | org.apache.hbase 53 | hbase-client 54 | ${hbase.version} 55 | 56 | 57 | 58 | 59 | org.apache.hbase 60 | hbase-common 61 | ${hbase.version} 62 | 63 | 64 | 65 | 66 | org.apache.hbase 67 | hbase-server 68 | ${hbase.version} 69 | 70 | 71 | 72 | 73 | org.apache.spark 74 | spark-sql_${scala.version} 75 | ${spark.version} 76 | provided 77 | 78 | 79 | 80 | org.apache.spark 81 | spark-core_${scala.version} 82 | ${spark.version} 83 | provided 84 | 85 | 86 | 87 | 91 | 92 | 93 | 94 | org.apache.hadoop 95 | hadoop-common 96 | ${hadoop.version} 97 | provided 98 | 99 | 100 | 101 | org.apache.hadoop 102 | hadoop-hdfs 103 | ${hadoop.version}.${hdp.version} 104 | provided 105 | 106 | 107 | 108 | org.scala-lang 109 | scala-library 110 | 2.10.5 111 | provided 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | scala-tools.org 123 | Scala-tools Maven2 Repository 124 | http://scala-tools.org/repo-releases 125 | 126 | 127 | 128 | 129 | src 130 | 131 | 132 | org.scala-tools 133 | maven-scala-plugin 134 | 2.15.2 135 | 136 | 137 | 138 | compile 139 | 140 | 141 | 142 | 143 | 144 | org.apache.maven.plugins 145 | maven-shade-plugin 146 | 1.4 147 | 148 | 149 | 150 | *:* 151 | 152 | META-INF/*.SF 153 | META-INF/*.DSA 154 | META-INF/*.RSA 155 | 156 | 157 | 158 | 159 | 160 | 161 | package 162 | 163 | shade 164 | 165 | 166 | 167 | 168 | 169 | com.github.zaratsian.SparkHBase.SparkHBase 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | -------------------------------------------------------------------------------- /props: -------------------------------------------------------------------------------- 1 | ############################################################## 2 | # 3 | # General Props 4 | # 5 | ############################################################## 6 | hbase.rootdir=/apps/hbase/data 7 | hbase.zookeeper.quorum=localhost:2181:/hbase-unsecure 8 | hbase.snapshot.path=/user/hbase 9 | 10 | 11 | ############################################################## 12 | # 13 | # Props for SparkReadHBaseSnapshot 14 | # 15 | ############################################################## 16 | hbase.snapshot.name=hbase_simulated_50m_ss 17 | hbase.snapshot.versions=3 18 | datetime_threshold=2016-09-23 06:27:08:000 19 | 20 | 21 | ############################################################## 22 | # 23 | # Props for SimulateAndBulkLoadHBaseData 24 | # 25 | ############################################################## 26 | simulated_tablename=hbase_simulated_50m 27 | simulated_records=50000000 28 | -------------------------------------------------------------------------------- 
/screenshots/1_create_hbase_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/1_create_hbase_table.png -------------------------------------------------------------------------------- /screenshots/2_hbase_scan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/2_hbase_scan.png -------------------------------------------------------------------------------- /screenshots/3_hbase_filtered_output_in_spark.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/3_hbase_filtered_output_in_spark.png -------------------------------------------------------------------------------- /screenshots/Screen Shot 2016-09-27 at 10.58.13 AM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/Screen Shot 2016-09-27 at 10.58.13 AM.png -------------------------------------------------------------------------------- /screenshots/hbase_records.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/hbase_records.png -------------------------------------------------------------------------------- /screenshots/hbase_spark_output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/hbase_spark_output.png -------------------------------------------------------------------------------- /screenshots/hbase_spark_output_raw.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/hbase_spark_output_raw.png -------------------------------------------------------------------------------- /src/main/scala/com/github/zaratsian/SparkHBase/SimulateAndBulkLoadHBaseData.scala: -------------------------------------------------------------------------------- 1 | 2 | /******************************************************************************************************* 3 | 4 | This code: 5 | 1) Simulates X number of HBase records (number of simulated records can be defined in props file) 6 | 2) Bulkloads the data into HBase (table name can be defined in props file) 7 | 8 | Usage: 9 | 10 | spark-submit --class com.github.zaratsian.SparkHBase.SimulateAndBulkLoadHBaseData --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props 11 | 12 | ********************************************************************************************************/ 13 | 14 | package com.github.zaratsian.SparkHBase; 15 | 16 | import org.apache.spark.{SparkContext, SparkConf} 17 | import org.apache.spark.sql.Row 18 | import org.apache.spark.sql.functions.avg 19 | 20 | import scala.collection.mutable.HashMap 21 | import scala.io.Source.fromFile 22 | import scala.collection.JavaConverters._ 
23 | 24 | import org.apache.hadoop.hbase.client.Scan 25 | import org.apache.hadoop.hbase.protobuf.ProtobufUtil 26 | import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSnapshotInputFormat} 27 | import org.apache.hadoop.hbase.util.Base64 28 | import org.apache.hadoop.hbase.client.Put 29 | import org.apache.hadoop.hbase.client.Result 30 | import org.apache.hadoop.hbase.io.ImmutableBytesWritable 31 | import org.apache.hadoop.hbase.client.HTable 32 | import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} 33 | import org.apache.hadoop.hbase.HColumnDescriptor 34 | import org.apache.hadoop.hbase.client.HBaseAdmin 35 | import org.apache.hadoop.hbase.util.Bytes 36 | import org.apache.hadoop.hbase.CellUtil 37 | import org.apache.hadoop.hbase.KeyValue.Type 38 | import org.apache.hadoop.hbase.KeyValue 39 | import org.apache.hadoop.hbase.mapred.TableOutputFormat 40 | import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat 41 | import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles 42 | import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer 43 | 44 | import org.apache.hadoop.mapred.JobConf 45 | import org.apache.hadoop.mapreduce.Job 46 | import org.apache.hadoop.fs.Path 47 | import org.apache.hadoop.mapred.JobConf 48 | 49 | import org.apache.hadoop.conf._ 50 | import org.apache.hadoop.fs._ 51 | 52 | import java.text.SimpleDateFormat 53 | import java.util.Arrays 54 | import java.util.Date 55 | import java.util.Calendar 56 | import java.lang.String 57 | import util.Random 58 | 59 | object SimulateAndBulkLoadHBaseData{ 60 | 61 | def main(args: Array[String]) { 62 | 63 | val start_time = Calendar.getInstance() 64 | println("[ *** ] Start Time: " + start_time.getTime().toString) 65 | 66 | val props = getProps(args(0)) 67 | val number_of_simulated_records = props.getOrElse("simulated_records", "1000").toInt 68 | 69 | val sparkConf = new SparkConf().setAppName("SimulatedHBaseTable") 70 | val sc = new SparkContext(sparkConf) 71 | 72 | val sqlContext = new org.apache.spark.sql.SQLContext(sc) 73 | import sqlContext.implicits._ 74 | 75 | // HBase table name (if it does not exist, it will be created) 76 | val hTableName = props.getOrElse("simulated_tablename", "hbase_simulated_table") 77 | val columnFamily = "cf" 78 | 79 | println("[ *** ] Creating HBase Configuration") 80 | val hConf = HBaseConfiguration.create() 81 | hConf.set("zookeeper.znode.parent", "/hbase-unsecure") 82 | hConf.set(TableInputFormat.INPUT_TABLE, hTableName) 83 | 84 | val table = new HTable(hConf, hTableName) 85 | 86 | // Create HBase Table 87 | val admin = new HBaseAdmin(hConf) 88 | 89 | if(!admin.isTableAvailable(hTableName)) { 90 | 91 | println("[ *** ] Simulating Data") 92 | val rdd = sc.parallelize(1 to number_of_simulated_records) 93 | 94 | // Setup Random Generator 95 | //val rand = scala.util.Random 96 | 97 | println("[ *** ] Creating KeyValues") 98 | val rdd_out = rdd.map(x => { 99 | val kv: KeyValue = new KeyValue( Bytes.toBytes(x), columnFamily.getBytes(), "c1".getBytes(), x.toString.getBytes() ) 100 | (new ImmutableBytesWritable( Bytes.toBytes(x) ), kv) 101 | }) 102 | 103 | println("[ *** ] Printing simulated data (10 records)") 104 | rdd_out.map(x => x._2.toString).take(10).foreach(x => println(x)) 105 | 106 | println("[ ***] Creating HBase Table (" + hTableName + ")") 107 | val hTableDesc = new HTableDescriptor(hTableName) 108 | hTableDesc.addFamily(new HColumnDescriptor(columnFamily.getBytes())) 109 | admin.createTable(hTableDesc) 110 | 111 | println("[ *** ] Saving data to HDFS as 
KeyValue/HFileOutputFormat (table name = " + hTableName + ")") 112 | val hConf2 = HBaseConfiguration.create() 113 | hConf2.set("zookeeper.znode.parent", "/hbase-unsecure") 114 | hConf2.set(TableOutputFormat.OUTPUT_TABLE, hTableName) 115 | rdd_out.saveAsNewAPIHadoopFile("/tmp/" + hTableName, classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], hConf2) 116 | 117 | println("[ *** ] BulkLoading from HDFS (HFileOutputFormat) to HBase Table (" + hTableName + ")") 118 | val bulkLoader = new LoadIncrementalHFiles(hConf2) 119 | bulkLoader.doBulkLoad(new Path("/tmp/" + hTableName), table) 120 | 121 | }else{ 122 | println("[ *** ] HBase Table ( " + hTableName + " ) already exists!") 123 | println("[ *** ] Stopping the simulation. Remove the existing HBase table and try again.") 124 | } 125 | 126 | 127 | sc.stop() 128 | 129 | 130 | // Print Runtime Metric 131 | val end_time = Calendar.getInstance() 132 | println("[ *** ] Created a table (" + hTableName + ") with " + number_of_simulated_records.toString + " records within HDFS and also bulkloaded it into HBase") 133 | println("[ *** ] Total Runtime: " + ((end_time.getTimeInMillis() - start_time.getTimeInMillis()).toFloat/1000).toString + " seconds") 134 | 135 | } 136 | 137 | 138 | def getArrayProp(props: => HashMap[String,String], prop: => String): Array[String] = { 139 | return props.getOrElse(prop, "").split(",").filter(x => !x.equals("")) 140 | } 141 | 142 | 143 | def getProps(file: => String): HashMap[String,String] = { 144 | var props = new HashMap[String,String] 145 | val lines = fromFile(file).getLines 146 | lines.foreach(x => if (x contains "=") props.put(x.split("=")(0), if (x.split("=").size > 1) x.split("=")(1) else null)) 147 | props 148 | } 149 | 150 | } 151 | 152 | //ZEND 153 | -------------------------------------------------------------------------------- /src/main/scala/com/github/zaratsian/SparkHBase/SparkReadHBaseSnapshot.scala: -------------------------------------------------------------------------------- 1 | 2 | /******************************************************************************************************* 3 | This code does the following: 4 | 1) Read an HBase Snapshot, and convert to Spark RDD (snapshot name is defined in props file) 5 | 2) Parse the records / KeyValue (extracting column family, column name, timestamp, value, etc) 6 | 3) Perform general data processing - Filter the data based on rowkey range AND timestamp (timestamp threshold variable defined in props file) 7 | 4) Write the results to HDFS (formatted for HBase BulkLoad, saved as HFileOutputFormat) 8 | 9 | Usage: 10 | 11 | spark-submit --class com.github.zaratsian.SparkHBase.SparkReadHBaseSnapshot --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props 12 | 13 | ********************************************************************************************************/ 14 | 15 | package com.github.zaratsian.SparkHBase; 16 | 17 | import org.apache.spark.{SparkContext, SparkConf} 18 | import org.apache.spark.sql.Row 19 | import org.apache.spark.sql.functions.avg 20 | 21 | import scala.collection.mutable.HashMap 22 | import scala.io.Source.fromFile 23 | import scala.collection.JavaConverters._ 24 | 25 | import org.apache.hadoop.hbase.client.Scan 26 | import org.apache.hadoop.hbase.protobuf.ProtobufUtil 27 | import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSnapshotInputFormat} 28 | import org.apache.hadoop.hbase.util.Base64 29 | import 
org.apache.hadoop.hbase.client.Put 30 | import org.apache.hadoop.hbase.client.Result 31 | import org.apache.hadoop.hbase.io.ImmutableBytesWritable 32 | import org.apache.hadoop.hbase.client.HTable 33 | import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} 34 | import org.apache.hadoop.hbase.HColumnDescriptor 35 | import org.apache.hadoop.hbase.client.HBaseAdmin 36 | import org.apache.hadoop.hbase.util.Bytes 37 | import org.apache.hadoop.hbase.CellUtil 38 | import org.apache.hadoop.hbase.KeyValue.Type 39 | import org.apache.hadoop.hbase.KeyValue 40 | import org.apache.hadoop.hbase.mapred.TableOutputFormat 41 | import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat 42 | import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles 43 | 44 | import org.apache.hadoop.mapred.JobConf 45 | import org.apache.hadoop.mapreduce.Job 46 | import org.apache.hadoop.fs.Path 47 | import org.apache.hadoop.mapred.JobConf 48 | 49 | import org.apache.hadoop.conf._ 50 | import org.apache.hadoop.fs._ 51 | 52 | import java.text.SimpleDateFormat 53 | import java.util.Arrays 54 | import java.util.Date 55 | import java.util.Calendar 56 | import java.lang.String 57 | 58 | object SparkReadHBaseSnapshot{ 59 | 60 | case class hVar(rowkey: Int, colFamily: String, colQualifier: String, colDatetime: Long, colDatetimeStr: String, colType: String, colValue: String) 61 | 62 | def main(args: Array[String]) { 63 | 64 | val start_time = Calendar.getInstance() 65 | println("[ *** ] Start Time: " + start_time.getTime().toString) 66 | 67 | val props = getProps(args(0)) 68 | val max_versions : Int = props.getOrElse("hbase.snapshot.versions","3").toInt 69 | 70 | val sparkConf = new SparkConf().setAppName("SparkReadHBaseSnapshot") 71 | val sc = new SparkContext(sparkConf) 72 | val sqlContext = new org.apache.spark.sql.SQLContext(sc) 73 | import sqlContext.implicits._ 74 | 75 | println("[ *** ] Creating HBase Configuration") 76 | val hConf = HBaseConfiguration.create() 77 | hConf.set("hbase.rootdir", props.getOrElse("hbase.rootdir", "/tmp")) 78 | hConf.set("hbase.zookeeper.quorum", props.getOrElse("hbase.zookeeper.quorum", "localhost:2181:/hbase-unsecure")) 79 | hConf.set(TableInputFormat.SCAN, convertScanToString(new Scan().setMaxVersions(max_versions)) ) 80 | 81 | val job = Job.getInstance(hConf) 82 | 83 | val path = new Path(props.getOrElse("hbase.snapshot.path", "/user/hbase")) 84 | val snapName = props.getOrElse("hbase.snapshot.name", "customer_info_ss") 85 | 86 | TableSnapshotInputFormat.setInput(job, snapName, path) 87 | 88 | val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, 89 | classOf[TableSnapshotInputFormat], 90 | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], 91 | classOf[org.apache.hadoop.hbase.client.Result]) 92 | 93 | val record_count_raw = hBaseRDD.count() 94 | println("[ *** ] Read in SnapShot (" + snapName.toString + "), which contains " + record_count_raw + " records") 95 | 96 | // Extract the KeyValue element of the tuple 97 | val keyValue = hBaseRDD.map(x => x._2).map(_.list) 98 | 99 | //println("[ *** ] Printing raw SnapShot (10 records) from HBase SnapShot") 100 | //hBaseRDD.map(x => x._1.toString).take(10).foreach(x => println(x)) 101 | //hBaseRDD.map(x => x._2.toString).take(10).foreach(x => println(x)) 102 | //keyValue.map(x => x.toString).take(10).foreach(x => println(x)) 103 | 104 | val df = keyValue.flatMap(x => x.asScala.map(cell => 105 | hVar( 106 | Bytes.toInt(CellUtil.cloneRow(cell)), 107 | Bytes.toStringBinary(CellUtil.cloneFamily(cell)), 108 | 
Bytes.toStringBinary(CellUtil.cloneQualifier(cell)), 109 | cell.getTimestamp, 110 | new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new Date(cell.getTimestamp.toLong)), 111 | Type.codeToType(cell.getTypeByte).toString, 112 | Bytes.toStringBinary(CellUtil.cloneValue(cell)) 113 | ) 114 | ) 115 | ).toDF() 116 | 117 | println("[ *** ] Printing parsed SnapShot (10 records) from HBase SnapShot") 118 | df.show(10, false) 119 | 120 | //Get timestamp (from props) that will be used for filtering 121 | val datetime_threshold = props.getOrElse("datetime_threshold", "2016-08-25 14:27:02:001") 122 | val datetime_threshold_long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").parse(datetime_threshold).getTime() 123 | println("[ *** ] Filtering/Keeping all SnapShot records that are more recent (greater) than the datetime_threshold (set in the props file): " + datetime_threshold.toString) 124 | 125 | println("[ *** ] Filtering Dataframe") 126 | val df_filtered = df.filter($"colDatetime" >= datetime_threshold_long && $"rowkey".between(80001, 90000)) 127 | 128 | /* // Filter RDD (alternative, but using a DF is a better option) 129 | val rdd_filtered = keyValue.flatMap(x => x.asScala.map(cell => 130 | {( 131 | Bytes.toStringBinary(CellUtil.cloneRow(cell)), 132 | Bytes.toStringBinary(CellUtil.cloneFamily(cell)), 133 | Bytes.toStringBinary(CellUtil.cloneQualifier(cell)), 134 | cell.getTimestamp, 135 | new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new Date(cell.getTimestamp.toLong)), 136 | Type.codeToType(cell.getTypeByte).toString, 137 | Bytes.toStringBinary(CellUtil.cloneValue(cell)) 138 | )} 139 | )).filter(x => x._4>=datetime_threshold_long) 140 | */ 141 | 142 | println("[ *** ] Filtered dataframe contains " + df_filtered.count() + " records") 143 | println("[ *** ] Printing filtered HBase SnapShot records (10 records)") 144 | df_filtered.show(10, false) 145 | 146 | // For testing purposes, print datatypes 147 | //df_filtered.dtypes.toList.foreach(x => println(x)) 148 | 149 | // Convert DF to KeyValue 150 | println("[ *** ] Converting dataframe to RDD so that it can be written as HFileOutputFormat using saveAsNewAPIHadoopFile") 151 | val rdd_from_df = df_filtered.rdd.map(x => { 152 | val kv: KeyValue = new KeyValue( Bytes.toBytes(x(0).asInstanceOf[Int]), x(1).toString.getBytes(), x(2).toString.getBytes(), x(3).asInstanceOf[Long], x(6).toString.getBytes() ) 153 | (new ImmutableBytesWritable( Bytes.toBytes(x(0).asInstanceOf[Int]) ), kv) 154 | }) 155 | 156 | /* // Convert RDD to KeyValue 157 | val rdd_to_hbase = rdd_filtered.map(x=>{ 158 | val kv: KeyValue = new KeyValue(Bytes.toBytes(x._1), x._2.getBytes(), x._3.getBytes(), x._7.getBytes() ) 159 | (new ImmutableBytesWritable(Bytes.toBytes(x._1)), kv) 160 | }) 161 | */ 162 | 163 | val time_snapshot_processing = Calendar.getInstance() 164 | println("[ *** ] Runtime for Snapshot Processing: " + ((time_snapshot_processing.getTimeInMillis() - start_time.getTimeInMillis()).toFloat/1000).toString + " seconds") 165 | 166 | // Configure HBase output settings 167 | val hTableName = snapName + "_filtered" 168 | val hConf2 = HBaseConfiguration.create() 169 | hConf2.set("zookeeper.znode.parent", "/hbase-unsecure") 170 | hConf2.set(TableOutputFormat.OUTPUT_TABLE, hTableName) 171 | 172 | //println("[ *** ] Saving results to HDFS as HBase KeyValue HFileOutputFormat. 
This makes it easy to BulkLoad into HBase (see SparkHBaseBulkLoad.scala for bulkload code)") 173 | //rdd_from_df.map(x => x._2.toString).take(10).foreach(x => println(x)) 174 | rdd_from_df.saveAsNewAPIHadoopFile("/tmp/" + hTableName, classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], hConf2) 175 | 176 | // Print Total Runtime 177 | val end_time = Calendar.getInstance() 178 | println("[ *** ] End Time: " + end_time.getTime().toString) 179 | println("[ *** ] Saved " + rdd_from_df.count() + " records to HDFS, located in /tmp/" + hTableName.toString) 180 | println("[ *** ] Runtime for Snapshot Processing: " + ((time_snapshot_processing.getTimeInMillis() - start_time.getTimeInMillis()).toFloat/1000).toString + " seconds") 181 | println("[ *** ] Runtime for Snapshot Processing, saving to HDFS: " + ((end_time.getTimeInMillis() - start_time.getTimeInMillis()).toFloat/1000).toString + " seconds") 182 | 183 | 184 | sc.stop() 185 | 186 | 187 | } 188 | 189 | 190 | def convertScanToString(scan : Scan) = { 191 | val proto = ProtobufUtil.toScan(scan); 192 | Base64.encodeBytes(proto.toByteArray()); 193 | } 194 | 195 | 196 | def getArrayProp(props: => HashMap[String,String], prop: => String): Array[String] = { 197 | return props.getOrElse(prop, "").split(",").filter(x => !x.equals("")) 198 | } 199 | 200 | 201 | def getProps(file: => String): HashMap[String,String] = { 202 | var props = new HashMap[String,String] 203 | val lines = fromFile(file).getLines 204 | lines.foreach(x => if (x contains "=") props.put(x.split("=")(0), if (x.split("=").size > 1) x.split("=")(1) else null)) 205 | props 206 | } 207 | 208 | } 209 | 210 | //ZEND 211 | -------------------------------------------------------------------------------- /write_to_hbase.py: -------------------------------------------------------------------------------- 1 | 2 | ################################################################################################################################################### 3 | # 4 | # This script will create an HBase table and populate X number of records (based on simulated data) 5 | # 6 | # http://happybase.readthedocs.io/en/latest/api.html#table 7 | # https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-2a6efe32-d0e1-4e84-9068-4361b8c36dc8.1.html 8 | # 9 | # NOTE: HBase Thrift Server must be running for this to work 10 | # 11 | # Starting the HBase Thrift Server: 12 | # Foreground: hbase thrift start -p --infoport 13 | # Background: /usr/hdp/current/hbase-master/bin/hbase-daemon.sh start thrift -p --infoport 14 | # 15 | ################################################################################################################################################### 16 | 17 | 18 | import happybase 19 | import os,sys,csv 20 | import random 21 | import time, datetime 22 | 23 | 24 | # User Inputs (could be moved to arguments) 25 | hostname = 'localhost' 26 | port = 9999 27 | table_name = 'customer_info' 28 | columnfamily = 'demographics' 29 | number_of_records = 1000000 30 | batch_size = 2000 31 | 32 | 33 | print '[ INFO ] Trying to connect to the HBase Thrift server at ' + str(hostname) + ':' + str(port) 34 | try: 35 | connection = happybase.Connection(hostname, port=port, timeout=40000) 36 | table = connection.table(table_name) 37 | print '[ INFO ] Successfully connected to the HBase Thrift server at ' + str(hostname) + ':' + str(port) 38 | except: 39 | print '[ ERROR ] Could not connect to HBase Thrift Server at ' + str(hostname) + ':' + 
str(port) + '. Make sure that the HBase Thrift server is running and the host and port number is correct.' 40 | sys.exit() 41 | 42 | 43 | # https://happybase.readthedocs.io/en/latest/api.html#happybase.Connection.create_table 44 | print '[ INFO ] Creating HBase table: ' + str(table_name) 45 | time.sleep(1) 46 | families = { 47 | columnfamily: dict(), # use defaults 48 | } 49 | 50 | connection.create_table(table_name,families) 51 | print '[ INFO ] Successfully created HBase table: ' + str(table_name) 52 | 53 | 54 | print '[ INFO ] Inserting ' + str(number_of_records) + ' records into ' + str(table_name) 55 | start_time = datetime.datetime.now() 56 | with table.batch(batch_size=batch_size) as b: 57 | for i in range(number_of_records): 58 | rowkey = i 59 | custid = random.randint(1000000,9999999) 60 | gender = ['male','female'][random.randint(0,1)] 61 | age = random.randint(18,100) 62 | level = ['silver','gold','platimum','diamond'][random.randint(0,3)] 63 | 64 | b.put(str(rowkey), {b'demographics:custid': str(custid), 65 | b'demographics:gender': str(gender), 66 | b'demographics:age': str(age), 67 | b'demographics:level': str(level)}) 68 | 69 | print '[ INFO ] Successfully inserted ' + str(number_of_records) + ' records into ' + str(table_name) + ' in ' + str((datetime.datetime.now() - start_time).seconds) + ' seconds' 70 | 71 | 72 | print '[ INFO ] Printing data records from the generated HBase table' 73 | for key, data in table.rows([b'1', b'2', b'3', b'4', b'5']): 74 | try: 75 | print(key, data) # prints row key and data for each row 76 | except: 77 | pass 78 | 79 | #ZEND 80 | --------------------------------------------------------------------------------