├── .gitignore
├── README.md
├── pom.xml
├── props
├── screenshots
│   ├── 1_create_hbase_table.png
│   ├── 2_hbase_scan.png
│   ├── 3_hbase_filtered_output_in_spark.png
│   ├── Screen Shot 2016-09-27 at 10.58.13 AM.png
│   ├── hbase_records.png
│   ├── hbase_spark_output.png
│   └── hbase_spark_output_raw.png
├── src
│   └── main
│       └── scala
│           └── com
│               └── github
│                   └── zaratsian
│                       └── SparkHBase
│                           ├── SimulateAndBulkLoadHBaseData.scala
│                           └── SparkReadHBaseSnapshot.scala
└── write_to_hbase.py
/.gitignore:
--------------------------------------------------------------------------------
1 | target/
2 | screenshots/
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# HBase Snapshot to Spark Example

This project shows how to analyze an HBase snapshot using Spark.

### Why do this?

The main motivation for this code is to reduce the load on the HBase Region Servers while analyzing HBase records. By creating a snapshot of the HBase table, we can run Spark jobs against the snapshot, eliminating the impact on the region servers and reducing the risk to operational systems.

At a high level, here's what the code does (a condensed sketch follows this list):

1. Reads an HBase snapshot into a Spark RDD
2. Parses the HBase KeyValues into a Spark DataFrame
3. Applies arbitrary data processing (timestamp and rowkey filtering)
4. Saves the results back to HDFS in HBase (HFile/KeyValue) format, using HFileOutputFormat
   - The output maintains the original rowkey, timestamp, column family, qualifier, and value structure.
5. From here, you can bulkload the HDFS files into HBase (see the bulkload sketch further below).
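For orientation, here is a condensed sketch of the read-and-filter path, assuming a running `spark-shell` (so `sc` already exists). The full version, including props handling, the DataFrame conversion, and the HFile write, is in `SparkReadHBaseSnapshot.scala`; the snapshot name, restore path, and filter values below are only examples:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hConf = HBaseConfiguration.create()
val job = Job.getInstance(hConf)

// Point the job at the snapshot; the snapshot files are read directly from HDFS,
// so the HBase Region Servers are never involved.
TableSnapshotInputFormat.setInput(job, "hbase_simulated_1m_ss", new Path("/user/hbase"))

val snapshotRDD = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Flatten each Result into its cells, then keep cells in the rowkey range
// 80001-90000 with a timestamp at or after the threshold.
val filtered = snapshotRDD
  .flatMap { case (_, result) => result.rawCells() }
  .filter { cell =>
    val rowkey = Bytes.toInt(CellUtil.cloneRow(cell))
    rowkey >= 80001 && rowkey <= 90000 && cell.getTimestamp >= 1474571655001L
  }

println(filtered.count())
```

The real job also sets `hbase.rootdir` and `hbase.zookeeper.quorum` from the props file and writes the filtered cells back out with `saveAsNewAPIHadoopFile` and `HFileOutputFormat`.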
### How to run this project

1. Create an HBase table and populate it with data (or use an existing table). This repo includes two ways to simulate an HBase table for testing: `SimulateAndBulkLoadHBaseData.scala` (the preferred method) or `write_to_hbase.py` (much slower than the Scala code).

2. Take an HBase snapshot: `snapshot 'hbase_simulated_1m', 'hbase_simulated_1m_ss'`

3. (Optional) The HBase snapshot will already be in HDFS (at /apps/hbase/data), but you can use the following if you want to export the snapshot to an HDFS location of your choice:

   `hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot hbase_simulated_1m_ss -copy-to /tmp/ -mappers 2`

4. Run the included Spark (Scala) code against the HBase snapshot. It reads the snapshot, filters records on a rowkey range (80001 to 90000) and on a timestamp threshold (set in the props file), then writes the results back to HDFS in HBase format (HFiles/KeyValue).

   a. Build the project: `mvn clean package`

   b. Run the Spark job: `spark-submit --class com.github.zaratsian.SparkHBase.SparkReadHBaseSnapshot --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props`

   c. NOTE: Adjust the properties in the props file (if needed) to match your configuration.
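From here, the filtered HFiles written to HDFS in step 4 can be bulkloaded into HBase. Below is a minimal sketch using the same `LoadIncrementalHFiles` approach as `SimulateAndBulkLoadHBaseData.scala`; the table name and HDFS path are assumptions based on this repo's defaults, and the target table (with the matching column family) must already exist:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

val hConf = HBaseConfiguration.create()
hConf.set("zookeeper.znode.parent", "/hbase-unsecure")

// Target table; SparkReadHBaseSnapshot writes its output to /tmp/<snapshot_name>_filtered
val tableName = "hbase_simulated_1m_ss_filtered"
val table = new HTable(hConf, tableName)

// Move the generated HFiles into the table's regions
val bulkLoader = new LoadIncrementalHFiles(hConf)
bulkLoader.doBulkLoad(new Path("/tmp/" + tableName), table)
```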
### Preliminary Performance Metrics

| Number of Records | Spark Runtime (without write to HDFS) | Spark Runtime (with write to HDFS) | HBase Shell Scan (without write to HDFS) |
| ----------------- | ------------------------------------- | ---------------------------------- | ---------------------------------------- |
| 1,000,000         | 27.07 seconds                         | 36.05 seconds                      | 6.8600 seconds                           |
| 50,000,000        | 417.38 seconds                        | 764.801 seconds                    | 7.5970 seconds                           |
| 100,000,000       | 741.829 seconds                       | 1413.001 seconds                   | 8.1380 seconds                           |

NOTE: Here is the HBase shell scan that was used:

`scan 'hbase_simulated_100m', {STARTROW => "\x00\x01\x38\x81", ENDROW => "\x00\x01\x5F\x90", TIMERANGE => [1474571655001, 9999999999999]}`

This scan filters a 100 million record HBase table on the rowkey range 80001-90000 (`\x00\x01\x38\x81` to `\x00\x01\x5F\x90`) and on an arbitrary time range, specified in epoch milliseconds.
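Those byte-string rowkeys are just the big-endian 4-byte encoding of the integer rowkeys produced by the simulator. A quick way to verify the mapping (for example in `spark-shell`), using the same HBase `Bytes` utility this project already uses:

```scala
import org.apache.hadoop.hbase.util.Bytes

// Encode an Int rowkey the same way SimulateAndBulkLoadHBaseData.scala does,
// and render it in the \xNN notation used by the HBase shell scan above.
def toShellKey(rowkey: Int): String =
  Bytes.toBytes(rowkey).map(b => "\\x%02X".format(b & 0xFF)).mkString

println(toShellKey(80001))  // \x00\x01\x38\x81
println(toShellKey(90000))  // \x00\x01\x5F\x90
```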
Sample output of the simulated HBase data structure (using SimulateAndBulkLoadHBaseData.scala):

(see the screenshots directory)

Sample output of the simulated HBase data structure (using write_to_hbase.py):

(see the screenshots directory)

### Versions

This code was tested with:

- Hortonworks HDP 2.5.0.0
- HBase 1.1.2
- Spark 1.6.2
- Scala 2.10.5

### References

- HBaseConfiguration Class
- HBase TableSnapshotInputFormat Class
- HBase KeyValue Class
- HBase Bytes Class
- HBase CellUtil Class
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <name>SparkHBaseExample</name>
  <packaging>jar</packaging>

  <groupId>com.github.zaratsian</groupId>
  <artifactId>SparkHBaseExample</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <properties>
    <hdp.version>2.5.0.0-1245</hdp.version>
    <hadoop.version>2.7.3</hadoop.version>
    <spark.version>1.6.2</spark.version>
    <scala.version>2.10</scala.version>
    <hbase.version>1.1.2</hbase.version>
  </properties>

  <repositories>
    <repository>
      <id>hortonworks</id>
      <url>http://repo.hortonworks.com/content/repositories/releases/</url>
    </repository>
    <repository>
      <id>repo.hortonworks.com-jetty</id>
      <name>Hortonworks Jetty Maven Repository</name>
      <url>http://repo.hortonworks.com/content/repositories/jetty-hadoop/</url>
    </repository>
    <repository>
      <id>central2</id>
      <url>http://central.maven.org/maven2/</url>
    </repository>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <dependencies>

    <dependency>
      <groupId>com.typesafe</groupId>
      <artifactId>config</artifactId>
      <version>1.2.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>${hbase.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>${hbase.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>${hbase.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.version}.${hdp.version}</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.5</version>
      <scope>provided</scope>
    </dependency>

  </dependencies>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>

      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>1.4</version>
        <configuration>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>com.github.zaratsian.SparkHBase.SparkHBase</mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>

    </plugins>
  </build>

</project>
--------------------------------------------------------------------------------
/props:
--------------------------------------------------------------------------------
1 | ##############################################################
2 | #
3 | # General Props
4 | #
5 | ##############################################################
6 | hbase.rootdir=/apps/hbase/data
7 | hbase.zookeeper.quorum=localhost:2181:/hbase-unsecure
8 | hbase.snapshot.path=/user/hbase
9 |
10 |
11 | ##############################################################
12 | #
13 | # Props for SparkReadHBaseSnapshot
14 | #
15 | ##############################################################
16 | hbase.snapshot.name=hbase_simulated_50m_ss
17 | hbase.snapshot.versions=3
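# NOTE: datetime_threshold is parsed by SparkReadHBaseSnapshot with the pattern
# yyyy-MM-dd HH:mm:ss:SSS; records with a timestamp at or after this value are kept.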
18 | datetime_threshold=2016-09-23 06:27:08:000
19 |
20 |
21 | ##############################################################
22 | #
23 | # Props for SimulateAndBulkLoadHBaseData
24 | #
25 | ##############################################################
26 | simulated_tablename=hbase_simulated_50m
27 | simulated_records=50000000
28 |
--------------------------------------------------------------------------------
/screenshots/1_create_hbase_table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/1_create_hbase_table.png
--------------------------------------------------------------------------------
/screenshots/2_hbase_scan.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/2_hbase_scan.png
--------------------------------------------------------------------------------
/screenshots/3_hbase_filtered_output_in_spark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/3_hbase_filtered_output_in_spark.png
--------------------------------------------------------------------------------
/screenshots/Screen Shot 2016-09-27 at 10.58.13 AM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/Screen Shot 2016-09-27 at 10.58.13 AM.png
--------------------------------------------------------------------------------
/screenshots/hbase_records.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/hbase_records.png
--------------------------------------------------------------------------------
/screenshots/hbase_spark_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/hbase_spark_output.png
--------------------------------------------------------------------------------
/screenshots/hbase_spark_output_raw.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zaratsian/SparkHBaseExample/c2d20602cf143ce90cd20ec1d4d083ace2d5ad6e/screenshots/hbase_spark_output_raw.png
--------------------------------------------------------------------------------
/src/main/scala/com/github/zaratsian/SparkHBase/SimulateAndBulkLoadHBaseData.scala:
--------------------------------------------------------------------------------
1 |
2 | /*******************************************************************************************************
3 |
4 | This code:
5 | 1) Simulates X number of HBase records (number of simulated records can be defined in props file)
6 | 2) Bulkloads the data into HBase (table name can be defined in props file)
7 |
8 | Usage:
9 |
10 | spark-submit --class com.github.zaratsian.SparkHBase.SimulateAndBulkLoadHBaseData --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props
11 |
12 | ********************************************************************************************************/
13 |
14 | package com.github.zaratsian.SparkHBase;
15 |
16 | import org.apache.spark.{SparkContext, SparkConf}
17 | import org.apache.spark.sql.Row
18 | import org.apache.spark.sql.functions.avg
19 |
20 | import scala.collection.mutable.HashMap
21 | import scala.io.Source.fromFile
22 | import scala.collection.JavaConverters._
23 |
24 | import org.apache.hadoop.hbase.client.Scan
25 | import org.apache.hadoop.hbase.protobuf.ProtobufUtil
26 | import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSnapshotInputFormat}
27 | import org.apache.hadoop.hbase.util.Base64
28 | import org.apache.hadoop.hbase.client.Put
29 | import org.apache.hadoop.hbase.client.Result
30 | import org.apache.hadoop.hbase.io.ImmutableBytesWritable
31 | import org.apache.hadoop.hbase.client.HTable
32 | import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
33 | import org.apache.hadoop.hbase.HColumnDescriptor
34 | import org.apache.hadoop.hbase.client.HBaseAdmin
35 | import org.apache.hadoop.hbase.util.Bytes
36 | import org.apache.hadoop.hbase.CellUtil
37 | import org.apache.hadoop.hbase.KeyValue.Type
38 | import org.apache.hadoop.hbase.KeyValue
39 | import org.apache.hadoop.hbase.mapred.TableOutputFormat
40 | import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
41 | import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
42 | import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer
43 |
44 | import org.apache.hadoop.mapred.JobConf
45 | import org.apache.hadoop.mapreduce.Job
46 | import org.apache.hadoop.fs.Path
47 |
48 |
49 | import org.apache.hadoop.conf._
50 | import org.apache.hadoop.fs._
51 |
52 | import java.text.SimpleDateFormat
53 | import java.util.Arrays
54 | import java.util.Date
55 | import java.util.Calendar
56 | import java.lang.String
57 | import util.Random
58 |
59 | object SimulateAndBulkLoadHBaseData{
60 |
61 | def main(args: Array[String]) {
62 |
63 | val start_time = Calendar.getInstance()
64 | println("[ *** ] Start Time: " + start_time.getTime().toString)
65 |
66 | val props = getProps(args(0))
67 | val number_of_simulated_records = props.getOrElse("simulated_records", "1000").toInt
68 |
69 | val sparkConf = new SparkConf().setAppName("SimulatedHBaseTable")
70 | val sc = new SparkContext(sparkConf)
71 |
72 | val sqlContext = new org.apache.spark.sql.SQLContext(sc)
73 | import sqlContext.implicits._
74 |
75 | // HBase table name (if it does not exist, it will be created)
76 | val hTableName = props.getOrElse("simulated_tablename", "hbase_simulated_table")
77 | val columnFamily = "cf"
78 |
79 | println("[ *** ] Creating HBase Configuration")
80 | val hConf = HBaseConfiguration.create()
81 | hConf.set("zookeeper.znode.parent", "/hbase-unsecure")
82 | hConf.set(TableInputFormat.INPUT_TABLE, hTableName)
83 |
84 | val table = new HTable(hConf, hTableName)
85 |
86 | // Create HBase Table
87 | val admin = new HBaseAdmin(hConf)
88 |
89 | if(!admin.isTableAvailable(hTableName)) {
90 |
91 | println("[ *** ] Simulating Data")
92 | val rdd = sc.parallelize(1 to number_of_simulated_records)
93 |
94 | // Setup Random Generator
95 | //val rand = scala.util.Random
96 |
97 | println("[ *** ] Creating KeyValues")
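// NOTE: HFileOutputFormat requires KeyValues in sorted rowkey order. Here the keys are the
// big-endian bytes of the positive Ints 1..N, so the byte ordering matches the numeric
// ordering and the data is already sorted.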
98 | val rdd_out = rdd.map(x => {
99 | val kv: KeyValue = new KeyValue( Bytes.toBytes(x), columnFamily.getBytes(), "c1".getBytes(), x.toString.getBytes() )
100 | (new ImmutableBytesWritable( Bytes.toBytes(x) ), kv)
101 | })
102 |
103 | println("[ *** ] Printing simulated data (10 records)")
104 | rdd_out.map(x => x._2.toString).take(10).foreach(x => println(x))
105 |
106 | println("[ ***] Creating HBase Table (" + hTableName + ")")
107 | val hTableDesc = new HTableDescriptor(hTableName)
108 | hTableDesc.addFamily(new HColumnDescriptor(columnFamily.getBytes()))
109 | admin.createTable(hTableDesc)
110 |
111 | println("[ *** ] Saving data to HDFS as KeyValue/HFileOutputFormat (table name = " + hTableName + ")")
112 | val hConf2 = HBaseConfiguration.create()
113 | hConf2.set("zookeeper.znode.parent", "/hbase-unsecure")
114 | hConf2.set(TableOutputFormat.OUTPUT_TABLE, hTableName)
115 | rdd_out.saveAsNewAPIHadoopFile("/tmp/" + hTableName, classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], hConf2)
116 |
117 | println("[ *** ] BulkLoading from HDFS (HFileOutputFormat) to HBase Table (" + hTableName + ")")
118 | val bulkLoader = new LoadIncrementalHFiles(hConf2)
119 | bulkLoader.doBulkLoad(new Path("/tmp/" + hTableName), table)
120 |
121 | }else{
122 | println("[ *** ] HBase Table ( " + hTableName + " ) already exists!")
123 | println("[ *** ] Stopping the simulation. Remove the existing HBase table and try again.")
124 | }
125 |
126 |
127 | sc.stop()
128 |
129 |
130 | // Print Runtime Metric
131 | val end_time = Calendar.getInstance()
132 | println("[ *** ] Created a table (" + hTableName + ") with " + number_of_simulated_records.toString + " records within HDFS and also bulkloaded it into HBase")
133 | println("[ *** ] Total Runtime: " + ((end_time.getTimeInMillis() - start_time.getTimeInMillis()).toFloat/1000).toString + " seconds")
134 |
135 | }
136 |
137 |
138 | def getArrayProp(props: => HashMap[String,String], prop: => String): Array[String] = {
139 | return props.getOrElse(prop, "").split(",").filter(x => !x.equals(""))
140 | }
141 |
142 |
143 | def getProps(file: => String): HashMap[String,String] = {
144 | var props = new HashMap[String,String]
145 | val lines = fromFile(file).getLines
146 | lines.foreach(x => if (x contains "=") props.put(x.split("=")(0), if (x.split("=").size > 1) x.split("=")(1) else null))
147 | props
148 | }
149 |
150 | }
151 |
152 | //ZEND
153 |
--------------------------------------------------------------------------------
/src/main/scala/com/github/zaratsian/SparkHBase/SparkReadHBaseSnapshot.scala:
--------------------------------------------------------------------------------
1 |
2 | /*******************************************************************************************************
3 | This code does the following:
4 | 1) Read an HBase Snapshot, and convert to Spark RDD (snapshot name is defined in props file)
5 | 2) Parse the records / KeyValue (extracting column family, column name, timestamp, value, etc)
6 | 3) Perform general data processing - Filter the data based on rowkey range AND timestamp (timestamp threshold variable defined in props file)
7 | 4) Write the results to HDFS (formatted for HBase BulkLoad, saved as HFileOutputFormat)
8 |
9 | Usage:
10 |
11 | spark-submit --class com.github.zaratsian.SparkHBase.SparkReadHBaseSnapshot --jars /tmp/SparkHBaseExample-0.0.1-SNAPSHOT.jar /usr/hdp/current/phoenix-client/phoenix-client.jar /tmp/props
12 |
13 | ********************************************************************************************************/
14 |
15 | package com.github.zaratsian.SparkHBase;
16 |
17 | import org.apache.spark.{SparkContext, SparkConf}
18 | import org.apache.spark.sql.Row
19 | import org.apache.spark.sql.functions.avg
20 |
21 | import scala.collection.mutable.HashMap
22 | import scala.io.Source.fromFile
23 | import scala.collection.JavaConverters._
24 |
25 | import org.apache.hadoop.hbase.client.Scan
26 | import org.apache.hadoop.hbase.protobuf.ProtobufUtil
27 | import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSnapshotInputFormat}
28 | import org.apache.hadoop.hbase.util.Base64
29 | import org.apache.hadoop.hbase.client.Put
30 | import org.apache.hadoop.hbase.client.Result
31 | import org.apache.hadoop.hbase.io.ImmutableBytesWritable
32 | import org.apache.hadoop.hbase.client.HTable
33 | import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
34 | import org.apache.hadoop.hbase.HColumnDescriptor
35 | import org.apache.hadoop.hbase.client.HBaseAdmin
36 | import org.apache.hadoop.hbase.util.Bytes
37 | import org.apache.hadoop.hbase.CellUtil
38 | import org.apache.hadoop.hbase.KeyValue.Type
39 | import org.apache.hadoop.hbase.KeyValue
40 | import org.apache.hadoop.hbase.mapred.TableOutputFormat
41 | import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
42 | import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
43 |
44 | import org.apache.hadoop.mapred.JobConf
45 | import org.apache.hadoop.mapreduce.Job
46 | import org.apache.hadoop.fs.Path
47 |
48 |
49 | import org.apache.hadoop.conf._
50 | import org.apache.hadoop.fs._
51 |
52 | import java.text.SimpleDateFormat
53 | import java.util.Arrays
54 | import java.util.Date
55 | import java.util.Calendar
56 | import java.lang.String
57 |
58 | object SparkReadHBaseSnapshot{
59 |
60 | case class hVar(rowkey: Int, colFamily: String, colQualifier: String, colDatetime: Long, colDatetimeStr: String, colType: String, colValue: String)
61 |
62 | def main(args: Array[String]) {
63 |
64 | val start_time = Calendar.getInstance()
65 | println("[ *** ] Start Time: " + start_time.getTime().toString)
66 |
67 | val props = getProps(args(0))
68 | val max_versions : Int = props.getOrElse("hbase.snapshot.versions","3").toInt
69 |
70 | val sparkConf = new SparkConf().setAppName("SparkReadHBaseSnapshot")
71 | val sc = new SparkContext(sparkConf)
72 | val sqlContext = new org.apache.spark.sql.SQLContext(sc)
73 | import sqlContext.implicits._
74 |
75 | println("[ *** ] Creating HBase Configuration")
76 | val hConf = HBaseConfiguration.create()
77 | hConf.set("hbase.rootdir", props.getOrElse("hbase.rootdir", "/tmp"))
78 | hConf.set("hbase.zookeeper.quorum", props.getOrElse("hbase.zookeeper.quorum", "localhost:2181:/hbase-unsecure"))
79 | hConf.set(TableInputFormat.SCAN, convertScanToString(new Scan().setMaxVersions(max_versions)) )
80 |
81 | val job = Job.getInstance(hConf)
82 |
83 | val path = new Path(props.getOrElse("hbase.snapshot.path", "/user/hbase"))
84 | val snapName = props.getOrElse("hbase.snapshot.name", "customer_info_ss")
85 |
86 | TableSnapshotInputFormat.setInput(job, snapName, path)
87 |
88 | val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration,
89 | classOf[TableSnapshotInputFormat],
90 | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
91 | classOf[org.apache.hadoop.hbase.client.Result])
92 |
93 | val record_count_raw = hBaseRDD.count()
94 | println("[ *** ] Read in SnapShot (" + snapName.toString + "), which contains " + record_count_raw + " records")
95 |
96 | // Extract the KeyValue element of the tuple
97 | val keyValue = hBaseRDD.map(x => x._2).map(_.list)
98 |
99 | //println("[ *** ] Printing raw SnapShot (10 records) from HBase SnapShot")
100 | //hBaseRDD.map(x => x._1.toString).take(10).foreach(x => println(x))
101 | //hBaseRDD.map(x => x._2.toString).take(10).foreach(x => println(x))
102 | //keyValue.map(x => x.toString).take(10).foreach(x => println(x))
103 |
104 | val df = keyValue.flatMap(x => x.asScala.map(cell =>
105 | hVar(
106 | Bytes.toInt(CellUtil.cloneRow(cell)),
107 | Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
108 | Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
109 | cell.getTimestamp,
110 | new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new Date(cell.getTimestamp.toLong)),
111 | Type.codeToType(cell.getTypeByte).toString,
112 | Bytes.toStringBinary(CellUtil.cloneValue(cell))
113 | )
114 | )
115 | ).toDF()
116 |
117 | println("[ *** ] Printing parsed SnapShot (10 records) from HBase SnapShot")
118 | df.show(10, false)
119 |
120 | //Get timestamp (from props) that will be used for filtering
121 | val datetime_threshold = props.getOrElse("datetime_threshold", "2016-08-25 14:27:02:001")
122 | val datetime_threshold_long = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").parse(datetime_threshold).getTime()
123 | println("[ *** ] Filtering/Keeping all SnapShot records that are more recent (greater) than the datetime_threshold (set in the props file): " + datetime_threshold.toString)
124 |
125 | println("[ *** ] Filtering Dataframe")
126 | val df_filtered = df.filter($"colDatetime" >= datetime_threshold_long && $"rowkey".between(80001, 90000))
127 |
128 | /* // Filter RDD (alternative, but using a DF is a better option)
129 | val rdd_filtered = keyValue.flatMap(x => x.asScala.map(cell =>
130 | {(
131 | Bytes.toStringBinary(CellUtil.cloneRow(cell)),
132 | Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
133 | Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
134 | cell.getTimestamp,
135 | new SimpleDateFormat("yyyy-MM-dd HH:mm:ss:SSS").format(new Date(cell.getTimestamp.toLong)),
136 | Type.codeToType(cell.getTypeByte).toString,
137 | Bytes.toStringBinary(CellUtil.cloneValue(cell))
138 | )}
139 | )).filter(x => x._4>=datetime_threshold_long)
140 | */
141 |
142 | println("[ *** ] Filtered dataframe contains " + df_filtered.count() + " records")
143 | println("[ *** ] Printing filtered HBase SnapShot records (10 records)")
144 | df_filtered.show(10, false)
145 |
146 | // For testing purposes, print datatypes
147 | //df_filtered.dtypes.toList.foreach(x => println(x))
148 |
149 | // Convert DF to KeyValue
150 | println("[ *** ] Converting dataframe to RDD so that it can be written as HFileOutputFormat using saveAsNewAPIHadoopFile")
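// Row fields follow the hVar case class order:
//   x(0)=rowkey (Int), x(1)=colFamily, x(2)=colQualifier, x(3)=colDatetime (Long), x(6)=colValue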
151 | val rdd_from_df = df_filtered.rdd.map(x => {
152 | val kv: KeyValue = new KeyValue( Bytes.toBytes(x(0).asInstanceOf[Int]), x(1).toString.getBytes(), x(2).toString.getBytes(), x(3).asInstanceOf[Long], x(6).toString.getBytes() )
153 | (new ImmutableBytesWritable( Bytes.toBytes(x(0).asInstanceOf[Int]) ), kv)
154 | })
155 |
156 | /* // Convert RDD to KeyValue
157 | val rdd_to_hbase = rdd_filtered.map(x=>{
158 | val kv: KeyValue = new KeyValue(Bytes.toBytes(x._1), x._2.getBytes(), x._3.getBytes(), x._7.getBytes() )
159 | (new ImmutableBytesWritable(Bytes.toBytes(x._1)), kv)
160 | })
161 | */
162 |
163 | val time_snapshot_processing = Calendar.getInstance()
164 | println("[ *** ] Runtime for Snapshot Processing: " + ((time_snapshot_processing.getTimeInMillis() - start_time.getTimeInMillis()).toFloat/1000).toString + " seconds")
165 |
166 | // Configure HBase output settings
167 | val hTableName = snapName + "_filtered"
168 | val hConf2 = HBaseConfiguration.create()
169 | hConf2.set("zookeeper.znode.parent", "/hbase-unsecure")
170 | hConf2.set(TableOutputFormat.OUTPUT_TABLE, hTableName)
171 |
172 | //println("[ *** ] Saving results to HDFS as HBase KeyValue HFileOutputFormat. This makes it easy to BulkLoad into HBase (see SparkHBaseBulkLoad.scala for bulkload code)")
173 | //rdd_from_df.map(x => x._2.toString).take(10).foreach(x => println(x))
174 | rdd_from_df.saveAsNewAPIHadoopFile("/tmp/" + hTableName, classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], hConf2)
175 |
176 | // Print Total Runtime
177 | val end_time = Calendar.getInstance()
178 | println("[ *** ] End Time: " + end_time.getTime().toString)
179 | println("[ *** ] Saved " + rdd_from_df.count() + " records to HDFS, located in /tmp/" + hTableName.toString)
180 | println("[ *** ] Runtime for Snapshot Processing: " + ((time_snapshot_processing.getTimeInMillis() - start_time.getTimeInMillis()).toFloat/1000).toString + " seconds")
181 | println("[ *** ] Runtime for Snapshot Processing, saving to HDFS: " + ((end_time.getTimeInMillis() - start_time.getTimeInMillis()).toFloat/1000).toString + " seconds")
182 |
183 |
184 | sc.stop()
185 |
186 |
187 | }
188 |
189 |
190 | def convertScanToString(scan : Scan) = {
191 | val proto = ProtobufUtil.toScan(scan);
192 | Base64.encodeBytes(proto.toByteArray());
193 | }
194 |
195 |
196 | def getArrayProp(props: => HashMap[String,String], prop: => String): Array[String] = {
197 | return props.getOrElse(prop, "").split(",").filter(x => !x.equals(""))
198 | }
199 |
200 |
201 | def getProps(file: => String): HashMap[String,String] = {
202 | var props = new HashMap[String,String]
203 | val lines = fromFile(file).getLines
204 | lines.foreach(x => if (x contains "=") props.put(x.split("=")(0), if (x.split("=").size > 1) x.split("=")(1) else null))
205 | props
206 | }
207 |
208 | }
209 |
210 | //ZEND
211 |
--------------------------------------------------------------------------------
/write_to_hbase.py:
--------------------------------------------------------------------------------
1 |
2 | ###################################################################################################################################################
3 | #
4 | # This script will create an HBase table and populate X number of records (based on simulated data)
5 | #
6 | # http://happybase.readthedocs.io/en/latest/api.html#table
7 | # https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-2a6efe32-d0e1-4e84-9068-4361b8c36dc8.1.html
8 | #
9 | # NOTE: HBase Thrift Server must be running for this to work
10 | #
11 | # Starting the HBase Thrift Server:
12 | #   Foreground: hbase thrift start -p <port> --infoport <infoport>
13 | #   Background: /usr/hdp/current/hbase-master/bin/hbase-daemon.sh start thrift -p <port> --infoport <infoport>
14 | #
15 | ###################################################################################################################################################
16 |
17 |
18 | import happybase
19 | import os,sys,csv
20 | import random
21 | import time, datetime
22 |
23 |
24 | # User Inputs (could be moved to arguments)
25 | hostname = 'localhost'
26 | port = 9999
27 | table_name = 'customer_info'
28 | columnfamily = 'demographics'
29 | number_of_records = 1000000
30 | batch_size = 2000
31 |
32 |
33 | print '[ INFO ] Trying to connect to the HBase Thrift server at ' + str(hostname) + ':' + str(port)
34 | try:
35 |     connection = happybase.Connection(hostname, port=port, timeout=40000)
36 |     table = connection.table(table_name)
37 |     print '[ INFO ] Successfully connected to the HBase Thrift server at ' + str(hostname) + ':' + str(port)
38 | except:
39 |     print '[ ERROR ] Could not connect to HBase Thrift Server at ' + str(hostname) + ':' + str(port) + '. Make sure that the HBase Thrift server is running and the host and port number is correct.'
40 |     sys.exit()
41 | 
42 | 
43 | # https://happybase.readthedocs.io/en/latest/api.html#happybase.Connection.create_table
44 | print '[ INFO ] Creating HBase table: ' + str(table_name)
45 | time.sleep(1)
46 | families = {
47 |     columnfamily: dict(),  # use defaults
48 | }
49 | 
50 | connection.create_table(table_name, families)
51 | print '[ INFO ] Successfully created HBase table: ' + str(table_name)
52 | 
53 | 
54 | print '[ INFO ] Inserting ' + str(number_of_records) + ' records into ' + str(table_name)
55 | start_time = datetime.datetime.now()
56 | with table.batch(batch_size=batch_size) as b:
57 |     for i in range(number_of_records):
58 |         rowkey = i
59 |         custid = random.randint(1000000,9999999)
60 |         gender = ['male','female'][random.randint(0,1)]
61 |         age = random.randint(18,100)
62 |         level = ['silver','gold','platinum','diamond'][random.randint(0,3)]
63 | 
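        # NOTE: the rowkey is stored as the plain string of i, so rows sort lexicographically
        # ("10" < "2"), unlike the fixed-width 4-byte Int rowkeys used by SimulateAndBulkLoadHBaseData.scala.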
64 |         b.put(str(rowkey), {b'demographics:custid': str(custid),
65 |                             b'demographics:gender': str(gender),
66 |                             b'demographics:age': str(age),
67 |                             b'demographics:level': str(level)})
68 | 
69 | print '[ INFO ] Successfully inserted ' + str(number_of_records) + ' records into ' + str(table_name) + ' in ' + str((datetime.datetime.now() - start_time).seconds) + ' seconds'
70 | 
71 | 
72 | print '[ INFO ] Printing data records from the generated HBase table'
73 | for key, data in table.rows([b'1', b'2', b'3', b'4', b'5']):
74 |     try:
75 |         print(key, data)  # prints row key and data for each row
76 |     except:
77 |         pass
78 | 
79 | #ZEND
80 |
--------------------------------------------------------------------------------