├── .gitignore
├── README.md
├── data-validator.ipynb
├── flow.jpeg
├── generated.json
├── guide.md
├── output
│   └── data-validator.ipynb
├── papermill_notebook_runner.py
└── schema.avsc
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Bigdata profiler
2 |
3 | This is a tool to profile your incoming data, check whether it adheres to a registered schema, and run custom data quality checks. At the end of each run, a human-readable report is autogenerated that can be sent to stakeholders.
4 |
5 | ## Features
6 |
7 | * Config-driven data profiling and schema validation
8 | * Autogeneration of a report after every run
9 | * Integration with the Datadog monitoring system
10 | * Extensible and highly customizable
11 | * Very little boilerplate code
12 | * Support for versioned schema validation
13 |
14 | ### Data formats currently supported
15 |
16 | * CSV
17 | * JSON
18 | * Parquet
19 |
20 | This can easily be extended to all the formats that Apache Spark supports for reads.
21 |
22 | ### SQL support for custom data quality checks
23 |
24 | Supports both ANSI SQL and HiveQL. The list of all supported SQL functions can be found [here](https://spark.apache.org/docs/2.3.1/api/sql/index.html).
25 |
26 | ## Contents
27 |
28 | * Data validator [notebook tool](data-validator.ipynb)
29 | * Sample [dataset](generated.json)
30 | * Sample dataset [schema](schema.avsc)
31 | * Sample [result report](output/data-validator.ipynb)
32 | * [Runner script](papermill_notebook_runner.py)
33 |
34 | ## Run Instructions
35 |
36 | All one has to do is execute the Python script `papermill_notebook_runner.py`. The script takes the following arguments, in order (a sketch of a minimal runner follows the example invocation below):
37 |
38 | * Path to the notebook to be run.
39 | * Path to the output notebook.
40 | * JSON configuration that will drive the notebook.
41 |
42 | ```bash
43 | python papermill_notebook_runner.py data-validator.ipynb output/data-validator.ipynb '{"dataFormat":"json","inputDataLocation":"s3a://bucket/prefix/generated.json","appName":"cust-profile-data-validation","schemaRepoUrl":"http://schemarepohostaddress","scheRepoSubjectName":"cust-profile","schemaVersionId":"0","customQ1":"select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset","customQ1ResultThreshold":0,"customQ1Operator":"=","customQ2":"select CAST(length(phone) as Long) from dataset","customQ2ResultThreshold":17,"customQ2Operator":"=","customQ3":"select CAST(count(distinct gender) as Long) from dataset","customQ3ResultThreshold":3,"customQ3Operator":"<="}'
44 | ```
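
For reference, here is a minimal sketch of what such a runner can look like, assuming papermill's Python API (`papermill.execute_notebook`); the actual `papermill_notebook_runner.py` in this repository may differ in its details.

```python
import json
import sys

import papermill as pm


def main() -> None:
    # Arguments, in the order described above.
    input_notebook = sys.argv[1]           # path to the notebook to be run
    output_notebook = sys.argv[2]          # path to the output (report) notebook
    parameters = json.loads(sys.argv[3])   # JSON configuration that drives the notebook

    # papermill injects these parameters into the cell tagged "parameters",
    # overriding the default values defined in data-validator.ipynb.
    pm.execute_notebook(input_notebook, output_notebook, parameters=parameters)


if __name__ == "__main__":
    main()
```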
45 |
46 | ## Install Instructions
47 |
48 | There are several pieces involved.
49 |
50 | * First, install Jupyter Notebook. Install instructions [here](https://jupyter.org/install).
51 | * Next, install sparkmagic. Install instructions [here](https://github.com/jupyter-incubator/sparkmagic).
52 | * Configure sparkmagic with your own Apache Livy endpoints. The config file should look like [this](https://github.com/jupyter-incubator/sparkmagic/blob/634ee0d356b8e9685fe006739b7543149cfef374/sparkmagic/example_config.json).
53 | * Install papermill from source after adding the sparkmagic kernels. Clone the papermill project from [here](https://github.com/nteract/papermill).
54 | * Update the [translators file](https://github.com/nteract/papermill/blob/master/papermill/translators.py) to add the sparkmagic kernels at the very end of the file:
55 |
56 | ```python
57 | papermill_translators.register("sparkkernel", ScalaTranslator)
58 | papermill_translators.register("pysparkkernel", PythonTranslator)
59 | papermill_translators.register("sparkrkernel", RTranslator)
60 | ```
61 |
62 | * Next, install schema-repo. Install instructions [here](https://github.com/schema-repo/schema-repo).
63 |
64 |
65 | ## More details
66 |
67 | Find more details in the [guide](guide.md).
68 |
69 | That should be it. Enjoy profiling!
70 |
--------------------------------------------------------------------------------
/data-validator.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Data Profiler and Schema Validation"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Profiles given input data based on the custom queries you provide, and validates its schema against schema repository. \n",
15 | "You can find how to insert a schema to schema-repository in README.md"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": null,
21 | "metadata": {},
22 | "outputs": [],
23 | "source": [
24 | "%%help"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "## Spark job configuration parameters like memory and cores may vary from one job to other"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "%%configure -f\n",
41 | "{\"name\":\"data-profiler\", \n",
42 | " \"executorMemory\": \"2GB\", \n",
43 | " \"executorCores\": 4, \n",
44 | " \"conf\": {\"spark.jars.packages\": \"com.databricks:spark-avro_2.11:4.0.0,com.github.gphat:censorinus_2.11:2.1.13\"} \n",
45 | "}"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "## Set parameters that will be overwritten by values passed externally"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": null,
58 | "metadata": {
59 | "tags": [
60 | "parameters"
61 | ]
62 | },
63 | "outputs": [],
64 | "source": [
65 | "val dataFormat = \"data-format\"\n",
66 | "val delimiter = \"\"\n",
67 | "val inputDataLocation = \"input-data-location\"\n",
68 | "val appName = \"app-name\" \n",
69 | "val schemaRepoUrl = \"schema-repo-url\"\n",
70 | "val scheRepoSubjectName = \"subject-name\"\n",
71 | "val schemaVersionId = \"schema-version\"\n",
72 | "val customQ1 = \"custom-query-1\"\n",
73 | "val customQ1ResultThreshold = 0\n",
74 | "val customQ1Operator = \"custom-operator-1\"\n",
75 | "val customQ2 = \"custom-query-2\"\n",
76 | "val customQ2ResultThreshold = 0\n",
77 | "val customQ2Operator = \"custom-operator-2\"\n",
78 | "val customQ3 = \"custom-query-3\"\n",
79 | "val customQ3ResultThreshold = 0\n",
80 | "val customQ3Operator = \"custom-query-3\""
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "## Setup datadog statsd interface"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "import github.gphat.censorinus.DogStatsDClient\n",
97 | "\n",
98 | "val statsd = new DogStatsDClient(hostname = \"localhost\", port = 8125, prefix = \"mlp.validator\")"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "## Read data, if data being read is CSV, it needs to have a header"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "val df = dataFormat match {\n",
115 | " case \"parquet\" => spark.read.parquet(inputDataLocation)\n",
116 | " case \"json\" => spark.read.json(inputDataLocation)\n",
117 | " case \"csv\" => spark.read.option(\"mode\", \"DROPMALFORMED\").option(\"header\", \"true\").option(\"delimiter\", delimiter).csv(inputDataLocation)\n",
118 | " case _ => throw new Exception(s\"$dataFormat, as a dataformat is not supported \")\n",
119 | "}"
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "### Publish some basic stats about the data. This can be extended further"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {},
133 | "outputs": [],
134 | "source": [
135 | "val recordCount = df.count()\n",
136 | "val numColumns = df.columns.size\n",
137 | "statsd.histogram(name = \"recordCount\", value = recordCount, tags = Seq(s\"appName:$appName\", \"data-validation\", \"env:dev\"));\n",
138 | "statsd.histogram(name = \"numColumns\", value = numColumns, tags = Seq(s\"appName:$appName\", \"data-validation\",\"env:dev\"));"
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "## Read registered schema from schema repository"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "### Utility method to call rest endpoint for schema"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": null,
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "import java.io.IOException;\n",
162 | "\n",
163 | "import org.apache.http.HttpEntity;\n",
164 | "import org.apache.http.HttpResponse;\n",
165 | "import org.apache.http.client.ClientProtocolException;\n",
166 | "import org.apache.http.client.ResponseHandler;\n",
167 | "import org.apache.http.client.methods.HttpGet;\n",
168 | "import org.apache.http.impl.client.CloseableHttpClient;\n",
169 | "import org.apache.http.impl.client.HttpClients;\n",
170 | "import org.apache.http.util.EntityUtils;\n",
171 | "\n",
172 | "def getSchema(url: String) : String = {\n",
173 | " val httpclient: CloseableHttpClient = HttpClients.createDefault()\n",
174 | " try {\n",
175 | " val httpget: HttpGet = new HttpGet(url)\n",
176 | " println(\"Executing request \" + httpget.getRequestLine)\n",
177 | " val responseHandler: ResponseHandler[String] =\n",
178 | " new ResponseHandler[String]() {\n",
179 | " override def handleResponse(response: HttpResponse): String = {\n",
180 | " var status: Int = response.getStatusLine.getStatusCode\n",
181 | " if (status >= 200 && status < 300) {\n",
182 | " var entity: HttpEntity = response.getEntity\n",
183 | " if (entity != null) EntityUtils.toString(entity) else null\n",
184 | " } else {\n",
185 | " throw new ClientProtocolException(\n",
186 | " \"Unexpected response status: \" + status);\n",
187 | " }\n",
188 | " }\n",
189 | " }\n",
190 | " httpclient.execute(httpget, responseHandler) \n",
191 | " } finally {\n",
192 | " httpclient.close()\n",
193 | " None\n",
194 | " }\n",
195 | "}"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "#### Create url from input parameters and feth schema for specified version"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": null,
208 | "metadata": {},
209 | "outputs": [],
210 | "source": [
211 | "val schema_url = s\"$schemaRepoUrl/schema-repo/$scheRepoSubjectName/id/$schemaVersionId\"\n",
212 | "val publishedSchema = getSchema(schema_url) "
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "### Convert Avro schema registered to Spark SQL Schema."
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": null,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": [
228 | "import com.databricks.spark.avro._\n",
229 | "import org.apache.avro.Schema.Parser\n",
230 | "val schema = new Parser().parse(publishedSchema)\n",
231 | "\n",
232 | "import com.databricks.spark.avro.SchemaConverters\n",
233 | "val structSchema = SchemaConverters.toSqlType(schema).dataType"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | "### Utility method to traverse schema tree and find the leaf node names"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": null,
246 | "metadata": {},
247 | "outputs": [],
248 | "source": [
249 | "import scala.collection.mutable.ListBuffer\n",
250 | "import org.apache.spark.sql.types._\n",
251 | "\n",
252 | "def findFields(path: String, dt: DataType, columnNames: ListBuffer[String]): Unit = dt match {\n",
253 | " case s: StructType =>\n",
254 | " s.fields.foreach(f => findFields(path + \".\" + f.name, f.dataType, columnNames))\n",
255 | " case s: ArrayType => findFields(path, s.elementType, columnNames)\n",
256 | " case other =>\n",
257 | " columnNames += path.substring(1)\n",
258 | "}"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {},
265 | "outputs": [],
266 | "source": [
267 | "var dfColumnNames = new ListBuffer[String]()\n",
268 | "findFields(\"\", df.schema, dfColumnNames)\n",
269 | "\n",
270 | "print(dfColumnNames.toList)"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": null,
276 | "metadata": {},
277 | "outputs": [],
278 | "source": [
279 | "var publishedSchemaDataColumnNames = new ListBuffer[String]()\n",
280 | "findFields(\"\", structSchema, publishedSchemaDataColumnNames)\n",
281 | "\n",
282 | "print(publishedSchemaDataColumnNames.toList)"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": null,
288 | "metadata": {},
289 | "outputs": [],
290 | "source": [
291 | "val sourceColumns = dfColumnNames.toSet\n",
292 | "val publishedColumns = publishedSchemaDataColumnNames.toSet\n",
293 | "val differenceColumns = publishedColumns.diff(sourceColumns)\n",
294 | "val numDiffColumns = differenceColumns.size\n",
295 | "print(s\"Number of columns not matching the schema are: $numDiffColumns\")\n",
296 | "statsd.histogram(name = \"numDiffColumns\", value = numDiffColumns, tags = Seq(s\"appName:$appName\", \"data-validation\", \"env:dev\"));"
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "### Custom data quality checks"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {},
309 | "source": [
310 | "#### Utility function to assert results"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": null,
316 | "metadata": {},
317 | "outputs": [],
318 | "source": [
319 | "def customCheck(val1 : Long, operator : String, threshold : Long) : Unit = {\n",
320 | " operator match {\n",
321 | " case \">\" => try { assert(val1 > threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
322 | " case \">=\" => try { assert(val1 >= threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
323 | " case \"=\" => try { assert(val1 == threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
324 | " case \"<\" => try { assert(val1 < threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
325 | " case \"<=\" => try { assert(val1 <= threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
326 | " }\n",
327 | "}"
328 | ]
329 | },
330 | {
331 | "cell_type": "markdown",
332 | "metadata": {},
333 | "source": [
334 | "#### Create a temporary table, make sure that sql statements return a Long value, to be sure cast results to Long in the queries"
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": null,
340 | "metadata": {},
341 | "outputs": [],
342 | "source": [
343 | "df.createOrReplaceTempView(\"dataset\")\n",
344 | "\n",
345 | "val res1 = spark.sql(customQ1).collect().toList(0).getAs[Long](0)\n",
346 | "customCheck(res1, customQ1Operator, customQ1ResultThreshold)\n",
347 | "\n",
348 | "val res2 = spark.sql(customQ2).collect().toList(0).getAs[Long](0)\n",
349 | "customCheck(res2, customQ2Operator, customQ2ResultThreshold)\n",
350 | "\n",
351 | "val res3 = spark.sql(customQ3).collect().toList(0).getAs[Long](0)\n",
352 | "customCheck(res3, customQ3Operator, customQ3ResultThreshold)"
353 | ]
354 | }
355 | ],
356 | "metadata": {
357 | "celltoolbar": "Tags",
358 | "kernelspec": {
359 | "display_name": "Spark",
360 | "language": "",
361 | "name": "sparkkernel"
362 | },
363 | "language_info": {
364 | "codemirror_mode": "text/x-scala",
365 | "mimetype": "text/x-scala",
366 | "name": "scala",
367 | "pygments_lexer": "scala"
368 | }
369 | },
370 | "nbformat": 4,
371 | "nbformat_minor": 2
372 | }
373 |
--------------------------------------------------------------------------------
/flow.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Nordstrom/bigdata-profiler/b71ee276834f0b9c89274828406f9082316ac889/flow.jpeg
--------------------------------------------------------------------------------
/generated.json:
--------------------------------------------------------------------------------
1 | {"_id":"5ca04d4f4979775a094fbd80","index":0,"guid":"dd55c788-b2c9-4821-a736-cf1923fe9361","isActive":true,"balance":"$3,002.49","picture":"http://placehold.it/32x32","age":30,"eyeColor":"brown","name":"Elizabeth Schwartz","gender":"female","company":"OPTICOM","email":"elizabethschwartz@opticom.com","phone":"+1 (878) 403-3098","address":"217 Hull Street, Orason, Connecticut, 353","about":"Ea eu irure qui sint culpa quis reprehenderit irure. Commodo anim labore aliqua in proident deserunt non duis reprehenderit mollit dolore minim incididunt minim. Incididunt laborum non ut Lorem velit qui proident do aliquip. Dolore tempor cupidatat amet eiusmod cillum. Aliquip Lorem ullamco aute duis nisi consectetur duis ut in aute proident amet.\r\n","registered":"2019-02-14T09:48:05 +08:00","latitude":25.91036,"longitude":87.342076,"tags":["cillum","elit","dolore","magna","occaecat","id","elit"],"friends":[{"id":0,"name":"Baker Calhoun"},{"id":1,"name":"Alvarez Saunders"},{"id":2,"name":"Stokes Sykes"}],"greeting":"Hello, Elizabeth Schwartz! You have 2 unread messages.","favoriteFruit":"banana"}
2 | {"_id":"5ca04d4fc6617f8da118ac7a","index":1,"guid":"b732a21f-0068-4e30-8b63-b8fe63ad8189","isActive":false,"balance":"$1,517.65","picture":"http://placehold.it/32x32","age":39,"eyeColor":"green","name":"Kelly Mayo","gender":"female","company":"ZOGAK","email":"kellymayo@zogak.com","phone":"+1 (827) 540-3091","address":"374 Brevoort Place, Hailesboro, Wyoming, 8101","about":"Ut labore tempor aute ad ex. Consectetur culpa ea aliqua officia occaecat magna. Magna consequat id veniam sint ipsum Lorem ad ea minim sit eu. Ut deserunt reprehenderit enim nostrud culpa ut enim do culpa. Excepteur incididunt cupidatat fugiat irure tempor mollit ex labore.\r\n","registered":"2016-10-20T02:50:58 +07:00","latitude":19.097854,"longitude":-39.512998,"tags":["proident","et","cupidatat","magna","reprehenderit","laborum","adipisicing"],"friends":[{"id":0,"name":"Magdalena Carey"},{"id":1,"name":"Dorothy Brooks"},{"id":2,"name":"Olsen Bray"}],"greeting":"Hello, Kelly Mayo! You have 10 unread messages.","favoriteFruit":"banana"}
3 | {"_id":"5ca04d4f7da6e81a95f5e079","index":2,"guid":"cce845f5-a481-46bb-9a4b-9d93d5500dd0","isActive":false,"balance":"$3,625.21","picture":"http://placehold.it/32x32","age":38,"eyeColor":"blue","name":"Adrienne Wiggins","gender":"female","company":"REPETWIRE","email":"adriennewiggins@repetwire.com","phone":"+1 (989) 528-3915","address":"554 Marconi Place, Blue, Montana, 1268","about":"Qui aute consectetur dolor nulla dolore amet aliqua. Enim cupidatat id elit qui occaecat labore qui incididunt aute dolor ut nostrud adipisicing. Sunt tempor magna occaecat do do. Dolor non nisi officia commodo commodo cillum id irure enim nulla esse ea laboris velit.\r\n","registered":"2017-04-27T05:13:26 +07:00","latitude":89.625986,"longitude":-0.0457,"tags":["dolore","esse","sit","tempor","proident","ea","esse"],"friends":[{"id":0,"name":"Boyle Best"},{"id":1,"name":"Mack Duke"},{"id":2,"name":"Christine Andrews"}],"greeting":"Hello, Adrienne Wiggins! You have 9 unread messages.","favoriteFruit":"apple"}
4 | {"_id":"5ca04d4fb6d13c4b07926baf","index":3,"guid":"f3aed5c1-e169-49bf-becb-34b002f10b0f","isActive":true,"balance":"$2,116.56","picture":"http://placehold.it/32x32","age":27,"eyeColor":"green","name":"Tamika Caldwell","gender":"female","company":"MOBILDATA","email":"tamikacaldwell@mobildata.com","phone":"+1 (915) 551-3295","address":"960 Bethel Loop, Fedora, Delaware, 4775","about":"Officia sit aute voluptate dolor eu ex eu ea ullamco pariatur aute sint cupidatat. Nostrud exercitation proident pariatur excepteur. Voluptate est culpa ad irure duis pariatur. Eiusmod ea officia nulla est adipisicing.\r\n","registered":"2015-12-25T09:17:43 +08:00","latitude":-75.736755,"longitude":-66.215539,"tags":["commodo","cillum","labore","reprehenderit","dolor","ad","id"],"friends":[{"id":0,"name":"Espinoza Simon"},{"id":1,"name":"Haley Larsen"},{"id":2,"name":"Dejesus Talley"}],"greeting":"Hello, Tamika Caldwell! You have 9 unread messages.","favoriteFruit":"apple"}
5 | {"_id":"5ca04d4f3149e7ec744fc589","index":4,"guid":"b1f1e9a7-8753-4501-91ce-9a40dea7dffa","isActive":true,"balance":"$1,632.84","picture":"http://placehold.it/32x32","age":29,"eyeColor":"green","name":"Wilma Blanchard","gender":"female","company":"ORBIFLEX","email":"wilmablanchard@orbiflex.com","phone":"+1 (821) 591-2303","address":"647 Cornelia Street, Cobbtown, Vermont, 2761","about":"Excepteur proident eiusmod esse quis elit tempor. Eu non exercitation commodo culpa deserunt sint. Cillum culpa sint quis do. In aliqua irure esse veniam deserunt esse quis et elit amet Lorem laboris incididunt.\r\n","registered":"2016-06-26T08:54:31 +07:00","latitude":4.000836,"longitude":134.6375,"tags":["mollit","irure","ullamco","amet","anim","Lorem","officia"],"friends":[{"id":0,"name":"Rosa Nelson"},{"id":1,"name":"Tessa Wiley"},{"id":2,"name":"Greene Singleton"}],"greeting":"Hello, Wilma Blanchard! You have 5 unread messages.","favoriteFruit":"banana"}
6 | {"_id":"5ca04d4fb32986a6d2ce581c","index":5,"guid":"d4152b51-2cef-4072-bb08-78810cb51103","isActive":true,"balance":"$2,731.33","picture":"http://placehold.it/32x32","age":29,"eyeColor":"green","name":"Nellie Porter","gender":"female","company":"PROWASTE","email":"nellieporter@prowaste.com","phone":"+1 (887) 594-3793","address":"407 Miller Place, Dixie, Mississippi, 9318","about":"Sit non exercitation qui laboris commodo magna et pariatur. Pariatur ad Lorem nulla quis enim deserunt do tempor esse est nulla. Non velit adipisicing duis consectetur laborum labore officia elit.\r\n","registered":"2019-01-22T07:27:52 +08:00","latitude":-67.975616,"longitude":53.033207,"tags":["ad","mollit","esse","culpa","labore","magna","Lorem"],"friends":[{"id":0,"name":"Debbie Kent"},{"id":1,"name":"Silva Pratt"},{"id":2,"name":"Thomas Mcdowell"}],"greeting":"Hello, Nellie Porter! You have 8 unread messages.","favoriteFruit":"strawberry"}
7 | {"_id":"5ca04d4fda2b23edd55d8825","index":6,"guid":"3db08907-8cef-44c9-b462-c938cdcb1e09","isActive":true,"balance":"$1,550.55","picture":"http://placehold.it/32x32","age":30,"eyeColor":"green","name":"Rae Shepherd","gender":"female","company":"DIGIGEN","email":"raeshepherd@digigen.com","phone":"+1 (979) 401-3385","address":"731 Sharon Street, Loma, Washington, 459","about":"Adipisicing laboris laboris consectetur cillum commodo. Lorem adipisicing laborum proident sint proident laborum quis fugiat fugiat in cupidatat veniam aliqua proident. Magna aliquip labore amet irure culpa in est tempor in. Pariatur ipsum adipisicing exercitation nisi quis incididunt.\r\n","registered":"2014-01-31T04:47:37 +08:00","latitude":-39.929515,"longitude":-79.826403,"tags":["est","fugiat","elit","dolore","occaecat","mollit","reprehenderit"],"friends":[{"id":0,"name":"Fleming Mcgowan"},{"id":1,"name":"Cervantes Hess"},{"id":2,"name":"Matilda Holmes"}],"greeting":"Hello, Rae Shepherd! You have 10 unread messages.","favoriteFruit":"apple"}
--------------------------------------------------------------------------------
/guide.md:
--------------------------------------------------------------------------------
1 | ## What's the purpose of this documentation?
2 |
3 | This document walks through how the profiler profiles the data.
4 |
5 | ### What's expected from the end user?
6 |
7 | The end user is expected to pass in a configuration that drives the utility.
8 |
9 | The information is provided in the form of JSON when the program runs. The JSON passed in is then translated to language-specific variables.
10 |
11 | An example of how to run this program by passing the JSON configuration can be seen [here](https://github.com/Nordstrom/bigdata-profiler#run-instructions).
12 |
13 |
14 | #### Configuration description
15 |
16 | ```bash
17 | "dataFormat":"Whats the format of the data ? Currently supported formats include JSON, CSV and PARQUET"
18 | "inputDataLocation":"s3 or hdfs location of data"
19 | "appName":"Meaningful name to this application"
20 | "schemaRepoUrl":"host name of schema repository"
21 | "scheRepoSubjectName":"name of the subject for which data is being validated"
22 | "schemaVersionId":"numerical version of the schema"
23 | "customQ1":"custom sql, make sure that this returns Long value"
24 | "customQ1ResultThreshold": 0
25 | "customQ1Operator":" = | > | < | <= | >= ",
26 | "customQ2":"custom sql, make sure that this returns Long value",
27 | "customQ2ResultThreshold": 0,
28 | "customQ2Operator":" = | > | < | <= | >= ",
29 | "customQ3":"custom sql, make sure that this returns Long value",
30 | "customQ3ResultThreshold": 0,
31 | "customQ3Operator":" = | > | < | <= | >= "
32 | ```
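
To make the keys above concrete, here is a small Python sketch that assembles a configuration and serializes it to the single JSON string expected by the runner; the values mirror the example in the README's run instructions and are illustrative only.

```python
import json

# Example configuration; keys follow the description above,
# values are taken from the README example and are illustrative.
config = {
    "dataFormat": "json",
    "inputDataLocation": "s3a://bucket/prefix/generated.json",
    "appName": "cust-profile-data-validation",
    "schemaRepoUrl": "http://schemarepohostaddress",
    "scheRepoSubjectName": "cust-profile",
    "schemaVersionId": "0",
    "customQ1": "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset",
    "customQ1ResultThreshold": 0,
    "customQ1Operator": "=",
    "customQ2": "select CAST(length(phone) as Long) from dataset",
    "customQ2ResultThreshold": 17,
    "customQ2Operator": "=",
    "customQ3": "select CAST(count(distinct gender) as Long) from dataset",
    "customQ3ResultThreshold": 3,
    "customQ3Operator": "<=",
}

# The resulting string is passed as the third argument to papermill_notebook_runner.py.
print(json.dumps(config))
```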
33 |
34 | ### What exactly does the profiler do?
35 |
36 | 
37 |
38 | * First we launch the Jupyter notebook, which is Scala based.
39 | * We configure the notebook to submit a Spark job to an external Spark cluster. This is where we set details like the number of cores required, memory, and dependencies such as `spark-avro` and the `datadog-statsd` client. You should be able to use any `spark-submit` configuration here.
40 | * Next we define default values for the changing parameters. These values will be updated by `papermill` every time the notebook is run, based on what parameters are passed in using the `papermill` script.
41 | * Now we will initialize the `datadog` statsd client and forward metrics to the local statsd port.
42 | * We will now read the data that needs to be profiled using Apache Spark. We will select read options based on what the data format is.
43 | * Once we have read the data in as a `dataframe`, we will report some basic stats about the dataframe to `datadog`.
44 | * Now we will query the `schema repository` and fetch the registered schema. The schema is defined using the `Avro` format.
45 | * This `Avro` schema will be converted to a `spark sql` schema.
46 | * We will be inferring the `spark sql` schema from the incoming data and then `comparing` the registered schema with the inferred schema from the data.
47 | * We will publish the number of matches and mismatches to datadog.
48 | * Next, we are going to perform `custom data quality` checks based on `sql` statements fired on the dataset.
49 | * We will assert that the results of the `sql` statements meet the thresholds set (a small post-run check on the generated report is sketched below).
50 |
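Because the executed output notebook doubles as the autogenerated report, a quick post-run check is to inspect the `papermill` metadata it records (the `duration` and `exception` fields can be seen in `output/data-validator.ipynb`). A minimal sketch, assuming that metadata layout and an illustrative report path:

```python
import json

# Illustrative path; point this at the executed notebook produced by the runner.
report_path = "output/data-validator.ipynb"

with open(report_path) as fh:
    report = json.load(fh)

# papermill records run-level metadata in the executed notebook.
papermill_meta = report["metadata"]["papermill"]
if papermill_meta.get("exception"):
    raise SystemExit(f"Validation run failed: {papermill_meta['exception']}")
print(f"Validation run completed in {papermill_meta['duration']:.1f}s")
```
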
--------------------------------------------------------------------------------
/output/data-validator.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "papermill": {
7 | "duration": 0.016849,
8 | "end_time": "2019-03-31T06:34:01.534057",
9 | "exception": false,
10 | "start_time": "2019-03-31T06:34:01.517208",
11 | "status": "completed"
12 | },
13 | "tags": []
14 | },
15 | "source": [
16 | "# Data Profiler and Schema Validation"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "papermill": {
23 | "duration": 0.021525,
24 | "end_time": "2019-03-31T06:34:01.574574",
25 | "exception": false,
26 | "start_time": "2019-03-31T06:34:01.553049",
27 | "status": "completed"
28 | },
29 | "tags": []
30 | },
31 | "source": [
32 | "Profiles given input data based on the custom queries you provide, and validates its schema against schema repository. \n",
33 | "You can find how to insert a schema to schema-repository in README.md"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 1,
39 | "metadata": {
40 | "papermill": {
41 | "duration": 0.057776,
42 | "end_time": "2019-03-31T06:34:01.650639",
43 | "exception": false,
44 | "start_time": "2019-03-31T06:34:01.592863",
45 | "status": "completed"
46 | },
47 | "tags": []
48 | },
49 | "outputs": [
50 | {
51 | "data": {
52 | "text/html": [
53 | "\n",
54 | "
\n",
55 | " \n",
56 | " Magic | \n",
57 | " Example | \n",
58 | " Explanation | \n",
59 | "
\n",
60 | " \n",
61 | " info | \n",
62 | " %%info | \n",
63 | " Outputs session information for the current Livy endpoint. | \n",
64 | "
\n",
65 | " \n",
66 | " cleanup | \n",
67 | " %%cleanup -f | \n",
68 | " Deletes all sessions for the current Livy endpoint, including this notebook's session. The force flag is mandatory. | \n",
69 | "
\n",
70 | " \n",
71 | " delete | \n",
72 | " %%delete -f -s 0 | \n",
73 | " Deletes a session by number for the current Livy endpoint. Cannot delete this kernel's session. | \n",
74 | "
\n",
75 | " \n",
76 | " logs | \n",
77 | " %%logs | \n",
78 | " Outputs the current session's Livy logs. | \n",
79 | "
\n",
80 | " \n",
81 | " configure | \n",
82 | " %%configure -f {\"executorMemory\": \"1000M\", \"executorCores\": 4} | \n",
83 | " Configure the session creation parameters. The force flag is mandatory if a session has already been\n",
84 | " created and the session will be dropped and recreated. Look at \n",
85 | " Livy's POST /sessions Request Body for a list of valid parameters. Parameters must be passed in as a JSON string. | \n",
86 | "
\n",
87 | " \n",
88 | " spark | \n",
89 | " %%spark -o df df = spark.read.parquet('... | \n",
90 | " Executes spark commands.\n",
91 | " Parameters:\n",
92 | " \n",
93 | " - -o VAR_NAME: The Spark dataframe of name VAR_NAME will be available in the %%local Python context as a\n",
94 | " Pandas dataframe with the same name.
\n",
95 | " - -m METHOD: Sample method, either take or sample.
\n",
96 | " - -n MAXROWS: The maximum number of rows of a dataframe that will be pulled from Livy to Jupyter.\n",
97 | " If this number is negative, then the number of rows will be unlimited.
\n",
98 | " - -r FRACTION: Fraction used for sampling.
\n",
99 | " \n",
100 | " | \n",
101 | "
\n",
102 | " \n",
103 | " sql | \n",
104 | " %%sql -o tables -q SHOW TABLES | \n",
105 | " Executes a SQL query against the variable sqlContext (Spark v1.x) or spark (Spark v2.x).\n",
106 | " Parameters:\n",
107 | " \n",
108 | " - -o VAR_NAME: The result of the SQL query will be available in the %%local Python context as a\n",
109 | " Pandas dataframe.
\n",
110 | " - -q: The magic will return None instead of the dataframe (no visualization).
\n",
111 | " - -m, -n, -r are the same as the %%spark parameters above.
\n",
112 | " \n",
113 | " | \n",
114 | "
\n",
115 | " \n",
116 | " local | \n",
117 | " %%local a = 1 | \n",
118 | " All the code in subsequent lines will be executed locally. Code must be valid Python code. | \n",
119 | "
\n",
120 | "
\n"
121 | ],
122 | "text/plain": [
123 | ""
124 | ]
125 | },
126 | "metadata": {},
127 | "output_type": "display_data"
128 | }
129 | ],
130 | "source": [
131 | "%%help"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {
137 | "papermill": {
138 | "duration": 0.026591,
139 | "end_time": "2019-03-31T06:34:01.702115",
140 | "exception": false,
141 | "start_time": "2019-03-31T06:34:01.675524",
142 | "status": "completed"
143 | },
144 | "tags": []
145 | },
146 | "source": [
147 | "## Spark job configuration parameters like memory and cores may vary from one job to other"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 2,
153 | "metadata": {
154 | "papermill": {
155 | "duration": 1.946674,
156 | "end_time": "2019-03-31T06:34:03.674598",
157 | "exception": false,
158 | "start_time": "2019-03-31T06:34:01.727924",
159 | "status": "completed"
160 | },
161 | "tags": []
162 | },
163 | "outputs": [
164 | {
165 | "data": {
166 | "text/html": [
167 | "Current session configs: {'name': 'data-profiler', 'executorMemory': '2GB', 'executorCores': 4, 'conf': {'spark.jars.packages': 'com.databricks:spark-avro_2.11:4.0.0,com.github.gphat:censorinus_2.11:2.1.13'}, 'kind': 'spark'}
"
168 | ],
169 | "text/plain": [
170 | ""
171 | ]
172 | },
173 | "metadata": {},
174 | "output_type": "display_data"
175 | },
176 | {
177 | "data": {
178 | "text/html": [
179 | "No active sessions."
180 | ],
181 | "text/plain": [
182 | ""
183 | ]
184 | },
185 | "metadata": {},
186 | "output_type": "display_data"
187 | }
188 | ],
189 | "source": [
190 | "%%configure -f\n",
191 | "{\"name\":\"data-profiler\", \n",
192 | " \"executorMemory\": \"2GB\", \n",
193 | " \"executorCores\": 4, \n",
194 | " \"conf\": {\"spark.jars.packages\": \"com.databricks:spark-avro_2.11:4.0.0,com.github.gphat:censorinus_2.11:2.1.13\"} \n",
195 | "}"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {
201 | "papermill": {
202 | "duration": 0.028603,
203 | "end_time": "2019-03-31T06:34:03.721268",
204 | "exception": false,
205 | "start_time": "2019-03-31T06:34:03.692665",
206 | "status": "completed"
207 | },
208 | "tags": []
209 | },
210 | "source": [
211 | "## Set parameters that will be overwritten by values passed externally"
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 3,
217 | "metadata": {
218 | "papermill": {
219 | "duration": 35.798513,
220 | "end_time": "2019-03-31T06:34:39.550115",
221 | "exception": false,
222 | "start_time": "2019-03-31T06:34:03.751602",
223 | "status": "completed"
224 | },
225 | "tags": [
226 | "parameters"
227 | ]
228 | },
229 | "outputs": [
230 | {
231 | "name": "stdout",
232 | "output_type": "stream",
233 | "text": [
234 | "Starting Spark application\n"
235 | ]
236 | },
237 | {
238 | "data": {
239 | "text/html": [
240 | "\n",
241 | "ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
---|
278 | application_1550710474644_0438 | spark | idle | Link | Link | ✔ |
"
242 | ],
243 | "text/plain": [
244 | ""
245 | ]
246 | },
247 | "metadata": {},
248 | "output_type": "display_data"
249 | },
250 | {
251 | "name": "stdout",
252 | "output_type": "stream",
253 | "text": [
254 | "SparkSession available as 'spark'.\n"
255 | ]
256 | },
257 | {
258 | "name": "stdout",
259 | "output_type": "stream",
260 | "text": [
261 | "dataFormat: String = data-format\n",
262 | "delimiter: String = \"\"\n",
263 | "inputDataLocation: String = input-data-location\n",
264 | "appName: String = app-name\n",
265 | "schemaRepoUrl: String = schema-repo-url\n",
266 | "scheRepoSubjectName: String = subject-name\n",
267 | "schemaVersionId: String = schema-version\n",
268 | "customQ1: String = custom-query-1\n",
269 | "customQ1ResultThreshold: Int = 0\n",
270 | "customQ1Operator: String = custom-operator-1\n",
271 | "customQ2: String = custom-query-2\n",
272 | "customQ2ResultThreshold: Int = 0\n",
273 | "customQ2Operator: String = custom-operator-2\n",
274 | "customQ3: String = custom-query-3\n",
275 | "customQ3ResultThreshold: Int = 0\n",
276 | "customQ3Operator: String = custom-query-3\n"
277 | ]
278 | }
279 | ],
280 | "source": [
281 | "val dataFormat = \"data-format\"\n",
282 | "val delimiter = \"\"\n",
283 | "val inputDataLocation = \"input-data-location\"\n",
284 | "val appName = \"app-name\" \n",
285 | "val schemaRepoUrl = \"schema-repo-url\"\n",
286 | "val scheRepoSubjectName = \"subject-name\"\n",
287 | "val schemaVersionId = \"schema-version\"\n",
288 | "val customQ1 = \"custom-query-1\"\n",
289 | "val customQ1ResultThreshold = 0\n",
290 | "val customQ1Operator = \"custom-operator-1\"\n",
291 | "val customQ2 = \"custom-query-2\"\n",
292 | "val customQ2ResultThreshold = 0\n",
293 | "val customQ2Operator = \"custom-operator-2\"\n",
294 | "val customQ3 = \"custom-query-3\"\n",
295 | "val customQ3ResultThreshold = 0\n",
296 | "val customQ3Operator = \"custom-query-3\""
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 4,
302 | "metadata": {
303 | "papermill": {
304 | "duration": 3.268919,
305 | "end_time": "2019-03-31T06:34:42.838363",
306 | "exception": false,
307 | "start_time": "2019-03-31T06:34:39.569444",
308 | "status": "completed"
309 | },
310 | "tags": [
311 | "injected-parameters"
312 | ]
313 | },
314 | "outputs": [
315 | {
316 | "name": "stdout",
317 | "output_type": "stream",
318 | "text": [
319 | "dataFormat: String = json\n",
320 | "inputDataLocation: String = s3a://bucket/prefix/generated.json\n",
321 | "appName: String = cust-profile-data-validation\n",
322 | "schemaRepoUrl: String = http://schemarepohostaddress\n",
323 | "scheRepoSubjectName: String = cust-profile\n",
324 | "schemaVersionId: String = 0\n",
325 | "customQ1: String = select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset\n",
326 | "customQ1ResultThreshold: Int = 0\n",
327 | "customQ1Operator: String = =\n",
328 | "customQ2: String = select CAST(length(phone) as Long) from dataset\n",
329 | "customQ2ResultThreshold: Int = 17\n",
330 | "customQ2Operator: String = =\n",
331 | "customQ3: String = select CAST(count(distinct gender) as Long) from dataset\n",
332 | "customQ3ResultThreshold: Int = 3\n",
333 | "customQ3Operator: String = <=\n"
334 | ]
335 | }
336 | ],
337 | "source": [
338 | "// Parameters\n",
339 | "val dataFormat = \"json\"\n",
340 | "val inputDataLocation = \"s3a://bucket/prefix/generated.json\"\n",
341 | "val appName = \"cust-profile-data-validation\"\n",
342 | "val schemaRepoUrl = \"http://schemarepohostaddress\"\n",
343 | "val scheRepoSubjectName = \"cust-profile\"\n",
344 | "val schemaVersionId = \"0\"\n",
345 | "val customQ1 = \"select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset\"\n",
346 | "val customQ1ResultThreshold = 0\n",
347 | "val customQ1Operator = \"=\"\n",
348 | "val customQ2 = \"select CAST(length(phone) as Long) from dataset\"\n",
349 | "val customQ2ResultThreshold = 17\n",
350 | "val customQ2Operator = \"=\"\n",
351 | "val customQ3 = \"select CAST(count(distinct gender) as Long) from dataset\"\n",
352 | "val customQ3ResultThreshold = 3\n",
353 | "val customQ3Operator = \"<=\"\n"
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {
359 | "papermill": {
360 | "duration": 0.018232,
361 | "end_time": "2019-03-31T06:34:42.874800",
362 | "exception": false,
363 | "start_time": "2019-03-31T06:34:42.856568",
364 | "status": "completed"
365 | },
366 | "tags": []
367 | },
368 | "source": [
369 | "## Setup datadog statsd interface"
370 | ]
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": 5,
375 | "metadata": {
376 | "papermill": {
377 | "duration": 1.540569,
378 | "end_time": "2019-03-31T06:34:44.433086",
379 | "exception": false,
380 | "start_time": "2019-03-31T06:34:42.892517",
381 | "status": "completed"
382 | },
383 | "tags": []
384 | },
385 | "outputs": [
386 | {
387 | "name": "stdout",
388 | "output_type": "stream",
389 | "text": [
390 | "import github.gphat.censorinus.DogStatsDClient\n",
391 | "statsd: github.gphat.censorinus.DogStatsDClient = github.gphat.censorinus.DogStatsDClient@34f583f5\n"
392 | ]
393 | }
394 | ],
395 | "source": [
396 | "import github.gphat.censorinus.DogStatsDClient\n",
397 | "\n",
398 | "val statsd = new DogStatsDClient(hostname = \"localhost\", port = 8125, prefix = \"mlp.validator\")"
399 | ]
400 | },
401 | {
402 | "cell_type": "markdown",
403 | "metadata": {
404 | "papermill": {
405 | "duration": 0.018379,
406 | "end_time": "2019-03-31T06:34:44.470167",
407 | "exception": false,
408 | "start_time": "2019-03-31T06:34:44.451788",
409 | "status": "completed"
410 | },
411 | "tags": []
412 | },
413 | "source": [
414 | "## Read data, if data being read is CSV, it needs to have a header"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 6,
420 | "metadata": {
421 | "papermill": {
422 | "duration": 9.78802,
423 | "end_time": "2019-03-31T06:34:54.277048",
424 | "exception": false,
425 | "start_time": "2019-03-31T06:34:44.489028",
426 | "status": "completed"
427 | },
428 | "tags": []
429 | },
430 | "outputs": [
431 | {
432 | "name": "stdout",
433 | "output_type": "stream",
434 | "text": [
435 | "df: org.apache.spark.sql.DataFrame = [_id: string, about: string ... 20 more fields]\n"
436 | ]
437 | }
438 | ],
439 | "source": [
440 | "val df = dataFormat match {\n",
441 | " case \"parquet\" => spark.read.parquet(inputDataLocation)\n",
442 | " case \"json\" => spark.read.json(inputDataLocation)\n",
443 | " case \"csv\" => spark.read.option(\"mode\", \"DROPMALFORMED\").option(\"header\", \"true\").option(\"delimiter\", delimiter).csv(inputDataLocation)\n",
444 | " case _ => throw new Exception(s\"$dataFormat, as a dataformat is not supported \")\n",
445 | "}"
446 | ]
447 | },
448 | {
449 | "cell_type": "markdown",
450 | "metadata": {
451 | "papermill": {
452 | "duration": 0.018455,
453 | "end_time": "2019-03-31T06:34:54.314874",
454 | "exception": false,
455 | "start_time": "2019-03-31T06:34:54.296419",
456 | "status": "completed"
457 | },
458 | "tags": []
459 | },
460 | "source": [
461 | "### Publish some basic stats about the data. This can be extended further"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": 7,
467 | "metadata": {
468 | "papermill": {
469 | "duration": 3.322344,
470 | "end_time": "2019-03-31T06:34:57.656056",
471 | "exception": false,
472 | "start_time": "2019-03-31T06:34:54.333712",
473 | "status": "completed"
474 | },
475 | "tags": []
476 | },
477 | "outputs": [
478 | {
479 | "name": "stdout",
480 | "output_type": "stream",
481 | "text": [
482 | "recordCount: Long = 7\n",
483 | "numColumns: Int = 22\n"
484 | ]
485 | }
486 | ],
487 | "source": [
488 | "val recordCount = df.count()\n",
489 | "val numColumns = df.columns.size\n",
490 | "statsd.histogram(name = \"recordCount\", value = recordCount, tags = Seq(s\"appName:$appName\", \"data-validation\", \"env:dev\"));\n",
491 | "statsd.histogram(name = \"numColumns\", value = numColumns, tags = Seq(s\"appName:$appName\", \"data-validation\",\"env:dev\"));"
492 | ]
493 | },
494 | {
495 | "cell_type": "markdown",
496 | "metadata": {
497 | "papermill": {
498 | "duration": 0.017359,
499 | "end_time": "2019-03-31T06:34:57.691555",
500 | "exception": false,
501 | "start_time": "2019-03-31T06:34:57.674196",
502 | "status": "completed"
503 | },
504 | "tags": []
505 | },
506 | "source": [
507 | "## Read registered schema from schema repository"
508 | ]
509 | },
510 | {
511 | "cell_type": "markdown",
512 | "metadata": {
513 | "papermill": {
514 | "duration": 0.023315,
515 | "end_time": "2019-03-31T06:34:57.733195",
516 | "exception": false,
517 | "start_time": "2019-03-31T06:34:57.709880",
518 | "status": "completed"
519 | },
520 | "tags": []
521 | },
522 | "source": [
523 | "### Utility method to call rest endpoint for schema"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": 8,
529 | "metadata": {
530 | "papermill": {
531 | "duration": 2.334079,
532 | "end_time": "2019-03-31T06:35:00.086641",
533 | "exception": false,
534 | "start_time": "2019-03-31T06:34:57.752562",
535 | "status": "completed"
536 | },
537 | "tags": []
538 | },
539 | "outputs": [
540 | {
541 | "name": "stdout",
542 | "output_type": "stream",
543 | "text": [
544 | "import java.io.IOException\n",
545 | "import org.apache.http.HttpEntity\n",
546 | "import org.apache.http.HttpResponse\n",
547 | "import org.apache.http.client.ClientProtocolException\n",
548 | "import org.apache.http.client.ResponseHandler\n",
549 | "import org.apache.http.client.methods.HttpGet\n",
550 | "import org.apache.http.impl.client.CloseableHttpClient\n",
551 | "import org.apache.http.impl.client.HttpClients\n",
552 | "import org.apache.http.util.EntityUtils\n",
553 | "getSchema: (url: String)String\n"
554 | ]
555 | }
556 | ],
557 | "source": [
558 | "import java.io.IOException;\n",
559 | "\n",
560 | "import org.apache.http.HttpEntity;\n",
561 | "import org.apache.http.HttpResponse;\n",
562 | "import org.apache.http.client.ClientProtocolException;\n",
563 | "import org.apache.http.client.ResponseHandler;\n",
564 | "import org.apache.http.client.methods.HttpGet;\n",
565 | "import org.apache.http.impl.client.CloseableHttpClient;\n",
566 | "import org.apache.http.impl.client.HttpClients;\n",
567 | "import org.apache.http.util.EntityUtils;\n",
568 | "\n",
569 | "def getSchema(url: String) : String = {\n",
570 | " val httpclient: CloseableHttpClient = HttpClients.createDefault()\n",
571 | " try {\n",
572 | " val httpget: HttpGet = new HttpGet(url)\n",
573 | " println(\"Executing request \" + httpget.getRequestLine)\n",
574 | " val responseHandler: ResponseHandler[String] =\n",
575 | " new ResponseHandler[String]() {\n",
576 | " override def handleResponse(response: HttpResponse): String = {\n",
577 | " var status: Int = response.getStatusLine.getStatusCode\n",
578 | " if (status >= 200 && status < 300) {\n",
579 | " var entity: HttpEntity = response.getEntity\n",
580 | " if (entity != null) EntityUtils.toString(entity) else null\n",
581 | " } else {\n",
582 | " throw new ClientProtocolException(\n",
583 | " \"Unexpected response status: \" + status);\n",
584 | " }\n",
585 | " }\n",
586 | " }\n",
587 | " httpclient.execute(httpget, responseHandler) \n",
588 | " } finally {\n",
589 | " httpclient.close()\n",
590 | " None\n",
591 | " }\n",
592 | "}"
593 | ]
594 | },
595 | {
596 | "cell_type": "markdown",
597 | "metadata": {
598 | "papermill": {
599 | "duration": 0.019669,
600 | "end_time": "2019-03-31T06:35:00.126482",
601 | "exception": false,
602 | "start_time": "2019-03-31T06:35:00.106813",
603 | "status": "completed"
604 | },
605 | "tags": []
606 | },
607 | "source": [
608 | "#### Create url from input parameters and feth schema for specified version"
609 | ]
610 | },
611 | {
612 | "cell_type": "code",
613 | "execution_count": 9,
614 | "metadata": {
615 | "papermill": {
616 | "duration": 2.256894,
617 | "end_time": "2019-03-31T06:35:02.401766",
618 | "exception": false,
619 | "start_time": "2019-03-31T06:35:00.144872",
620 | "status": "completed"
621 | },
622 | "tags": []
623 | },
624 | "outputs": [
625 | {
626 | "name": "stdout",
627 | "output_type": "stream",
628 | "text": [
629 | "schema_url: String = http://schemarepohostaddress/schema-repo/cust-profile/id/0\n",
630 | "Executing request GET http://schemarepohostaddress/schema-repo/cust-profile/id/0 HTTP/1.1\n",
631 | "publishedSchema: String =\n",
632 | "{\n",
633 | " \"type\" : \"record\",\n",
634 | " \"name\" : \"MyClass\",\n",
635 | " \"namespace\" : \"com.test.avro\",\n",
636 | " \"fields\" : [ {\n",
637 | " \"name\" : \"_id\",\n",
638 | " \"type\" : \"string\"\n",
639 | " }, {\n",
640 | " \"name\" : \"index\",\n",
641 | " \"type\" : \"long\"\n",
642 | " }, {\n",
643 | " \"name\" : \"guid\",\n",
644 | " \"type\" : \"string\"\n",
645 | " }, {\n",
646 | " \"name\" : \"isActive\",\n",
647 | " \"type\" : \"boolean\"\n",
648 | " }, {\n",
649 | " \"name\" : \"balance\",\n",
650 | " \"type\" : \"string\"\n",
651 | " }, {\n",
652 | " \"name\" : \"picture\",\n",
653 | " \"type\" : \"string\"\n",
654 | " }, {\n",
655 | " \"name\" : \"age\",\n",
656 | " \"type\" : \"long\"\n",
657 | " }, {\n",
658 | " \"name\" : \"eyeColor\",\n",
659 | " \"type\" : \"string\"\n",
660 | " }, {\n",
661 | " \"name\" : \"name\",\n",
662 | " \"type\" : \"string\"\n",
663 | " }, {\n",
664 | " \"name\" : \"gender\",\n",
665 | " \"type\" : \"string\"\n",
666 | " }, {\n",
667 | " \"name\" : \"company\",\n",
668 | " \"type\" : \"string\"\n",
669 | " }, {\n",
670 | " \"name\" : \"email\",\n",
671 | " \"type\" : \"string\"\n",
672 | " }, {\n",
673 | " \"name\" : \"phone\",\n",
674 | " \"type\" : \"string\"\n",
675 | " }, {\n",
676 | " \"name..."
677 | ]
678 | }
679 | ],
680 | "source": [
681 | "val schema_url = s\"$schemaRepoUrl/schema-repo/$scheRepoSubjectName/id/$schemaVersionId\"\n",
682 | "val publishedSchema = getSchema(schema_url) "
683 | ]
684 | },
685 | {
686 | "cell_type": "markdown",
687 | "metadata": {
688 | "papermill": {
689 | "duration": 0.019668,
690 | "end_time": "2019-03-31T06:35:02.441447",
691 | "exception": false,
692 | "start_time": "2019-03-31T06:35:02.421779",
693 | "status": "completed"
694 | },
695 | "tags": []
696 | },
697 | "source": [
698 | "### Convert Avro schema registered to Spark SQL Schema."
699 | ]
700 | },
701 | {
702 | "cell_type": "code",
703 | "execution_count": 10,
704 | "metadata": {
705 | "papermill": {
706 | "duration": 2.227334,
707 | "end_time": "2019-03-31T06:35:04.687955",
708 | "exception": false,
709 | "start_time": "2019-03-31T06:35:02.460621",
710 | "status": "completed"
711 | },
712 | "tags": []
713 | },
714 | "outputs": [
715 | {
716 | "name": "stdout",
717 | "output_type": "stream",
718 | "text": [
719 | "import com.databricks.spark.avro._\n",
720 | "import org.apache.avro.Schema.Parser\n",
721 | "schema: org.apache.avro.Schema = {\"type\":\"record\",\"name\":\"MyClass\",\"namespace\":\"com.test.avro\",\"fields\":[{\"name\":\"_id\",\"type\":\"string\"},{\"name\":\"index\",\"type\":\"long\"},{\"name\":\"guid\",\"type\":\"string\"},{\"name\":\"isActive\",\"type\":\"boolean\"},{\"name\":\"balance\",\"type\":\"string\"},{\"name\":\"picture\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"long\"},{\"name\":\"eyeColor\",\"type\":\"string\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"gender\",\"type\":\"string\"},{\"name\":\"company\",\"type\":\"string\"},{\"name\":\"email\",\"type\":\"string\"},{\"name\":\"phone\",\"type\":\"string\"},{\"name\":\"address\",\"type\":\"string\"},{\"name\":\"about\",\"type\":\"string\"},{\"name\":\"registered\",\"type\":\"string\"},{\"name\":\"latitude\",\"type\":\"double\"},{\"name\":\"longitude\",\"type\":\"double\"},{\"name\":\"tags\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},{\"name\":\"friends\",\"type...import com.databricks.spark.avro.SchemaConverters\n",
722 | "structSchema: org.apache.spark.sql.types.DataType = StructType(StructField(_id,StringType,false), StructField(index,LongType,false), StructField(guid,StringType,false), StructField(isActive,BooleanType,false), StructField(balance,StringType,false), StructField(picture,StringType,false), StructField(age,LongType,false), StructField(eyeColor,StringType,false), StructField(name,StringType,false), StructField(gender,StringType,false), StructField(company,StringType,false), StructField(email,StringType,false), StructField(phone,StringType,false), StructField(address,StringType,false), StructField(about,StringType,false), StructField(registered,StringType,false), StructField(latitude,DoubleType,false), StructField(longitude,DoubleType,false), StructField(tags,ArrayType(StringType,false),false..."
723 | ]
724 | }
725 | ],
726 | "source": [
727 | "import com.databricks.spark.avro._\n",
728 | "import org.apache.avro.Schema.Parser\n",
729 | "val schema = new Parser().parse(publishedSchema)\n",
730 | "\n",
731 | "import com.databricks.spark.avro.SchemaConverters\n",
732 | "val structSchema = SchemaConverters.toSqlType(schema).dataType"
733 | ]
734 | },
735 | {
736 | "cell_type": "markdown",
737 | "metadata": {
738 | "papermill": {
739 | "duration": 0.019599,
740 | "end_time": "2019-03-31T06:35:04.727145",
741 | "exception": false,
742 | "start_time": "2019-03-31T06:35:04.707546",
743 | "status": "completed"
744 | },
745 | "tags": []
746 | },
747 | "source": [
748 | "### Utility method to traverse schema tree and find the leaf node names"
749 | ]
750 | },
751 | {
752 | "cell_type": "code",
753 | "execution_count": 11,
754 | "metadata": {
755 | "papermill": {
756 | "duration": 2.286688,
757 | "end_time": "2019-03-31T06:35:07.033193",
758 | "exception": false,
759 | "start_time": "2019-03-31T06:35:04.746505",
760 | "status": "completed"
761 | },
762 | "tags": []
763 | },
764 | "outputs": [
765 | {
766 | "name": "stdout",
767 | "output_type": "stream",
768 | "text": [
769 | "import scala.collection.mutable.ListBuffer\n",
770 | "import org.apache.spark.sql.types._\n",
771 | "findFields: (path: String, dt: org.apache.spark.sql.types.DataType, columnNames: scala.collection.mutable.ListBuffer[String])Unit\n"
772 | ]
773 | }
774 | ],
775 | "source": [
776 | "import scala.collection.mutable.ListBuffer\n",
777 | "import org.apache.spark.sql.types._\n",
778 | "\n",
779 | "def findFields(path: String, dt: DataType, columnNames: ListBuffer[String]): Unit = dt match {\n",
780 | " case s: StructType =>\n",
781 | " s.fields.foreach(f => findFields(path + \".\" + f.name, f.dataType, columnNames))\n",
782 | " case s: ArrayType => findFields(path, s.elementType, columnNames)\n",
783 | " case other =>\n",
784 | " columnNames += path.substring(1)\n",
785 | "}"
786 | ]
787 | },
788 | {
789 | "cell_type": "code",
790 | "execution_count": 12,
791 | "metadata": {
792 | "papermill": {
793 | "duration": 2.299086,
794 | "end_time": "2019-03-31T06:35:09.350469",
795 | "exception": false,
796 | "start_time": "2019-03-31T06:35:07.051383",
797 | "status": "completed"
798 | },
799 | "tags": []
800 | },
801 | "outputs": [
802 | {
803 | "name": "stdout",
804 | "output_type": "stream",
805 | "text": [
806 | "dfColumnNames: scala.collection.mutable.ListBuffer[String] = ListBuffer()\n",
807 | "List(_id, about, address, age, balance, company, email, eyeColor, favoriteFruit, friends.id, friends.name, gender, greeting, guid, index, isActive, latitude, longitude, name, phone, picture, registered, tags)"
808 | ]
809 | }
810 | ],
811 | "source": [
812 | "var dfColumnNames = new ListBuffer[String]()\n",
813 | "findFields(\"\", df.schema, dfColumnNames)\n",
814 | "\n",
815 | "print(dfColumnNames.toList)"
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": 13,
821 | "metadata": {
822 | "papermill": {
823 | "duration": 2.312576,
824 | "end_time": "2019-03-31T06:35:11.687080",
825 | "exception": false,
826 | "start_time": "2019-03-31T06:35:09.374504",
827 | "status": "completed"
828 | },
829 | "tags": []
830 | },
831 | "outputs": [
832 | {
833 | "name": "stdout",
834 | "output_type": "stream",
835 | "text": [
836 | "publishedSchemaDataColumnNames: scala.collection.mutable.ListBuffer[String] = ListBuffer()\n",
837 | "List(_id, index, guid, isActive, balance, picture, age, eyeColor, name, gender, company, email, phone, address, about, registered, latitude, longitude, tags, friends.id, friends.name, greeting, favoriteFruit)"
838 | ]
839 | }
840 | ],
841 | "source": [
842 | "var publishedSchemaDataColumnNames = new ListBuffer[String]()\n",
843 | "findFields(\"\", structSchema, publishedSchemaDataColumnNames)\n",
844 | "\n",
845 | "print(publishedSchemaDataColumnNames.toList)"
846 | ]
847 | },
848 | {
849 | "cell_type": "code",
850 | "execution_count": 14,
851 | "metadata": {
852 | "papermill": {
853 | "duration": 3.295563,
854 | "end_time": "2019-03-31T06:35:15.004001",
855 | "exception": false,
856 | "start_time": "2019-03-31T06:35:11.708438",
857 | "status": "completed"
858 | },
859 | "tags": []
860 | },
861 | "outputs": [
862 | {
863 | "name": "stdout",
864 | "output_type": "stream",
865 | "text": [
866 | "sourceColumns: scala.collection.immutable.Set[String] = Set(friends.id, registered, name, latitude, email, guid, _id, tags, balance, age, longitude, company, favoriteFruit, friends.name, isActive, greeting, address, picture, about, eyeColor, phone, index, gender)\n",
867 | "publishedColumns: scala.collection.immutable.Set[String] = Set(friends.id, registered, name, latitude, email, guid, _id, tags, balance, age, longitude, company, favoriteFruit, friends.name, isActive, greeting, address, picture, about, eyeColor, phone, index, gender)\n",
868 | "differenceColumns: scala.collection.immutable.Set[String] = Set()\n",
869 | "numDiffColumns: Int = 0\n",
870 | "Number of columns not matching the schema are: 0"
871 | ]
872 | }
873 | ],
874 | "source": [
875 | "val sourceColumns = dfColumnNames.toSet\n",
876 | "val publishedColumns = publishedSchemaDataColumnNames.toSet\n",
877 | "val differenceColumns = publishedColumns.diff(sourceColumns)\n",
878 | "val numDiffColumns = differenceColumns.size\n",
879 | "print(s\"Number of columns not matching the schema are: $numDiffColumns\")\n",
880 | "statsd.histogram(name = \"numDiffColumns\", value = numDiffColumns, tags = Seq(s\"appName:$appName\", \"data-validation\", \"env:dev\"));"
881 | ]
882 | },
883 | {
884 | "cell_type": "markdown",
885 | "metadata": {
886 | "papermill": {
887 | "duration": 0.020864,
888 | "end_time": "2019-03-31T06:35:15.046792",
889 | "exception": false,
890 | "start_time": "2019-03-31T06:35:15.025928",
891 | "status": "completed"
892 | },
893 | "tags": []
894 | },
895 | "source": [
896 | "### Custom data quality checks"
897 | ]
898 | },
899 | {
900 | "cell_type": "markdown",
901 | "metadata": {
902 | "papermill": {
903 | "duration": 0.029276,
904 | "end_time": "2019-03-31T06:35:15.096832",
905 | "exception": false,
906 | "start_time": "2019-03-31T06:35:15.067556",
907 | "status": "completed"
908 | },
909 | "tags": []
910 | },
911 | "source": [
912 | "#### Utility function to assert results"
913 | ]
914 | },
915 | {
916 | "cell_type": "code",
917 | "execution_count": 15,
918 | "metadata": {
919 | "papermill": {
920 | "duration": 1.565888,
921 | "end_time": "2019-03-31T06:35:16.684387",
922 | "exception": false,
923 | "start_time": "2019-03-31T06:35:15.118499",
924 | "status": "completed"
925 | },
926 | "tags": []
927 | },
928 | "outputs": [
929 | {
930 | "name": "stdout",
931 | "output_type": "stream",
932 | "text": [
933 | "customCheck: (val1: Long, operator: String, threshold: Long)Unit\n"
934 | ]
935 | }
936 | ],
937 | "source": [
938 | "def customCheck(val1 : Long, operator : String, threshold : Long) : Unit = {\n",
939 | " operator match {\n",
940 | " case \">\" => try { assert(val1 > threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
941 | " case \">=\" => try { assert(val1 >= threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
942 | " case \"=\" => try { assert(val1 == threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
943 | " case \"<\" => try { assert(val1 < threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
944 | " case \"<=\" => try { assert(val1 <= threshold) } catch { case e: AssertionError => print(e);System.exit(1)}\n",
945 | " }\n",
946 | "}"
947 | ]
948 | },
949 | {
950 | "cell_type": "markdown",
951 | "metadata": {
952 | "papermill": {
953 | "duration": 0.021649,
954 | "end_time": "2019-03-31T06:35:16.727162",
955 | "exception": false,
956 | "start_time": "2019-03-31T06:35:16.705513",
957 | "status": "completed"
958 | },
959 | "tags": []
960 | },
961 | "source": [
962 |     "#### Create a temporary table; the custom SQL checks must return a Long value, so cast query results to Long"
963 | ]
964 | },
965 | {
966 | "cell_type": "code",
967 | "execution_count": 16,
968 | "metadata": {
969 | "papermill": {
970 | "duration": 11.933366,
971 | "end_time": "2019-03-31T06:35:28.681401",
972 | "exception": false,
973 | "start_time": "2019-03-31T06:35:16.748035",
974 | "status": "completed"
975 | },
976 | "tags": []
977 | },
978 | "outputs": [
979 | {
980 | "name": "stdout",
981 | "output_type": "stream",
982 | "text": [
983 | "res1: Long = 0\n",
984 | "res2: Long = 17\n",
985 | "res3: Long = 1\n"
986 | ]
987 | }
988 | ],
989 | "source": [
990 | "df.createOrReplaceTempView(\"dataset\")\n",
991 | "\n",
992 | "val res1 = spark.sql(customQ1).collect().toList(0).getAs[Long](0)\n",
993 | "customCheck(res1, customQ1Operator, customQ1ResultThreshold)\n",
994 | "\n",
995 | "val res2 = spark.sql(customQ2).collect().toList(0).getAs[Long](0)\n",
996 | "customCheck(res2, customQ2Operator, customQ2ResultThreshold)\n",
997 | "\n",
998 | "val res3 = spark.sql(customQ3).collect().toList(0).getAs[Long](0)\n",
999 | "customCheck(res3, customQ3Operator, customQ3ResultThreshold)"
1000 | ]
1001 | }
1002 | ],
1003 | "metadata": {
1004 | "celltoolbar": "Tags",
1005 | "kernelspec": {
1006 | "display_name": "Spark",
1007 | "language": "",
1008 | "name": "sparkkernel"
1009 | },
1010 | "language_info": {
1011 | "codemirror_mode": "text/x-scala",
1012 | "mimetype": "text/x-scala",
1013 | "name": "scala",
1014 | "pygments_lexer": "scala"
1015 | },
1016 | "papermill": {
1017 | "duration": 90.051482,
1018 | "end_time": "2019-03-31T06:35:30.080425",
1019 | "environment_variables": {},
1020 | "exception": null,
1021 | "input_path": "data-validator.ipynb",
1022 | "output_path": "output/data-validator.ipynb",
1023 | "parameters": {
1024 | "appName": "cust-profile-data-validation",
1025 | "customQ1": "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset",
1026 | "customQ1Operator": "=",
1027 | "customQ1ResultThreshold": 0,
1028 | "customQ2": "select CAST(length(phone) as Long) from dataset",
1029 | "customQ2Operator": "=",
1030 | "customQ2ResultThreshold": 17,
1031 | "customQ3": "select CAST(count(distinct gender) as Long) from dataset",
1032 | "customQ3Operator": "<=",
1033 | "customQ3ResultThreshold": 3,
1034 | "dataFormat": "json",
1035 | "inputDataLocation": "s3a://ml-sort-summarize-clickstream-parquet/data-validation-test-dataset/generated.json",
1036 | "scheRepoSubjectName": "cust-profile",
1037 | "schemaRepoUrl": "http://schemarepohostaddress",
1038 | "schemaVersionId": "0"
1039 | },
1040 | "start_time": "2019-03-31T06:34:00.028943",
1041 | "version": "0+untagged.5.gec97e17"
1042 | }
1043 | },
1044 | "nbformat": 4,
1045 | "nbformat_minor": 2
1046 | }
--------------------------------------------------------------------------------
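A note on the `customCheck` helper defined in the notebook above: it matches only the five operators `>`, `>=`, `=`, `<` and `<=`, so an unrecognized operator string in the JSON configuration would surface as a `scala.MatchError` rather than a clean failure. A minimal sketch of a hardened variant (hypothetical, not part of the repository; `customCheckSafe` is an illustrative name) could look like this:

```scala
// Sketch only: a variant of the notebook's customCheck with an explicit
// default case, so an unknown operator fails with a clear message instead
// of a scala.MatchError.
def customCheckSafe(value: Long, operator: String, threshold: Long): Unit = {
  val passed: Boolean = operator match {
    case ">"  => value > threshold
    case ">=" => value >= threshold
    case "="  => value == threshold
    case "<"  => value < threshold
    case "<=" => value <= threshold
    case other => sys.error(s"Unsupported operator '$other'") // fail fast on bad config
  }
  if (!passed) {
    // Mirror the notebook's behaviour: report the failure and fail the papermill run.
    print(s"Check failed: $value $operator $threshold")
    sys.exit(1)
  }
}
```

The explicit default case keeps a misconfigured `customQ*Operator` from looking like a Spark error, while the final `sys.exit(1)` mirrors the notebook's behaviour of failing the run when a check does not pass.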
/papermill_notebook_runner.py:
--------------------------------------------------------------------------------
1 | import papermill as pm
2 | import sys
3 | import json
4 |
5 | print("Entering python papermill runner script !!")
6 | print("arguments are: {}, {}, {}".format(sys.argv[1], sys.argv[2], sys.argv[3]))
7 | # Check if parameters exist
8 |
9 | if sys.argv[3]:
10 | pm.execute_notebook(
11 | sys.argv[1],
12 | sys.argv[2],
13 | parameters = json.loads(sys.argv[3])
14 | )
15 | else:
16 | pm.execute_notebook(
17 | sys.argv[1],
18 | sys.argv[2]
19 | )
--------------------------------------------------------------------------------
/schema.avsc:
--------------------------------------------------------------------------------
1 | {
2 | "type" : "record",
3 | "name" : "CustProfile",
4 | "namespace" : "com.nordstrom.cust.profile",
5 | "fields" : [ {
6 | "name" : "_id",
7 | "type" : "string"
8 | }, {
9 | "name" : "index",
10 | "type" : "long"
11 | }, {
12 | "name" : "guid",
13 | "type" : "string"
14 | }, {
15 | "name" : "isActive",
16 | "type" : "boolean"
17 | }, {
18 | "name" : "balance",
19 | "type" : "string"
20 | }, {
21 | "name" : "picture",
22 | "type" : "string"
23 | }, {
24 | "name" : "age",
25 | "type" : "long"
26 | }, {
27 | "name" : "eyeColor",
28 | "type" : "string"
29 | }, {
30 | "name" : "name",
31 | "type" : "string"
32 | }, {
33 | "name" : "gender",
34 | "type" : "string"
35 | }, {
36 | "name" : "company",
37 | "type" : "string"
38 | }, {
39 | "name" : "email",
40 | "type" : "string"
41 | }, {
42 | "name" : "phone",
43 | "type" : "string"
44 | }, {
45 | "name" : "address",
46 | "type" : "string"
47 | }, {
48 | "name" : "about",
49 | "type" : "string"
50 | }, {
51 | "name" : "registered",
52 | "type" : "string"
53 | }, {
54 | "name" : "latitude",
55 | "type" : "double"
56 | }, {
57 | "name" : "longitude",
58 | "type" : "double"
59 | }, {
60 | "name" : "tags",
61 | "type" : {
62 | "type" : "array",
63 | "items" : "string"
64 | }
65 | }, {
66 | "name" : "friends",
67 | "type" : {
68 | "type" : "array",
69 | "items" : {
70 | "type" : "record",
71 | "name" : "friends",
72 | "fields" : [ {
73 | "name" : "id",
74 | "type" : "long"
75 | }, {
76 | "name" : "name",
77 | "type" : "string"
78 | } ]
79 | }
80 | }
81 | }, {
82 | "name" : "greeting",
83 | "type" : "string"
84 | }, {
85 | "name" : "favoriteFruit",
86 | "type" : "string"
87 | } ]
88 | }
--------------------------------------------------------------------------------
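For context, the `structSchema` that the notebook compares against the DataFrame's flattened column names is a Spark `StructType` derived from a registered Avro schema such as the one above. A rough sketch of that conversion, assuming the spark-avro `SchemaConverters` helper is available on the cluster (the exact package has moved between spark-avro releases), might look like this:

```scala
// Sketch only: turning an Avro schema definition (here read from a local
// schema.avsc for illustration) into a Spark StructType for validation.
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

val avroSchemaJson: String = scala.io.Source.fromFile("schema.avsc").mkString
val avroSchema: Schema = new Schema.Parser().parse(avroSchemaJson)

// For a record schema, toSqlType yields a StructType wrapped in a SchemaType.
val structSchema: StructType =
  SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
```

In the actual flow the schema JSON would come from the schema repository endpoint configured via `schemaRepoUrl`, `scheRepoSubjectName` and `schemaVersionId`, rather than from a local file.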