├── Chapter01 ├── Ch-01-IntroductionToSpark.ipynb └── README.md ├── Chapter02 ├── Ch-02-ComplexDataTypesInSpark.ipynb └── README.md ├── Chapter03 ├── Ch-03-DeltaCRUD.ipynb └── README.md ├── Chapter04 ├── Ch-04-DeltaForBatch&Streaming.ipynb └── README.md ├── Chapter05 ├── Ch-05-DataConsolidationInDeltaLake.ipynb └── README.md ├── Chapter06 ├── Ch-06-HandlingCommonDataPatternsWithDelta.ipynb └── README.md ├── Chapter07 ├── Ch-07-DeltaForDataWarehouseUseCases.ipynb └── README.md ├── Chapter08 ├── Ch-08-AtypicalScenarios.ipynb └── README.md ├── Chapter09 ├── Ch-09-DeltaForReproducibleMachineLearning.ipynb └── README.md ├── Chapter10 └── README.md ├── Chapter11 └── README.md ├── Chapter12 ├── Ch-12-DeltaPerformance.ipynb └── README.md ├── Chapter13 ├── Ch-13-MappingYourDeltaJourney.ipynb └── README.md ├── LICENSE └── README.md /Chapter01/Ch-01-IntroductionToSpark.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["# Introduction to Spark\n* Spark is available both as a standalone installation or packaged with other offerings such as Hadoop\n* Please follow the instructios to install it for your environment \n* These examples are run on the Databricks managed platform - Databricks is the original creator of Spark - no extra installation is necessary\n* Some commands like display() are convenience functions and run on Databricks, you can always substitute it with show()\n### Pre-requisites\n* https://spark.apache.org/downloads.html\n* https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm\n* pip install pyspark\n### Example References\n* https://sparkbyexamples.com/pyspark-tutorial/"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"398adf1b-62e9-4356-8fba-f1bfb836c8ff"}}},{"cell_type":"code","source":["# Spark Context allows your Spark Application to access Spark Cluster & is the entry point of all Spark functionality\nspark"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Access Spark and check version","showTitle":true,"inputWidgets":{},"nuid":"4d1d0358-8e4e-43d1-b3ec-aa232a5243c4"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Out[90]:
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Out[90]:
"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"\n
\n

SparkSession - hive

\n \n
\n

SparkContext

\n\n

Spark UI

\n\n
\n
Version
\n
v3.2.1
\n
Master
\n
spark://10.0.0.246:7077
\n
AppName
\n
Databricks Shell
\n
\n
\n \n
\n ","textData":null,"removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"htmlSandbox","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
\n

SparkSession - hive

\n \n
\n

SparkContext

\n\n

Spark UI

\n\n
\n
Version
\n
v3.2.1
\n
Master
\n
spark://10.0.0.246:7077
\n
AppName
\n
Databricks Shell
\n
\n
\n \n
\n "]}}],"execution_count":0},{"cell_type":"markdown","source":["### Spark Data types\n* https://spark.apache.org/docs/latest/sql-ref-datatypes.html"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"77e020e0-cb16-4f5b-a720-254f2b255d87"}}},{"cell_type":"code","source":["#dbutils is a databricks utility function \ndbutils.fs.rm('/tmp/ch1', True)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"cleanup generated data from previous runs ","showTitle":true,"inputWidgets":{},"nuid":"d1ebda2b-0576-4b99-96a2-8c0ad0f9eb37"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Out[91]: True
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Out[91]: True
"]}}],"execution_count":0},{"cell_type":"code","source":["columns = [\"State\",\"Name\", \"Age\"]\ndata = [(\"TX\",\"Jack\", 25), (\"NV\",\"Jane\",66), (\"CO\",\"Bill\",79),(\"CA\",\"Tom\",53), (\"WY\",\"Shawn\",45)]\n\nage_df = spark.sparkContext.parallelize(data).toDF(columns)\nage_df.printSchema()\ndisplay(age_df)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Create sample data using rdd/data frame, check schema ","showTitle":true,"inputWidgets":{},"nuid":"eb5f2ea4-fc9b-484e-9ba4-e24c0d8edee1"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
root\n |-- State: string (nullable = true)\n |-- Name: string (nullable = true)\n |-- Age: long (nullable = true)\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
root\n-- State: string (nullable = true)\n-- Name: string (nullable = true)\n-- Age: long (nullable = true)\n\n
"]}},{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["TX","Jack",25],["NV","Jane",66],["CO","Bill",79],["CA","Tom",53],["WY","Shawn",45]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"State","type":"\"string\"","metadata":"{}"},{"name":"Name","type":"\"string\"","metadata":"{}"},{"name":"Age","type":"\"long\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
StateNameAge
TXJack25
NVJane66
COBill79
CATom53
WYShawn45
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Persist dataframe"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"21f8addd-3a63-4b47-9acf-5bc9d1d328f1"}}},{"cell_type":"code","source":["age_df.write.format('parquet').save('/tmp/ch1/demographic')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Specify file format and data location","showTitle":true,"inputWidgets":{},"nuid":"c91e2e27-890e-4e13-9d09-57ea8c980302"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Create external table"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"cfd22169-50a2-4548-9f40-eaeb616d70fc"}}},{"cell_type":"code","source":["%sql\nDROP DATABASE IF EXISTS ch1 CASCADE;\n\nCREATE DATABASE IF NOT EXISTS ch1;\n\nCREATE TABLE IF NOT EXISTS ch1.demographic\nUSING parquet \nLOCATION '/tmp/ch1/demographic';"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"38883179-9de6-4b82-8a2c-1990ae90d7c2"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Read data from table"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"7ea7a87d-40c3-4c49-aadc-4875bb38d404"}}},{"cell_type":"code","source":["df = spark.read.table('ch1.demographic')\ndisplay(df)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Using Python","showTitle":true,"inputWidgets":{},"nuid":"1258e43d-4c13-4745-8869-0cffefc6674c"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["WY","Shawn",45],["CA","Tom",53],["NV","Jane",66],["TX","Jack",25],["CO","Bill",79]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"State","type":"\"string\"","metadata":"{}"},{"name":"Name","type":"\"string\"","metadata":"{}"},{"name":"Age","type":"\"long\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
StateNameAge
WYShawn45
CATom53
NVJane66
TXJack25
COBill79
"]}}],"execution_count":0},{"cell_type":"code","source":["%sql\nSELECT count(*) from ch1.demographic"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Using SQL","showTitle":true,"inputWidgets":{},"nuid":"f19a25cb-c6d6-4cb3-9f41-11574308bbb2"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[[5]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"count(1)","type":"\"long\"","metadata":"{\"__autoGeneratedAlias\":\"true\"}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
count(1)
5
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Analyze data"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2f8d615a-ce47-4c26-816b-9ed85ecb6e31"}}},{"cell_type":"code","source":["df.describe().show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2b224fc1-5b51-4a85-85e0-a28f211018fa"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-------+-----+----+----------------+\n|summary|State|Name| Age|\n+-------+-----+----+----------------+\n| count| 5| 5| 5|\n| mean| null|null| 53.6|\n| stddev| null|null|20.5621010599598|\n| min| CA|Bill| 25|\n| max| WY| Tom| 79|\n+-------+-----+----+----------------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-------+-----+----+----------------+\nsummary|State|Name| Age|\n+-------+-----+----+----------------+\n count| 5| 5| 5|\n mean| null|null| 53.6|\n stddev| null|null|20.5621010599598|\n min| CA|Bill| 25|\n max| WY| Tom| 79|\n+-------+-----+----+----------------+\n\n
"]}}],"execution_count":0},{"cell_type":"code","source":["display(df.summary())"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1b2fa201-4556-46e4-b73a-7e6822cfaf1e"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["count","5","5","5"],["mean",null,null,"53.6"],["stddev",null,null,"20.5621010599598"],["min","CA","Bill","25"],["25%",null,null,"45"],["50%",null,null,"53"],["75%",null,null,"66"],["max","WY","Tom","79"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"summary","type":"\"string\"","metadata":"{}"},{"name":"State","type":"\"string\"","metadata":"{}"},{"name":"Name","type":"\"string\"","metadata":"{}"},{"name":"Age","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
summaryStateNameAge
count555
meannullnull53.6
stddevnullnull20.5621010599598
minCABill25
25%nullnull45
50%nullnull53
75%nullnull66
maxWYTom79
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Transformations"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"af001120-d111-49ac-8bfa-641cb63c3458"}}},{"cell_type":"code","source":["columns = [\"FullName\", \"SSN\"]\ndata = [(\"Jack\", '011-123-2345'), (\"Jane\",'022-123-2345'), (\"Bill\",'033-123-2345'),(\"Tom\",'044-123-2345'), (\"Shawn\",'055-123-2345')]\n\nidentity_df = spark.sparkContext.parallelize(data).toDF(columns)\nspark.sql(\"DROP TABLE IF EXISTS ch1.identity\")\nidentity_df.write.format('parquet').saveAsTable('ch1.identity')\nidentity_df = spark.sql(\"SELECT * FROM ch1.identity\")\ndisplay(identity_df)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Identity table","showTitle":true,"inputWidgets":{},"nuid":"2083d844-02ef-44b2-834c-f692dc054248"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["Shawn","055-123-2345"],["Bill","033-123-2345"],["Jack","011-123-2345"],["Tom","044-123-2345"],["Jane","022-123-2345"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"FullName","type":"\"string\"","metadata":"{}"},{"name":"SSN","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
FullNameSSN
Shawn055-123-2345
Bill033-123-2345
Jack011-123-2345
Tom044-123-2345
Jane022-123-2345
"]}}],"execution_count":0},{"cell_type":"markdown","source":["#### Filters"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b1264cc0-4c1c-4d05-911a-baa3b2e876e0"}}},{"cell_type":"code","source":["age_df.select(\"Name\").show()\nage_df.filter(age_df.Name.like('J%')).show()\n\nfrom pyspark.sql.functions import *\nage_df.where(col('Name').like('J%')).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bbb67740-e7a6-4dac-8537-200210940c09"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-----+\n| Name|\n+-----+\n| Jack|\n| Jane|\n| Bill|\n| Tom|\n|Shawn|\n+-----+\n\n+-----+----+---+\n|State|Name|Age|\n+-----+----+---+\n| TX|Jack| 25|\n| NV|Jane| 66|\n+-----+----+---+\n\n+-----+----+---+\n|State|Name|Age|\n+-----+----+---+\n| TX|Jack| 25|\n| NV|Jane| 66|\n+-----+----+---+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-----+\n Name|\n+-----+\n Jack|\n Jane|\n Bill|\n Tom|\nShawn|\n+-----+\n\n+-----+----+---+\nState|Name|Age|\n+-----+----+---+\n TX|Jack| 25|\n NV|Jane| 66|\n+-----+----+---+\n\n+-----+----+---+\nState|Name|Age|\n+-----+----+---+\n TX|Jack| 25|\n NV|Jane| 66|\n+-----+----+---+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["#### Add/Drop columns"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"369e56ba-6e0f-45ef-a498-00d3d15ee015"}}},{"cell_type":"code","source":["new_df = age_df.withColumn('newField1', lit('X')).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Use lit to add a constant","showTitle":true,"inputWidgets":{},"nuid":"8413ec77-0ce1-495c-83c0-36852dcef360"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-----+-----+---+---------+\n|State| Name|Age|newField1|\n+-----+-----+---+---------+\n| TX| Jack| 25| X|\n| NV| Jane| 66| X|\n| CO| Bill| 79| X|\n| CA| Tom| 53| X|\n| WY|Shawn| 45| X|\n+-----+-----+---+---------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-----+-----+---+---------+\nState| Name|Age|newField1|\n+-----+-----+---+---------+\n TX| Jack| 25| X|\n NV| Jane| 66| X|\n CO| Bill| 79| X|\n CA| Tom| 53| X|\n WY|Shawn| 45| X|\n+-----+-----+---+---------+\n\n
"]}}],"execution_count":0},{"cell_type":"code","source":["from pyspark.sql.functions import when, col\nnew_df = age_df.withColumn('newField2', when(col('Name') == 'Jane', 'J').otherwise('Other'))\nnew_df.show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Use when & otherwise for conditional","showTitle":true,"inputWidgets":{},"nuid":"482eea69-b1f8-4d65-bfb0-59ac4f384a71"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-----+-----+---+---------+\n|State| Name|Age|newField2|\n+-----+-----+---+---------+\n| TX| Jack| 25| Other|\n| NV| Jane| 66| J|\n| CO| Bill| 79| Other|\n| CA| Tom| 53| Other|\n| WY|Shawn| 45| Other|\n+-----+-----+---+---------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-----+-----+---+---------+\nState| Name|Age|newField2|\n+-----+-----+---+---------+\n TX| Jack| 25| Other|\n NV| Jane| 66| J|\n CO| Bill| 79| Other|\n CA| Tom| 53| Other|\n WY|Shawn| 45| Other|\n+-----+-----+---+---------+\n\n
"]}}],"execution_count":0},{"cell_type":"code","source":["new_df.drop('newField2', 'Age').show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"drop column","showTitle":true,"inputWidgets":{},"nuid":"e2e2bbe2-2a53-46ae-8a39-244ea1fe5f7c"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-----+-----+\n|State| Name|\n+-----+-----+\n| TX| Jack|\n| NV| Jane|\n| CO| Bill|\n| CA| Tom|\n| WY|Shawn|\n+-----+-----+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-----+-----+\nState| Name|\n+-----+-----+\n TX| Jack|\n NV| Jane|\n CO| Bill|\n CA| Tom|\n WY|Shawn|\n+-----+-----+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["#### Aggregates"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5d88a430-55d6-4e03-ac48-f4131f14c1ad"}}},{"cell_type":"code","source":["age_df.select(count(\"State\").alias('NumStates')).show()\nage_df.select(countDistinct(\"Name\", \"Age\")).show()\n\nage_df.select(avg(\"Age\")).show()\nage_df.select(stddev(\"Age\"), sum(\"Age\"), max(\"Age\")).show(truncate=False)\n\nage_df.groupBy('State').max('Age').show()\nage_df.orderBy(\"Name\", ascending=False).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"a911488e-eff9-4adb-93d8-12e462149ef5"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+---------+\n|NumStates|\n+---------+\n| 5|\n+---------+\n\n+-------------------------+\n|count(DISTINCT Name, Age)|\n+-------------------------+\n| 5|\n+-------------------------+\n\n+--------+\n|avg(Age)|\n+--------+\n| 53.6|\n+--------+\n\n+----------------+--------+--------+\n|stddev_samp(Age)|sum(Age)|max(Age)|\n+----------------+--------+--------+\n|20.5621010599598|268 |79 |\n+----------------+--------+--------+\n\n+-----+--------+\n|State|max(Age)|\n+-----+--------+\n| CO| 79|\n| TX| 25|\n| WY| 45|\n| CA| 53|\n| NV| 66|\n+-----+--------+\n\n+-----+-----+---+\n|State| Name|Age|\n+-----+-----+---+\n| CA| Tom| 53|\n| WY|Shawn| 45|\n| NV| Jane| 66|\n| TX| Jack| 25|\n| CO| Bill| 79|\n+-----+-----+---+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+---------+\nNumStates|\n+---------+\n 5|\n+---------+\n\n+-------------------------+\ncount(DISTINCT Name, Age)|\n+-------------------------+\n 5|\n+-------------------------+\n\n+--------+\navg(Age)|\n+--------+\n 53.6|\n+--------+\n\n+----------------+--------+--------+\nstddev_samp(Age)|sum(Age)|max(Age)|\n+----------------+--------+--------+\n20.5621010599598|268 |79 |\n+----------------+--------+--------+\n\n+-----+--------+\nState|max(Age)|\n+-----+--------+\n CO| 79|\n TX| 25|\n WY| 45|\n CA| 53|\n NV| 66|\n+-----+--------+\n\n+-----+-----+---+\nState| Name|Age|\n+-----+-----+---+\n CA| Tom| 53|\n WY|Shawn| 45|\n NV| Jane| 66|\n TX| Jack| 25|\n CO| Bill| 79|\n+-----+-----+---+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["#### Joins"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0829d447-e70c-45e5-ba30-767ea62c69ee"}}},{"cell_type":"code","source":["join_df = age_df.join(identity_df, age_df.Name==identity_df.FullName).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"4db1364e-36d3-4242-a649-eb5077498f18"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-----+-----+---+--------+------------+\n|State| Name|Age|FullName| SSN|\n+-----+-----+---+--------+------------+\n| TX| Jack| 25| Jack|011-123-2345|\n| NV| Jane| 66| Jane|022-123-2345|\n| CO| Bill| 79| Bill|033-123-2345|\n| CA| Tom| 53| Tom|044-123-2345|\n| WY|Shawn| 45| Shawn|055-123-2345|\n+-----+-----+---+--------+------------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-----+-----+---+--------+------------+\nState| Name|Age|FullName| SSN|\n+-----+-----+---+--------+------------+\n TX| Jack| 25| Jack|011-123-2345|\n NV| Jane| 66| Jane|022-123-2345|\n CO| Bill| 79| Bill|033-123-2345|\n CA| Tom| 53| Tom|044-123-2345|\n WY|Shawn| 45| Shawn|055-123-2345|\n+-----+-----+---+--------+------------+\n\n
"]}}],"execution_count":0},{"cell_type":"code","source":[""],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3bbaf830-c206-4e39-9faf-2d54762a0004"}},"outputs":[],"execution_count":0}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"Ch 01 - Introduction to Spark","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2},"language":"python","widgets":{},"notebookOrigID":3533331031122581}},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Chapter01/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter01/README.md -------------------------------------------------------------------------------- /Chapter02/Ch-02-ComplexDataTypesInSpark.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["# Chapter-2"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"33f16623-c2cb-4802-9339-de08d3e418f0"}}},{"cell_type":"code","source":["from pyspark.sql.functions import *\nfrom pyspark.sql.types import *"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"fb87f14a-9b94-47d3-998f-754aedde6a49"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Structured Data\n* Typically csv data that has a given schema and order & is human readable"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"24fee472-e553-48e5-89c9-63e81cdda25d"}}},{"cell_type":"markdown","source":["## Providing Schema"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"d9df2098-d1ec-45bc-aaa7-d118514b1361"}}},{"cell_type":"code","source":["csv_data = [(\"Jim\",\"\",\"Smith\",\"36636\",\"M\",3000),\n (\"Mike\",\"Rose\",\"\",\"40288\",\"M\", 5000),\n (\"Bob\",\"\",\"Williams\",\"42114\",\"M\", 6000),\n (\"Marie\",\"Anne\",\"Jones\",\"39192\",\"F\",7000),\n ]\n\nschema = StructType([ \\\n StructField(\"firstname\",StringType(),True), \\\n StructField(\"middlename\",StringType(),True), \\\n StructField(\"lastname\",StringType(),True), \\\n StructField(\"id\", StringType(), True), \\\n StructField(\"gender\", StringType(), True), \\\n StructField(\"wages\", IntegerType(), True) \\\n ])\n\ndf = spark.createDataFrame(data=csv_data,schema=schema)\ndf.printSchema()\ndf.show(truncate=False)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Providing Schema","showTitle":true,"inputWidgets":{},"nuid":"ab0fb0e6-87d6-4c6f-87e7-24c6cb10ec6c"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
root\n |-- firstname: string (nullable = true)\n |-- middlename: string (nullable = true)\n |-- lastname: string (nullable = true)\n |-- id: string (nullable = true)\n |-- gender: string (nullable = true)\n |-- wages: integer (nullable = true)\n\n+---------+----------+--------+-----+------+-----+\n|firstname|middlename|lastname|id |gender|wages|\n+---------+----------+--------+-----+------+-----+\n|Jim | |Smith |36636|M |3000 |\n|Mike |Rose | |40288|M |5000 |\n|Bob | |Williams|42114|M |6000 |\n|Marie |Anne |Jones |39192|F |7000 |\n+---------+----------+--------+-----+------+-----+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
root\n-- firstname: string (nullable = true)\n-- middlename: string (nullable = true)\n-- lastname: string (nullable = true)\n-- id: string (nullable = true)\n-- gender: string (nullable = true)\n-- wages: integer (nullable = true)\n\n+---------+----------+--------+-----+------+-----+\nfirstname|middlename|lastname|id |gender|wages|\n+---------+----------+--------+-----+------+-----+\nJim | |Smith |36636|M |3000 |\nMike |Rose | |40288|M |5000 |\nBob | |Williams|42114|M |6000 |\nMarie |Anne |Jones |39192|F |7000 |\n+---------+----------+--------+-----+------+-----+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Save\n* overwrite – mode is used to overwrite the existing file.\n* append – To add the data to the existing file\n* ignore – Ignores write operation when the file already exists.\n* error – This is a default option when the file already exists, it returns an error."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1cddfde8-27ed-48c6-9f9f-0ae259b63119"}}},{"cell_type":"code","source":["df.write.mode('overwrite').csv('/tmp/ch2/csv_data')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"9e6440e0-1f66-43db-a497-9b8f706bce6c"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Infer schema"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0161b3bc-ed1b-4820-a97a-285c4853e78d"}}},{"cell_type":"code","source":["df = spark.read.format(\"csv\") \\\n .option(\"header\", False) \\\n .option(\"inferSchema\", True) \\\n .load(\"/tmp/ch2/csv_data\").show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"217b01a7-8044-4414-be53-8870cea55797"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-----+----+--------+-----+---+----+\n| _c0| _c1| _c2| _c3|_c4| _c5|\n+-----+----+--------+-----+---+----+\n|Marie|Anne| Jones|39192| F|7000|\n| Bob|null|Williams|42114| M|6000|\n| Jim|null| Smith|36636| M|3000|\n| Mike|Rose| null|40288| M|5000|\n+-----+----+--------+-----+---+----+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-----+----+--------+-----+---+----+\n _c0| _c1| _c2| _c3|_c4| _c5|\n+-----+----+--------+-----+---+----+\nMarie|Anne| Jones|39192| F|7000|\n Bob|null|Williams|42114| M|6000|\n Jim|null| Smith|36636| M|3000|\n Mike|Rose| null|40288| M|5000|\n+-----+----+--------+-----+---+----+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Read\n* Specify header, delimmiter, inference"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"36a96b92-a2eb-42fa-b57e-9e42bddf7e53"}}},{"cell_type":"code","source":["df_with_schema = spark.read.format(\"csv\") \\\n .option(\"header\", False) \\\n .schema(schema) \\\n .load(\"/tmp/ch2/csv_data\")\n\ndf_with_schema.show()\ndf_with_schema.printSchema()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"043b5ae1-d9a4-42c1-b003-ec8fc7406325"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+---------+----------+--------+-----+------+-----+\n|firstname|middlename|lastname| id|gender|wages|\n+---------+----------+--------+-----+------+-----+\n| Marie| Anne| Jones|39192| F| 7000|\n| Bob| null|Williams|42114| M| 6000|\n| Jim| null| Smith|36636| M| 3000|\n| Mike| Rose| null|40288| M| 5000|\n+---------+----------+--------+-----+------+-----+\n\nroot\n |-- firstname: string (nullable = true)\n |-- middlename: string (nullable = true)\n |-- lastname: string (nullable = true)\n |-- id: string (nullable = true)\n |-- gender: string (nullable = true)\n |-- wages: integer (nullable = true)\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+---------+----------+--------+-----+------+-----+\nfirstname|middlename|lastname| id|gender|wages|\n+---------+----------+--------+-----+------+-----+\n Marie| Anne| Jones|39192| F| 7000|\n Bob| null|Williams|42114| M| 6000|\n Jim| null| Smith|36636| M| 3000|\n Mike| Rose| null|40288| M| 5000|\n+---------+----------+--------+-----+------+-----+\n\nroot\n-- firstname: string (nullable = true)\n-- middlename: string (nullable = true)\n-- lastname: string (nullable = true)\n-- id: string (nullable = true)\n-- gender: string (nullable = true)\n-- wages: integer (nullable = true)\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Providing Nested Schema"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"e2c6f00c-70ea-4cb8-88c2-8f47515d437c"}}},{"cell_type":"code","source":["csv_data = [((\"Jim\",\"\",\"Smith\"),\"36636\",\"M\",3000),\n ((\"Mike\",\"Rose\",\"\"),\"40288\",\"M\", 5000),\n ((\"Bob\",\"\",\"Williams\"),\"42114\",\"M\", 6000),\n ((\"Marie\",\"Anne\",\"Jones\"),\"39192\",\"F\",7000),\n ]\nschema = StructType([ \\\n StructField(\"name\", StructType([ \\\n StructField(\"firstname\",StringType(),True), \\\n StructField(\"middlename\",StringType(),True), \\\n StructField(\"lastname\",StringType(),True)])), \\\n StructField(\"id\", StringType(), True), \\\n StructField(\"gender\", StringType(), True), \\\n StructField(\"wages\", IntegerType(), True) \\\n ])\n\ndf = spark.createDataFrame(data=csv_data,schema=schema)\ndf.printSchema()\ndf.show(truncate=False)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Nested Schema","showTitle":true,"inputWidgets":{},"nuid":"edc99441-5d63-4f8d-b9a5-42d48f0b229a"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
root\n |-- name: struct (nullable = true)\n | |-- firstname: string (nullable = true)\n | |-- middlename: string (nullable = true)\n | |-- lastname: string (nullable = true)\n |-- id: string (nullable = true)\n |-- gender: string (nullable = true)\n |-- wages: integer (nullable = true)\n\n+--------------------+-----+------+-----+\n|name |id |gender|wages|\n+--------------------+-----+------+-----+\n|{Jim, , Smith} |36636|M |3000 |\n|{Mike, Rose, } |40288|M |5000 |\n|{Bob, , Williams} |42114|M |6000 |\n|{Marie, Anne, Jones}|39192|F |7000 |\n+--------------------+-----+------+-----+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
root\n-- name: struct (nullable = true)\n |-- firstname: string (nullable = true)\n |-- middlename: string (nullable = true)\n |-- lastname: string (nullable = true)\n-- id: string (nullable = true)\n-- gender: string (nullable = true)\n-- wages: integer (nullable = true)\n\n+--------------------+-----+------+-----+\nname |id |gender|wages|\n+--------------------+-----+------+-----+\n{Jim, , Smith} |36636|M |3000 |\n{Mike, Rose, } |40288|M |5000 |\n{Bob, , Williams} |42114|M |6000 |\n{Marie, Anne, Jones}|39192|F |7000 |\n+--------------------+-----+------+-----+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Semi-Structured Data\n* Typically xml &. json data\n* Unlike structured data, not all columns/fields are present for each record and the order is less important as the data is self-describing"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"a0be310f-18bd-405a-a750-a3b84b6000e5"}}},{"cell_type":"code","source":["def jsonToDataFrame(json, schema=None):\n reader = spark.read\n if schema:\n reader.schema(schema)\n return reader.json(sc.parallelize([json]))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Utility Function","showTitle":true,"inputWidgets":{},"nuid":"b7684325-1310-411b-b06b-b93859eb8691"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## StructType & StructField"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2c63120f-ef6b-45ff-99f8-ab0a531815d4"}}},{"cell_type":"code","source":["#StructType & StructField\nschema = StructType() \\\n .add(\"Person\", StructType()\n .add(\"Name\", StringType())\n .add(\"Age\", IntegerType()))\n \njson_str = ''' \n{\n \"Person\": {\n \"Name\": \"John Smith\",\n \"Age\": 36\n }\n}\n'''\nevents = jsonToDataFrame(json_str, schema)\n#Individual field access\nevents.select(\"Person.Name\").show()\n#Get all fields \nevents.select(\"Person.*\").show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Reading json string with Nested Data Type as spark data frame","showTitle":true,"inputWidgets":{},"nuid":"ff8c0c0e-545b-4df6-94a7-e3d621220e01"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+----------+\n| Name|\n+----------+\n|John Smith|\n+----------+\n\n+----------+---+\n| Name|Age|\n+----------+---+\n|John Smith| 36|\n+----------+---+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+----------+\n Name|\n+----------+\nJohn Smith|\n+----------+\n\n+----------+---+\n Name|Age|\n+----------+---+\nJohn Smith| 36|\n+----------+---+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Infer Schema"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1f051271-85a5-475d-a19c-11df097ea8ef"}}},{"cell_type":"code","source":["json_str = ''' \n{\n \"Person\": {\n \"Name\": \"John Smith\",\n \"Age\": 36\n }\n}\n'''\n#Note schema is not specified\nevents = jsonToDataFrame(json_str)\n#Individual field access\nevents.select(\"Person.Name\").show()\n#Get all fields from a given node\nevents.select(\"Person.*\").show()\n#Get all fields using alias\nevents.select(struct(\"*\").alias(\"Citizen\")).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Inferring Schema","showTitle":true,"inputWidgets":{},"nuid":"6f1554a0-0ca2-4d44-a568-6c8a304eb5f1"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+----------+\n| Name|\n+----------+\n|John Smith|\n+----------+\n\n+---+----------+\n|Age| Name|\n+---+----------+\n| 36|John Smith|\n+---+----------+\n\n+------------------+\n| Citizen|\n+------------------+\n|{{36, John Smith}}|\n+------------------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+----------+\n Name|\n+----------+\nJohn Smith|\n+----------+\n\n+---+----------+\nAge| Name|\n+---+----------+\n 36|John Smith|\n+---+----------+\n\n+------------------+\n Citizen|\n+------------------+\n{{36, John Smith}}|\n+------------------+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Multi-line"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"609a68c6-f616-419f-89b3-57533b40c35b"}}},{"cell_type":"code","source":["dbutils.fs.rm('/tmp/test_multilie.json', True)\ndbutils.fs.put('/tmp/test_multilie.json',\n '''[{\n \"RecordNumber\": 2,\n \"Zipcode\": 704,\n \"ZipCodeType\": \"STANDARD\",\n \"City\": \"PASEO COSTA DEL SUR\",\n \"State\": \"PR\"\n},\n{\n \"RecordNumber\": 10,\n \"Zipcode\": 709,\n \"ZipCodeType\": \"STANDARD\",\n \"City\": \"BDA SAN LUIS\",\n \"State\": \"PR\"\n}]''')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"c52e5f7d-e893-454d-a5d5-2e4219dc92b3"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Wrote 238 bytes.\nOut[24]: True
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Wrote 238 bytes.\nOut[24]: True
"]}}],"execution_count":0},{"cell_type":"code","source":["events = spark.read.option(\"multiline\", True).json('/tmp/test_multilie.json')\nevents.show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1b2e0b59-4d52-4102-b848-9d35158b5509"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-------------------+------------+-----+-----------+-------+\n| City|RecordNumber|State|ZipCodeType|Zipcode|\n+-------------------+------------+-----+-----------+-------+\n|PASEO COSTA DEL SUR| 2| PR| STANDARD| 704|\n| BDA SAN LUIS| 10| PR| STANDARD| 709|\n+-------------------+------------+-----+-----------+-------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-------------------+------------+-----+-----------+-------+\n City|RecordNumber|State|ZipCodeType|Zipcode|\n+-------------------+------------+-----+-----------+-------+\nPASEO COSTA DEL SUR| 2| PR| STANDARD| 704|\n BDA SAN LUIS| 10| PR| STANDARD| 709|\n+-------------------+------------+-----+-----------+-------+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Arrays"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"818c0f76-8aa5-4693-bfac-cbbdeca1d20a"}}},{"cell_type":"code","source":["#StructType & StructField\nschema = StructType() \\\n .add(\"Person\", StructType()\n .add(\"Name\", ArrayType(StringType()))\n .add(\"Age\", IntegerType()))\n\njson_str = ''' \n{\n \"Person\": {\n \"Name\": [\"John\",\"Smith\"],\n \"Age\": 36\n }\n}\n'''\nevents = jsonToDataFrame(json_str, schema)\n#Individual field access\nevents.select(\"Person.Name\").show()\n#Access Individual elements of the array\nevents.select((col(\"Person.Name\")).getItem(0)).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Lists/Arrays","showTitle":true,"inputWidgets":{},"nuid":"370e8008-c7ec-4e7f-9cdd-df1659508638"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+-------------+\n| Name|\n+-------------+\n|[John, Smith]|\n+-------------+\n\n+--------------+\n|Person.Name[0]|\n+--------------+\n| John|\n+--------------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+-------------+\n Name|\n+-------------+\n[John, Smith]|\n+-------------+\n\n+--------------+\nPerson.Name[0]|\n+--------------+\n John|\n+--------------+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Maps"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"fbe5ce7d-16d1-4324-abe8-7ce5c1b4330d"}}},{"cell_type":"code","source":["#StructType & StructField\nschema = StructType() \\\n .add(\"Person\", StructType()\n .add(\"Name\", MapType(StringType(), StringType()))\n .add(\"Age\", IntegerType()))\n\njson_str = ''' \n{\n \"Person\": {\n \"Name\": {\"John\":\"Smith\"},\n \"Age\": 36\n }\n}\n'''\nevents = jsonToDataFrame(json_str, schema)\n#Individual field access\nevents.select(\"Person.Name\").show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Maps","showTitle":true,"inputWidgets":{},"nuid":"343161c6-c8c7-4cd1-aaf8-af3a52562857"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+---------------+\n| Name|\n+---------------+\n|{John -> Smith}|\n+---------------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+---------------+\n Name|\n+---------------+\n{John -> Smith}|\n+---------------+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## from_json"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"9c443cb1-aacb-4073-8296-9a5284650dfb"}}},{"cell_type":"code","source":["events = jsonToDataFrame(\"\"\"\n{\n \"Person\": \"{\\\\\"Address\\\\\":{\\\\\"Unit\\\\\":12,\\\\\"Location\\\\\":{\\\\\"Street\\\\\":\\\\\"New York\\\\\"}}}\"\n}\n\"\"\")\n \nschema = StructType().add(\"Address\", StructType().add(\"Unit\", IntegerType())\n .add(\"Location\", StringType()))\ndisplay(events.select(from_json(\"Person\", schema).alias(\"Citizen\")))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"from_json to string","showTitle":true,"inputWidgets":{},"nuid":"5bebc5b4-0708-4e30-aed1-461a911ea614"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[[[[12,"{\"Street\":\"New York\"}"]]]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"Citizen","type":"{\"type\":\"struct\",\"fields\":[{\"name\":\"Address\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"Unit\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Location\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}}]}","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
Citizen
List(List(12, {\"Street\":\"New York\"}))
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## to_json"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"42a26f0c-c92f-4754-ac8c-0c6eb8467c57"}}},{"cell_type":"code","source":["events = jsonToDataFrame(\"\"\"\n{\n \"Person\": {\n \"Name\": {\"John\":\"Smith\"},\n \"Age\": 36\n }\n}\n\"\"\")\n \ndisplay(events.select(to_json(\"Person\").alias(\"Citizen\")))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"to_json from string","showTitle":true,"inputWidgets":{},"nuid":"1f967de6-b80b-46b1-a387-8cc78b726ada"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["{\"Age\":36,\"Name\":{\"John\":\"Smith\"}}"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"Citizen","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
Citizen
{\"Age\":36,\"Name\":{\"John\":\"Smith\"}}
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## json_tuple"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"6aa8ceca-678c-4125-9812-e70287ac5bf9"}}},{"cell_type":"code","source":["events = jsonToDataFrame(\"\"\"\n{\n \"Person\": \"{\\\\\"Address\\\\\":{\\\\\"Unit\\\\\":12,\\\\\"Location\\\\\":{\\\\\"Street\\\\\":\\\\\"New York\\\\\"}}}\"\n}\n\"\"\")\n \ndisplay(events.select(json_tuple(\"Person\", \"Address\").alias(\"Address\")))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"json_tuple to extract columns in json data","showTitle":true,"inputWidgets":{},"nuid":"c56f25a2-05f3-43f5-9c3a-5f7720355a92"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["{\"Unit\":12,\"Location\":{\"Street\":\"New York\"}}"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"Address","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
Address
{\"Unit\":12,\"Location\":{\"Street\":\"New York\"}}
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## regexp_extract"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b50b21bc-9b68-42c8-9d50-f8d49839741d"}}},{"cell_type":"code","source":["events = jsonToDataFrame(\"\"\"\n[{ \"Identity\": \"010-22-2345\" }, \n { \"Identity\": \"017-26-8345\" },\n { \"Identity\": \"1-2-3\" }]\n\"\"\")\n \nevents.select(regexp_extract(\"Identity\", \"([0-9]*)-([0-9]*)-([0-9]*)\", 1).alias(\"Identity\")).show()\nevents.select(regexp_extract(\"Identity\", \"([0-9]*)-([0-9]*)-([0-9]*)\",3).alias(\"Identity\")).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"regexp_extract","showTitle":true,"inputWidgets":{},"nuid":"ff6abb9d-0161-434a-b331-bf99c1c6dcd5"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+--------+\n|Identity|\n+--------+\n| 010|\n| 017|\n| 1|\n+--------+\n\n+--------+\n|Identity|\n+--------+\n| 2345|\n| 8345|\n| 3|\n+--------+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+--------+\nIdentity|\n+--------+\n 010|\n 017|\n 1|\n+--------+\n\n+--------+\nIdentity|\n+--------+\n 2345|\n 8345|\n 3|\n+--------+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## aggregation"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"878489ec-851c-4216-b7f9-2aa5039b8145"}}},{"cell_type":"code","source":["events = jsonToDataFrame(\"\"\"\n[{ \"Name\": \"John\", \"Age\": 27 }, \n { \"Name\": \"John\", \"Age\": 52 }]\n\"\"\")\n \ndisplay(events.groupBy(\"Name\").agg(collect_list(\"Age\").alias(\"Ages\")))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Aggregation","showTitle":true,"inputWidgets":{},"nuid":"335790f4-400d-4f18-9fad-32ce4417fd8e"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["John",[27,52]]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"Name","type":"\"string\"","metadata":"{}"},{"name":"Ages","type":"{\"type\":\"array\",\"elementType\":\"long\",\"containsNull\":false}","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
NameAges
JohnList(27, 52)
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## explode"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"aa59f650-b001-4821-9843-817bc49db7be"}}},{"cell_type":"code","source":["events = jsonToDataFrame(\"\"\"\n{\n \"John\" : {\n \"Preferences\": ['Tennis', 'Cricket']\n }\n}\n\"\"\")\n \ndisplay(events.select(explode(\"John.Preferences\").alias(\"taste\")))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Explode","showTitle":true,"inputWidgets":{},"nuid":"de7802b8-9134-4bd0-b3d6-d6f24c7436af"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["Tennis"],["Cricket"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"taste","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
taste
Tennis
Cricket
"]}}],"execution_count":0},{"cell_type":"code","source":[""],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"ba33c9d7-21ee-4b22-b699-dda7a288c249"}},"outputs":[],"execution_count":0}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"Ch-02-ComplexDataTypesInSpark","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2},"language":"python","widgets":{},"notebookOrigID":386413837244752}},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Chapter02/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter02/README.md -------------------------------------------------------------------------------- /Chapter03/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter03/README.md -------------------------------------------------------------------------------- /Chapter04/Ch-04-DeltaForBatch&Streaming.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["https://github.com/delta-io/delta/blob/master/examples/cheat_sheet/delta_lake_cheat_sheet.pdf"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"Delta Cheat Sheet","showTitle":true,"inputWidgets":{},"nuid":"94de84b4-3697-4c96-80ce-46a5a478f41b"}}},{"cell_type":"markdown","source":["# Batch"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5d81e7a2-64ba-48aa-ae79-da20b897d3d5"}}},{"cell_type":"markdown","source":["## Read\n* By File path\n * df = spark.read.format(\"parquet\"|\"csv\"|\"json\"|etc.).load('path to delta table')\n* By Table\n * df = spark.table('delta table name')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"6e8dcdd6-6241-4afb-94ae-2e1537f54846"}}},{"cell_type":"markdown","source":["## Write\n* By File path\n * df.write.format(\"delta\").mode(\"overwrite\"|\"append\").partitionBy('field').save('path to delta table')\n* By Table\n * df.write.format(\"delta\").option('mergeSchema', \"true\").saveAsTable('delta table name')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"d1088737-1241-4502-9998-7331057488ad"}}},{"cell_type":"markdown","source":["# Streaming"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"75e819d0-88bb-4551-8a27-a812af004a21"}}},{"cell_type":"markdown","source":["## Read\n* By File path\n * df = spark.readStream.format(\"parquet\"|\"csv\"|\"json\"|etc.).schema(schema).load('path to delta table')\n* By Table\n * df = spark.readStream.format(\"delta\").table('delta table name')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0224e7bc-7673-4fd6-93a4-05877599c64d"}}},{"cell_type":"markdown","source":["## Write\n\n* By File path\n * df.writeStream.format(\"delta\").outputMode(\"append\"|\"update\"|\"complete\").option(\"checkpointLocation', 'path to chkpoint').trigger(once=True|processingTime=\"x minute\").start('path to delta table')\n* By Table\n * df.writeStream.format(\"delta\").outputMode(\"append\"|\"update\"|\"complete\").option(\"checkpointLocation', 'path to chkpoint').trigger(once=True|processingTime=\"x minute\").table('delta table name')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"ac3c0287-8fcf-488f-b5ee-bc228d482ac5"}}},{"cell_type":"markdown","source":["# Utility Functions"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"ca1b10cc-9bbb-4c93-95a1-392a7c69a2e8"}}},{"cell_type":"code","source":["from pyspark.sql.functions import *\nfrom delta.tables import DeltaTable\n\n# Function to upsert microBatchOutputDF into Delta Lake table using merge\ndef upsertToDelta(microBatchOutputDF, batchId):\n t = deltaTable.alias(\"t\").merge(microBatchOutputDF.alias(\"s\"), \"s.id = t.id\")\\\n .whenMatchedUpdateAll()\\\n .whenNotMatchedInsertAll()\\\n .execute()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b3946186-7724-4fa3-94e4-5ad0d17b62d6"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Streaming Types\n* File Based\n * File lands on disk & is streamed from storage\n* Event Based\n * More real-time and leverages a sstreaming service ssuch as Kafka, Kinesis, Eventhub"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"40ee73f6-a0bb-472b-9c6e-2f4f1135312e"}}},{"cell_type":"code","source":["import shutil\nshutil.rmtree(\"/tmp/ch-4/\", ignore_errors=True)\ndbutils.fs.rm(\"/tmp/ch-4/\", True)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"def3dcc8-0b75-4ce7-9462-1812910cd768"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Out[33]: True
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Out[33]: True
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## File Based"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5c947b1f-f42d-49db-863e-491829e441dc"}}},{"cell_type":"code","source":["import random\n# Create a table(key, value) of some data\ndata = spark.range(8)\ndata = data.withColumn(\"value\", data.id + random.randint(0, 5000))\ndata.write.format(\"delta\").save(\"/tmp/ch-4/delta-table\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"59a6ece4-b1a4-4cc0-996b-94a714e20599"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"code","source":["%sql\nSELECT * from delta.`/tmp/ch-4/delta-table` LIMIT 5"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"c28db90c-3f12-4f5b-9998-1ddaa6ad81a8"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[[0,1601],[2,1603],[3,1604],[4,1605],[1,1602]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"id","type":"\"long\"","metadata":"{}"},{"name":"value","type":"\"long\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
idvalue
01601
21603
31604
41605
11602
"]}}],"execution_count":0},{"cell_type":"code","source":["streamingDf = spark.readStream.format(\"rate\").load()\n#display(streamingDf)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1d8dafc2-d441-4328-a552-3ca2ce0217f2"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"code","source":["# Stream writes to the table\nstream = streamingDf.selectExpr(\"value as id\").writeStream\\\n .format(\"delta\")\\\n .option(\"checkpointLocation\", \"/tmp/ch-4/checkpoint\")\\\n .start(\"/tmp/ch-4/delta-table2\")\nstream.awaitTermination(10)\nstream.stop()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"be516b90-9fb1-463c-abe3-2f8ab9a9d519"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"code","source":["%sql\nSELECT * from delta.`/tmp/ch-4/delta-table2` LIMIT 5"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"10365afe-b45f-406a-8b87-8aaa0f3a69e6"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[[1],[3],[4],[0],[2]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"id","type":"\"long\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
id
1
3
4
0
2
"]}}],"execution_count":0},{"cell_type":"code","source":["# Stream reads from a table\nstream2 = spark.readStream.format(\"delta\").load(\"/tmp/ch-4/delta-table2\")\\\n .writeStream\\\n .format(\"console\")\\\n .start()\nstream2.awaitTermination(10)\nstream2.stop()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bd344a7a-bd1e-40ba-a03f-ab6e475f84dd"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### In-Stream Trasformations"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b1cb4728-06b2-4ef4-841c-2a8477c5c11f"}}},{"cell_type":"code","source":["# In-stream transformations\nstreamingAggregatesDF = spark.readStream.format(\"rate\").load()\\\n .withColumn(\"id\", col(\"value\") % 10)\\\n .drop(\"timestamp\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bfc81d29-c771-4edd-9906-e07226fba4fa"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"code","source":["# Write the output of a streaming aggregation query into Delta Lake table\ndeltaTable = DeltaTable.forPath(spark, \"/tmp/ch-4/delta-table\")\nprint(\"Before\")\ndeltaTable.toDF().show()\n\nstream3 = streamingAggregatesDF.writeStream\\\n .format(\"delta\") \\\n .foreachBatch(upsertToDelta) \\\n .outputMode(\"update\") \\\n .start()\nstream3.awaitTermination(10)\nstream3.stop()\n\nprint(\"After\")\ndeltaTable.toDF().show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5359e3d9-e716-4c23-8010-fa64eb429c91"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Before\n+---+-----+\n| id|value|\n+---+-----+\n| 0| 1601|\n| 6| 1607|\n| 2| 1603|\n| 5| 1606|\n| 3| 1604|\n| 7| 1608|\n| 4| 1605|\n| 1| 1602|\n+---+-----+\n\nAfter\n+---+-----+\n| id|value|\n+---+-----+\n| 0| 0|\n| 1| 1|\n| 2| 2|\n| 3| 3|\n| 4| 4|\n| 5| 5|\n| 6| 6|\n| 7| 7|\n+---+-----+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Before\n+---+-----+\n id|value|\n+---+-----+\n 0| 1601|\n 6| 1607|\n 2| 1603|\n 5| 1606|\n 3| 1604|\n 7| 1608|\n 4| 1605|\n 1| 1602|\n+---+-----+\n\nAfter\n+---+-----+\n id|value|\n+---+-----+\n 0| 0|\n 1| 1|\n 2| 2|\n 3| 3|\n 4| 4|\n 5| 5|\n 6| 6|\n 7| 7|\n+---+-----+\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Delta table as both a streaming source & sink"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"80b5b5b0-f958-45cd-93cf-3cf9ba655e60"}}},{"cell_type":"code","source":["from_tbl = \"/tmp/ch-4/from_delta\"\nto_tbl = \"/tmp/ch-4/to_delta\"\nnumRows = 10\nspark.range(numRows).write.mode(\"overwrite\").format(\"delta\").save(from_tbl)\nspark.read.format(\"delta\").load(from_tbl).show()\nspark.range(numRows, numRows * 10).write.mode(\"overwrite\").format(\"delta\").save(to_tbl)\n\nstream4 = spark.readStream.format(\"delta\").load(to_tbl).writeStream.format(\"delta\")\\\n .option(\"checkpointLocation\", \"/tmp/ch-4/checkpoint/tbl1\") \\\n .outputMode(\"append\") \\\n .start(from_tbl)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f3634f57-a79e-433d-aff8-1e4c17ee6d07"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+---+\n| id|\n+---+\n| 1|\n| 2|\n| 7|\n| 3|\n| 9|\n| 6|\n| 0|\n| 4|\n| 8|\n| 5|\n+---+\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+---+\n id|\n+---+\n 1|\n 2|\n 7|\n 3|\n 9|\n 6|\n 0|\n 4|\n 8|\n 5|\n+---+\n\n
"]}}],"execution_count":0},{"cell_type":"code","source":["# repartition table while streaming job is running\nspark.read.format(\"delta\").load(to_tbl).repartition(10).write\\\n .format(\"delta\")\\\n .mode(\"overwrite\")\\\n .option(\"dataChange\", \"false\")\\\n .save(to_tbl)\n\nstream4.awaitTermination(10)\nstream4.stop()\n#After streaming write \nspark.read.format(\"delta\").load(from_tbl).show(5)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"126806c8-b538-4889-9179-102fd2057d48"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
+---+\n| id|\n+---+\n| 1|\n| 2|\n| 3|\n| 0|\n| 4|\n+---+\nonly showing top 5 rows\n\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
+---+\n id|\n+---+\n 1|\n 2|\n 3|\n 0|\n 4|\n+---+\nonly showing top 5 rows\n\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Event Based"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"58abc65c-7789-4272-bc22-57c7cbd09068"}}},{"cell_type":"markdown","source":["### Kafka\n* Replace host/port & topic name"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1b9cdfa4-18a1-4856-937b-975eb487988e"}}},{"cell_type":"code","source":["# Example of reading from Kafka \n'''\nfrom pyspark.sql.functions import col \nkafkaServer = \"\" # Specify Host & Port & the name of the topic \ntopicName = 'iot-topic' \n\niotData = (spark.readStream # Get the DataStreamReader \n .format(\"kafka\") # Specify the source format as \"kafka\" \n .option(\"kafka.bootstrap.servers\", kafkaServer) # Configure the Kafka server name and port \n .option(\"subscribe\", topicName) # Subscribe to the Kafka topic \n .option(\"startingOffsets\", \"latest\") # stream to latest when we restart notebook \n .option(\"maxOffsetsPerTrigger\", 1000) # Throttle Kafka's processing of the streams \n .load() \n .repartition(8) \n .select(col(\"value\").cast(\"STRING\")) \n) \n'''"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"ccf616c3-0684-4c85-a853-8a7fd1817c46"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Out[42]: '\\nfrom pyspark.sql.functions import col \\nkafkaServer = "<host:port>" # Specify Host & Port & the name of the topic \\ntopicName = \\'iot-topic\\' \\n\\niotData = (spark.readStream # Get the DataStreamReader \\n .format("kafka") # Specify the source format as "kafka" \\n .option("kafka.bootstrap.servers", kafkaServer) # Configure the Kafka server name and port \\n .option("subscribe", topicName) # Subscribe to the Kafka topic \\n .option("startingOffsets", "latest") # stream to latest when we restart notebook \\n .option("maxOffsetsPerTrigger", 1000) # Throttle Kafka\\'s processing of the streams \\n .load() \\n .repartition(8) \\n .select(col("value").cast("STRING")) \\n) \\n'
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Out[42]: '\\nfrom pyspark.sql.functions import col \\nkafkaServer = "<host:port>" # Specify Host & Port & the name of the topic \\ntopicName = \\'iot-topic\\' \\n\\niotData = (spark.readStream # Get the DataStreamReader \\n .format("kafka") # Specify the source format as "kafka" \\n .option("kafka.bootstrap.servers", kafkaServer) # Configure the Kafka server name and port \\n .option("subscribe", topicName) # Subscribe to the Kafka topic \\n .option("startingOffsets", "latest") # stream to latest when we restart notebook \\n .option("maxOffsetsPerTrigger", 1000) # Throttle Kafka\\'s processing of the streams \\n .load() \\n .repartition(8) \\n .select(col("value").cast("STRING")) \\n) \\n'
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Kinesis\n* Replace Kinesis instance details - streamName, region\n* The following example assumes incoming data to have following schema\n * id, user_id, device_id, num_steps, miles_walked, calories_burnt, timestamp"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"e5dd2b83-d6b2-458f-b34d-fd06a932f94f"}}},{"cell_type":"code","source":["# Example of reading from Kinesis to pick up iot device data reporting on a user’s fitness metrics \n'''\nfrom pyspark.sql.functions import * \nkinesisDF = spark.readStream \\ \n .format(\"kinesis\") \\ \n .option(\"streamName\", \"kinesis-stream\") \\ \n .option(\"region\", \"us-east-2\") \\\n .option(\"initialPosition\", \"trim_horizon\") \\\n .load() \n\ndataDF = kinesisDF.select(col(\"data\").cast('string').alias(\"data\"))\ndataDF.createOrReplaceTempView(\"stream_data\") \n\nstream_df = spark.sql(\"\"\"\n SELECT data:id, \n data:user_id, \n data:device_id, \n cast(data:num_steps as int) as num_steps, \n cast(data:miles_walked as double) as miles_walked, \n cast(data:calories_burnt as double) as calories_burnt, \n cast(data:timestamp as timestamp) as timestamp\n FROM stream_data\"\"\") \n\n(stream_df .writeStream.format(\"delta\") \n .trigger(processingTime='30 seconds') \n .option(\"checkpointLocation\", )\n .outputMode(\"append\")\n .table(\"device_data_streaming\")) \n'''"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"7c494a09-692d-4c1a-bea7-497b32e91e76"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Out[43]: '\\nfrom pyspark.sql.functions import * \\nkinesisDF = spark.readStream \\\\ \\n .format("kinesis") \\\\ \\n .option("streamName", "kinesis-stream") \\\\ \\n .option("region", "us-east-2") .option("initialPosition", "trim_horizon") .load() \\n\\ndataDF = kinesisDF.select(col("data").cast(\\'string\\').alias("data"))\\ndataDF.createOrReplaceTempView("stream_data") \\n\\nstream_df = spark.sql("""\\n SELECT data:id, \\n data:user_id, \\n data:device_id, \\n cast(data:num_steps as int) as num_steps, \\n cast(data:miles_walked as double) as miles_walked, \\n cast(data:calories_burnt as double) as calories_burnt, \\n cast(data:timestamp as timestamp) as timestamp\\n FROM stream_data""") \\n\\n(stream_df .writeStream.format("delta") \\n .trigger(processingTime=\\'30 seconds\\') \\n .option("checkpointLocation", <checkpoint location in storage>)\\n .outputMode("append")\\n .table("device_data_streaming")) \\n'
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Out[43]: '\\nfrom pyspark.sql.functions import * \\nkinesisDF = spark.readStream \\\\ \\n .format("kinesis") \\\\ \\n .option("streamName", "kinesis-stream") \\\\ \\n .option("region", "us-east-2") .option("initialPosition", "trim_horizon") .load() \\n\\ndataDF = kinesisDF.select(col("data").cast(\\'string\\').alias("data"))\\ndataDF.createOrReplaceTempView("stream_data") \\n\\nstream_df = spark.sql("""\\n SELECT data:id, \\n data:user_id, \\n data:device_id, \\n cast(data:num_steps as int) as num_steps, \\n cast(data:miles_walked as double) as miles_walked, \\n cast(data:calories_burnt as double) as calories_burnt, \\n cast(data:timestamp as timestamp) as timestamp\\n FROM stream_data""") \\n\\n(stream_df .writeStream.format("delta") \\n .trigger(processingTime=\\'30 seconds\\') \\n .option("checkpointLocation", <checkpoint location in storage>)\\n .outputMode("append")\\n .table("device_data_streaming")) \\n'
"]}}],"execution_count":0},{"cell_type":"code","source":["'''\n#hold the lookup table in a dataframe\ndevices_df = spark.table(\"devices_lookup_tbl\") \n\n#read the incoming iot data \niot_df = spark.readStream() …. \n\n#join the 2 dataframes on device identifier \njoin_df = iot_df.join(devices_df, [‘device_id])\n\n#persist to disk\njoin_df.writeStream \\ \n .format('delta') \\ \n .outputMode('append') \\ \n .option('checkpointLocation', checkpoint_path) \\ \n .toTable(\"”devices_iot_tbl\") \n\n# At this point if new devices get registered, the devices_df will get the updates \n# Pre-delta, this would be stale data and would require the user to re-read the table each time prior to a join\n\nuniqueVisitors = iot_df\n .withWatermark(\"event_time\", \"10 minutes\") \n .dropDuplicates(\"event_time\", \"uid\") \n'''"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"00ac7b74-b7a5-4c89-a87a-c934cfbeb6cd"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Out[44]: '\\n#hold the lookup table in a dataframe\\ndevices_df = spark.table("devices_lookup_tbl") \\n\\n#read the incoming iot data \\niot_df = spark.readStream() …. \\n\\n#join the 2 dataframes on device identifier \\njoin_df = iot_df.join(devices_df, [‘device_id])\\n\\n#persist to disk\\njoin_df.writeStream \\\\ \\n .format(\\'delta\\') \\\\ \\n .outputMode(\\'append\\') \\\\ \\n .option(\\'checkpointLocation\\', checkpoint_path) \\\\ \\n .toTable("”devices_iot_tbl") \\n\\n# At this point if new devices get registered, the devices_df will get the updates \\n# Pre-delta, this would be stale data and would require the user to re-read the table each time prior to a join\\n\\nuniqueVisitors = iot_df\\n .withWatermark("event_time", "10 minutes") \\n .dropDuplicates("event_time", "uid") \\n'
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Out[44]: '\\n#hold the lookup table in a dataframe\\ndevices_df = spark.table("devices_lookup_tbl") \\n\\n#read the incoming iot data \\niot_df = spark.readStream() …. \\n\\n#join the 2 dataframes on device identifier \\njoin_df = iot_df.join(devices_df, [‘device_id])\\n\\n#persist to disk\\njoin_df.writeStream \\\\ \\n .format(\\'delta\\') \\\\ \\n .outputMode(\\'append\\') \\\\ \\n .option(\\'checkpointLocation\\', checkpoint_path) \\\\ \\n .toTable("”devices_iot_tbl") \\n\\n# At this point if new devices get registered, the devices_df will get the updates \\n# Pre-delta, this would be stale data and would require the user to re-read the table each time prior to a join\\n\\nuniqueVisitors = iot_df\\n .withWatermark("event_time", "10 minutes") \\n .dropDuplicates("event_time", "uid") \\n'
"]}}],"execution_count":0},{"cell_type":"code","source":[""],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3c125a25-7156-43ee-b0ec-d4d2b970e437"}},"outputs":[],"execution_count":0}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"Ch-04-DeltaForBatch&Streaming","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2},"language":"python","widgets":{},"notebookOrigID":386413837244879}},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Chapter04/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter04/README.md -------------------------------------------------------------------------------- /Chapter05/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter05/README.md -------------------------------------------------------------------------------- /Chapter06/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter06/README.md -------------------------------------------------------------------------------- /Chapter07/Ch-07-DeltaForDataWarehouseUseCases.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["# Use Case\n##### Warehouse use cases focus on fast queries that retrieve aggregated data by applying desired filters"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"685b0b89-37fd-41ac-9520-33cda1848a27"}}},{"cell_type":"markdown","source":["# Data Warehouse Modeling techniques\n* Industry-specific domain models such as OMG models \n * Example: Object Management Group\n * (https://www.omg.org/industries/index.htm)\n* Kimball\n * identifis the key business processes and the key business questions that needs to answer\n * key dimensions, like customer and product, that are shared across the different facts will be built once & reused\n* Inmon\n * Identifies the key subject areas, and key entities the business operates with and cares about\n * Build data marts specific for departments/LOB\n * Data warehouse is the only source of data for the different data marts\n* Data Vault methodologies\n * store raw data as-is without applying business rules\n * (https://datavaultalliance.com/)\n * has three types of entities: \n * hubs: hold all unique business keys of a subject\n * links: track all relationships between hubs (join keys)\n * satellites: hold any attributes related to a link or hub and update them as they change"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"af841da0-fa45-4926-b597-b7f35af8e6ee"}}},{"cell_type":"markdown","source":["# Cleanup"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3ea21d2c-7c55-44ea-8d34-340ebfe30e4c"}}},{"cell_type":"code","source":["# Clean prior run data files\ndbutils.fs.rm('/tmp/ch-7/', True)\n\n# Drop & recreate database\nspark.sql(\"DROP DATABASE IF EXISTS ch_7 CASCADE\")\nspark.sql(\"CREATE DATABASE ch_7 \")\nspark.sql(\"USE ch_7\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"9f1bcec4-794b-4f05-918c-59a0e88ee5a3"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"Out[1]: DataFrame[]","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"ansi","arguments":{}}},"output_type":"display_data","data":{"text/plain":["Out[1]: DataFrame[]"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Identity Columns"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"467a9705-55bb-4a34-811e-e3b29c7e5085"}}},{"cell_type":"code","source":["%sql\nCREATE TABLE IF NOT EXISTS Identity_tbl (\n pKey bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),\n name string\n);\ninsert into Identity_tbl (name) values ('a'),('b'),('c');\nselect * from Identity_tbl;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"fe399f6a-e4fb-4f54-a48b-74e87b6d9081"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[[1,"a"],[3,"c"],[2,"b"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"pKey","type":"\"long\"","metadata":"{}"},{"name":"name","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
pKeyname
1a
3c
2b
"]}}],"execution_count":0},{"cell_type":"code","source":["%sql\ndescribe history Identity_tbl;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"38c0e483-388e-43b5-878e-446d369226c6"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[[1,"2022-08-04T00:08:51.000+0000","6490153397734611","anindita.mahapatra@databricks.com","WRITE",{"mode":"Append","partitionBy":"[]"},null,["1322756337600463"],"0803-022207-5zm6vs41",0,"WriteSerializable",true,{"numFiles":"3","numOutputRows":"3","numOutputBytes":"2457"},null,"Databricks-Runtime/11.1.x-scala2.12"],[0,"2022-08-04T00:08:48.000+0000","6490153397734611","anindita.mahapatra@databricks.com","CREATE TABLE",{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},null,["1322756337600463"],"0803-022207-5zm6vs41",null,"WriteSerializable",true,{},null,"Databricks-Runtime/11.1.x-scala2.12"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"version","type":"\"long\"","metadata":"{}"},{"name":"timestamp","type":"\"timestamp\"","metadata":"{}"},{"name":"userId","type":"\"string\"","metadata":"{}"},{"name":"userName","type":"\"string\"","metadata":"{}"},{"name":"operation","type":"\"string\"","metadata":"{}"},{"name":"operationParameters","type":"{\"type\":\"map\",\"keyType\":\"string\",\"valueType\":\"string\",\"valueContainsNull\":true}","metadata":"{}"},{"name":"job","type":"{\"type\":\"struct\",\"fields\":[{\"name\":\"jobId\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"jobName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"runId\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"jobOwnerId\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"triggerType\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}","metadata":"{}"},{"name":"notebook","type":"{\"type\":\"struct\",\"fields\":[{\"name\":\"notebookId\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}","metadata":"{}"},{"name":"clusterId","type":"\"string\"","metadata":"{}"},{"name":"readVersion","type":"\"long\"","metadata":"{}"},{"name":"isolationLevel","type":"\"string\"","metadata":"{}"},{"name":"isBlindAppend","type":"\"boolean\"","metadata":"{}"},{"name":"operationMetrics","type":"{\"type\":\"map\",\"keyType\":\"string\",\"valueType\":\"string\",\"valueContainsNull\":true}","metadata":"{}"},{"name":"userMetadata","type":"\"string\"","metadata":"{}"},{"name":"engineInfo","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
versiontimestampuserIduserNameoperationoperationParametersjobnotebookclusterIdreadVersionisolationLevelisBlindAppendoperationMetricsuserMetadataengineInfo
12022-08-04T00:08:51.000+00006490153397734611anindita.mahapatra@databricks.comWRITEMap(mode -> Append, partitionBy -> [])nullList(1322756337600463)0803-022207-5zm6vs410WriteSerializabletrueMap(numFiles -> 3, numOutputRows -> 3, numOutputBytes -> 2457)nullDatabricks-Runtime/11.1.x-scala2.12
02022-08-04T00:08:48.000+00006490153397734611anindita.mahapatra@databricks.comCREATE TABLEMap(isManaged -> true, description -> null, partitionBy -> [], properties -> {})nullList(1322756337600463)0803-022207-5zm6vs41nullWriteSerializabletrueMap()nullDatabricks-Runtime/11.1.x-scala2.12
"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Restore Delta Table\n* It is more efficient using REPLACE instead of dropping and re-creating Delta Lake tables"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"cd64b9f3-9c78-44ca-a9db-2c59a71ad000"}}},{"cell_type":"code","source":["%sql\nRestore Table Identity_tbl to version as of 0;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"6ca5d09d-42a4-4603-a0a8-0e866f2f4555"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[[0,0,3,0,2457,0]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"table_size_after_restore","type":"\"long\"","metadata":"{}"},{"name":"num_of_files_after_restore","type":"\"long\"","metadata":"{}"},{"name":"num_removed_files","type":"\"long\"","metadata":"{}"},{"name":"num_restored_files","type":"\"long\"","metadata":"{}"},{"name":"removed_files_size","type":"\"long\"","metadata":"{}"},{"name":"restored_files_size","type":"\"long\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
table_size_after_restorenum_of_files_after_restorenum_removed_filesnum_restored_filesremoved_files_sizerestored_files_size
003024570
"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Replace Table"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0abfc754-2958-476c-82ac-a711995607fa"}}},{"cell_type":"code","source":["%sql\nCREATE or REPLACE TABLE Identity_tbl (\n pKey bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),\n name string\n);"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1241c059-204d-4b00-acf2-43ffd980b890"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
"]}}],"execution_count":0},{"cell_type":"code","source":["%sql\nselect * from Identity_tbl"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0382d23e-85ff-4583-a6a2-07531e5171d5"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"pKey","type":"\"long\"","metadata":"{}"},{"name":"name","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
pKeyname
"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Primary Key and Foreign Key Constraints\n* tested against DBR 11.1\n* FK/PK contraints are informational only (and can be leveraged in reporting tools etc) but the data is not enforced.\n* Table constraints are not supported with Hive Metastore"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0ee3b3f3-9879-4a9d-abda-db209105dfaf"}}},{"cell_type":"markdown","source":["%sql\n-- Create a table with a primary key\nCREATE TABLE persons(\n first_name STRING NOT NULL, \n last_name STRING NOT NULL, \n nickname STRING,\n CONSTRAINT persons_pk PRIMARY KEY(first_name, last_name)\n )\nUSING DELTA\nLOCATION '/tmp/ch-5/pk';\n\n-- create a table with a foreign key\nCREATE TABLE pets(name STRING, \n owner_first_name STRING, \n owner_last_name STRING,\n CONSTRAINT pets_persons_fk FOREIGN KEY (owner_first_name, owner_last_name) REFERENCES persons\n )\nUSING DELTA\nLOCATION '/tmp/ch-5/pk';"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"560b0998-0a82-4849-81c3-19e5a8db34bb"}}},{"cell_type":"markdown","source":["# Example Implementation"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5910d039-38c9-4658-8b85-ffe88a19e092"}}},{"cell_type":"markdown","source":["## Dimensional Modeling with Star Schema\n* Efficiently stores data, \n* Maintains history and \n* Updates data by reducing the duplication of repetitive business definitions"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bc762e44-a11d-47dc-b478-300701624ed1"}}},{"cell_type":"code","source":["%sql\nDROP DATABASE IF EXISTS ch_7 CASCADE;\nCREATE DATABASE ch_7;\nUSE ch_7;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"52afe9af-7e9d-44de-b3c3-74acc7c0fb45"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Create your fact and dimension Delta tables\n* Surrogate Keys\n * A surrogate key is a type of primary key (non-natural) that helps to identify each record uniquely\n* PK/FK constraint"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2fe8c494-09fd-4679-ab3c-f3664a1b2216"}}},{"cell_type":"markdown","source":["#### Fact table"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5c410cf5-d619-4c3a-a542-ccd80b465e75"}}},{"cell_type":"code","source":["%sql\nCREATE TABLE transaction ( \n customer_id BIGINT, \n product_sku STRING, \n tx_Date TIMESTAMP, \n units INT,\n sale_amount FLOAT\n)\nUSING DELTA;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2dd40685-b7eb-4fce-a3ca-43f2e0010f0a"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
"]}}],"execution_count":0},{"cell_type":"markdown","source":["#### Dimension tables"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"70330acc-5d21-4c69-be47-dfab23b7609b"}}},{"cell_type":"code","source":["%sql\nCREATE TABLE product ( \n product_sku STRING, \n product_name STRING, \n category STRING, \n price FLOAT\n)\nUSING DELTA;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"561d20d3-5dd4-407a-a2a5-5b9b4c6ae79d"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
"]}}],"execution_count":0},{"cell_type":"code","source":["%sql\nCREATE TABLE customer ( \n customer_id BIGINT not null GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),\n customer_name STRING, \n address STRING, \n zip STRING, \n status BOOLEAN\n)\nUSING DELTA;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f2b7163a-6309-4f7a-87a1-93b217f98396"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Optimize file size for fast file pruning\n* Data skipping can help with file pruning aand partition pruning \n* So what is Godilock zone for ideal data file size? A good file size range is 32-128MB \n* ALTER TABLE (database).(table) SET TBLPROPERTIES (delta.targetFileSize=33554432)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"c07815c3-2626-49ae-8305-1a813b64ca4c"}}},{"cell_type":"code","source":["%sql\nALTER TABLE transaction SET TBLPROPERTIES (delta.targetFileSize=33554432)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"d3f0f281-bb38-4128-89fd-6f3372ee2a29"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Create a Z-Order on fact tables\n* ZORDER BY (LARGEST_DIM_FK, NEXT_LARGEST_DIM_FK, ...)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"c74fdf25-501f-4f0f-814c-23b3b9e37487"}}},{"cell_type":"code","source":["%sql\nOPTIMIZE transaction\nZORDER BY (product_sku, customer_id);"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"df2e010a-e445-4ccb-ac7b-5516b3d272f0"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["dbfs:/user/hive/warehouse/ch_7.db/transaction",[0,0,[null,null,0.0,0,0],[null,null,0.0,0,0],0,["minCubeSize(107374182400)",[0,0],[0,0],0,[0,0],0,null],0,0,0,false,0,0,1659571755726,1659571757248]]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"path","type":"\"string\"","metadata":"{}"},{"name":"metrics","type":"{\"type\":\"struct\",\"fields\":[{\"name\":\"numFilesAdded\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numFilesRemoved\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"filesAdded\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"min\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"max\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"avg\",\"type\":\"double\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalSize\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"filesRemoved\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"min\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"max\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"avg\",\"type\":\"double\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalSize\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"partitionsOptimized\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"zOrderStats\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"strategyName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"inputCubeFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"inputOtherFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"inputNumCubes\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"mergedFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"numOutputCubes\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"mergedNumCubes\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"numBatches\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalConsideredFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFilesSkipped\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"preserveInsertionOrder\",\"type\":\"boolean\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numFilesSkippedToReduceWriteAmplification\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numBytesSkippedToReduceWriteAmplification\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"startTimeMs\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"endTimeMs\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]}","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
pathmetrics
dbfs:/user/hive/warehouse/ch_7.db/transactionList(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 0, List(minCubeSize(107374182400), List(0, 0), List(0, 0), 0, List(0, 0), 0, null), 0, 0, 0, false, 0, 0, 1659571755726, 1659571757248)
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Create Z-Orders on dimension key fields and popular predicates\n* ZORDER BY (BIG_DIM_PK, LIKELY_FIELD_1, LIKELY_FIELD_2)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"87715ba5-d124-4dba-946b-dbed59879a5e"}}},{"cell_type":"code","source":["%sql\nOPTIMIZE product\nZORDER BY (product_sku, category, price);"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"de30b55a-43aa-4d2c-ba86-6724d8a51176"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["dbfs:/user/hive/warehouse/ch_7.db/product",[0,0,[null,null,0.0,0,0],[null,null,0.0,0,0],0,["minCubeSize(107374182400)",[0,0],[0,0],0,[0,0],0,null],0,0,0,false,0,0,1659571758128,1659571758944]]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"path","type":"\"string\"","metadata":"{}"},{"name":"metrics","type":"{\"type\":\"struct\",\"fields\":[{\"name\":\"numFilesAdded\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numFilesRemoved\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"filesAdded\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"min\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"max\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"avg\",\"type\":\"double\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalSize\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"filesRemoved\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"min\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"max\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"avg\",\"type\":\"double\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalSize\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"partitionsOptimized\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"zOrderStats\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"strategyName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"inputCubeFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"inputOtherFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"inputNumCubes\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"mergedFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"numOutputCubes\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"mergedNumCubes\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"numBatches\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalConsideredFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFilesSkipped\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"preserveInsertionOrder\",\"type\":\"boolean\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numFilesSkippedToReduceWriteAmplification\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numBytesSkippedToReduceWriteAmplification\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"startTimeMs\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"endTimeMs\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]}","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
pathmetrics
dbfs:/user/hive/warehouse/ch_7.db/productList(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 0, List(minCubeSize(107374182400), List(0, 0), List(0, 0), 0, List(0, 0), 0, null), 0, 0, 0, false, 0, 0, 1659571758128, 1659571758944)
"]}}],"execution_count":0},{"cell_type":"code","source":["%sql\nOPTIMIZE customer\nZORDER BY (customer_id, zip);"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3c2b1cfc-a53f-4448-b58f-9396747d2106"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["dbfs:/user/hive/warehouse/ch_7.db/customer",[0,0,[null,null,0.0,0,0],[null,null,0.0,0,0],0,["minCubeSize(107374182400)",[0,0],[0,0],0,[0,0],0,null],0,0,0,false,0,0,1659571759784,1659571760613]]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"path","type":"\"string\"","metadata":"{}"},{"name":"metrics","type":"{\"type\":\"struct\",\"fields\":[{\"name\":\"numFilesAdded\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numFilesRemoved\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"filesAdded\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"min\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"max\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"avg\",\"type\":\"double\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalSize\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"filesRemoved\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"min\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"max\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"avg\",\"type\":\"double\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalSize\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"partitionsOptimized\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"zOrderStats\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"strategyName\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"inputCubeFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"inputOtherFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"inputNumCubes\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"mergedFiles\",\"type\":{\"type\":\"struct\",\"fields\":[{\"name\":\"num\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"numOutputCubes\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"mergedNumCubes\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]},\"nullable\":true,\"metadata\":{}},{\"name\":\"numBatches\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalConsideredFiles\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"totalFilesSkipped\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"preserveInsertionOrder\",\"type\":\"boolean\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numFilesSkippedToReduceWriteAmplification\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"numBytesSkippedToReduceWriteAmplification\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"startTimeMs\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}},{\"name\":\"endTimeMs\",\"type\":\"long\",\"nullable\":false,\"metadata\":{}}]}","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
pathmetrics
dbfs:/user/hive/warehouse/ch_7.db/customerList(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 0, List(minCubeSize(107374182400), List(0, 0), List(0, 0), 0, List(0, 0), 0, null), 0, 0, 0, false, 0, 0, 1659571759784, 1659571760613)
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## Analyze Table to gather statistics for AQE Optimizer\n* Adaptive Query Execution\n* ANALYZE TABLE MY_BIG_DIM COMPUTE STATISTICS FOR ALL COLUMNS"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"701d1be1-ea21-4c70-96dd-134f61c6387e"}}},{"cell_type":"code","source":["%sql\nANALYZE TABLE transaction COMPUTE STATISTICS FOR ALL COLUMNS;\nANALYZE TABLE product COMPUTE STATISTICS FOR ALL COLUMNS;\nANALYZE TABLE customer COMPUTE STATISTICS FOR ALL COLUMNS;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"c5ab8aa7-b8ed-420f-8f9b-8aa102c5e942"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Data Organization\n* Bronze layer is the landing area here data is retained as is\n* Silver layer is more vault like where transformations are applied\n* Gold layer is the presentation layer which is read optimized & where star/snowflake schema use aggregated data oof fact tables"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"85e0f828-bf96-4879-8dda-68542ede6d1f"}}},{"cell_type":"markdown","source":["# BI tool connectivity\n* Power BI: https://github.com/delta-io/connectors/tree/master/powerbi"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"9ddd1653-6793-4c1e-9957-135058af20c7"}}}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"Ch-07-DeltaForDataWarehouseUseCases","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2,"mostRecentlyExecutedCommandWithImplicitDF":{"commandId":1322756337600464,"dataframes":["_sqldf"]}},"language":"python","widgets":{},"notebookOrigID":1322756337600463}},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Chapter07/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter07/README.md -------------------------------------------------------------------------------- /Chapter08/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter08/README.md -------------------------------------------------------------------------------- /Chapter09/Ch-09-DeltaForReproducibleMachineLearning.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["# Delta for ML Practitioners"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"00b6a684-bd38-485d-bd2a-b22e0da2c379"}}},{"cell_type":"markdown","source":["## Setup & Cleanup"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"ae1347f7-e923-4685-9b16-ae22d572686d"}}},{"cell_type":"code","source":["# Clean prior run data files\ndbutils.fs.rm('/tmp/ch-9/', True)\n\n# Drop & recreate database\nspark.sql(\"DROP DATABASE IF EXISTS ch_9 CASCADE\")\nspark.sql(\"CREATE DATABASE ch_9 \")\nspark.sql(\"USE ch_9\")\n\n# Configure Path\nDELTALAKE_BRONZE_PATH = \"/tmp/ch-9/bronze/loan_raw\"\nDELTALAKE_SILVER_PATH = \"/tmp/ch-9/silver/loan_refined\"\n# Remove table if it exists\ndbutils.fs.rm(DELTALAKE_BRONZE_PATH, recurse=True)\ndbutils.fs.rm(DELTALAKE_SILVER_PATH, recurse=True)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1c932e44-6c89-4b9b-aa37-9eb463054edd"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Out[98]: False
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Out[98]: False
"]}}],"execution_count":0},{"cell_type":"code","source":["from pyspark.sql.functions import *\nfrom pyspark.sql.types import *\n\nschema = StructType([ \\\n StructField(\"client_id\", StringType(),True), \\\n StructField(\"loan_status\", StringType(),True), \\\n StructField(\"int_rate\",StringType(),True), \\\n StructField(\"issue_dt\",StringType(),True), \\\n StructField(\"payment_amt\", FloatType(), True), \\\n StructField(\"loan_amnt\", FloatType(), True), \\\n StructField(\"addr_state\", StringType(), True),\n StructField(\"term\", StringType(), True), \\\n StructField(\"ownership\", StringType(), True) \\\n ])\nstrutured_data = [('A123', 'Paid', '13.3%', 'Feb-2020', 2000.0, 2000.0, 'MA', '36 months', 'MORTGAGE'),\n ('B678', 'Default', '15.0%', 'Nov-2022', 2000.0, 5000.0, 'DE', '16 months','RENT'),\n ('C566', 'Not-Paid', '10.5%', 'Dec-2022', 0.0, 7000.0, 'WA', '24 months', 'MORTGAGE'),\n ('Z111', 'Paid', '1.3%', 'Oct-2020', 2000.0, 2000.0, 'MA', '36 months', 'MORTGAGE'),\n ('L231', 'Default', '5.0%', 'Oct-2020', 2000.0, 5000.0, 'DE', '16 months','RENT'),\n ('C890','Not-Paid', '10.1%', 'Aug-2020', 0.0, 7000.0, 'WA', '24 months', 'MORTGAGE')\n ]\ndf = spark.createDataFrame(data=strutured_data,schema=schema)\ndf.printSchema()\ndf.write.format(\"delta\").mode(\"append\").save(DELTALAKE_BRONZE_PATH)\nspark.sql(\"CREATE TABLE loan_raw_data USING DELTA LOCATION '\" + DELTALAKE_BRONZE_PATH + \"'\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2b202722-9666-4bb6-939c-2d87e32c1216"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
root\n |-- client_id: string (nullable = true)\n |-- loan_status: string (nullable = true)\n |-- int_rate: string (nullable = true)\n |-- issue_dt: string (nullable = true)\n |-- payment_amt: float (nullable = true)\n |-- loan_amnt: float (nullable = true)\n |-- addr_state: string (nullable = true)\n |-- term: string (nullable = true)\n |-- ownership: string (nullable = true)\n\nOut[99]: DataFrame[]
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
root\n-- client_id: string (nullable = true)\n-- loan_status: string (nullable = true)\n-- int_rate: string (nullable = true)\n-- issue_dt: string (nullable = true)\n-- payment_amt: float (nullable = true)\n-- loan_amnt: float (nullable = true)\n-- addr_state: string (nullable = true)\n-- term: string (nullable = true)\n-- ownership: string (nullable = true)\n\nOut[99]: DataFrame[]
"]}}],"execution_count":0},{"cell_type":"code","source":["strutured_data = [('P123', 'Paid', '13.3%', 'Feb-2021', 1000.0, 2000.0, 'MA', '12 months', 'MORTGAGE'),\n ('Q678', 'Default', '15.0%', 'Nov-2021', 3000.0, 5000.0, 'DE', '12 months','RENT'),\n ('R566', 'Not-Paid', '10.5%', 'Dec-2021', 0.0, 7000.0, 'WA', '12 months', 'MORTGAGE'),\n ('S111', 'Paid', '1.3%', 'Oct-2021', 8000.0, 2000.0, 'MA', '12 months', 'MORTGAGE'),\n ]\ndf = spark.createDataFrame(data=strutured_data,schema=schema)\ndf.write.format(\"delta\").mode(\"append\").save(DELTALAKE_BRONZE_PATH)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"23c197da-5656-4752-a6ad-6db480194fad"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["# Delta backed Model Management"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"358e1038-9034-4ec4-84f6-88c2e640b7e8"}}},{"cell_type":"markdown","source":["## 1. Data Preparation\n* EDA\n* Featurization"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3d25dce4-ee80-4a7d-90ff-5f79558d5157"}}},{"cell_type":"code","source":["df = spark.sql('SELECT * FROM loan_raw_data VERSION AS OF 0')\ndisplay(df.summary())"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f261500a-8441-403d-8469-b6e738a8e42b"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["count","6","6","6","6","6","6","6","6","6"],["mean",null,null,null,null,"1333.3333333333333","4666.666666666667",null,null,null],["stddev",null,null,null,null,"1032.7955589886444","2250.925735484551",null,null,null],["min","A123","Default","1.3%","Aug-2020","0.0","2000.0","DE","16 months","MORTGAGE"],["25%",null,null,null,null,"0.0","2000.0",null,null,null],["50%",null,null,null,null,"2000.0","5000.0",null,null,null],["75%",null,null,null,null,"2000.0","7000.0",null,null,null],["max","Z111","Paid","5.0%","Oct-2020","2000.0","7000.0","WA","36 months","RENT"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"summary","type":"\"string\"","metadata":"{}"},{"name":"client_id","type":"\"string\"","metadata":"{}"},{"name":"loan_status","type":"\"string\"","metadata":"{}"},{"name":"int_rate","type":"\"string\"","metadata":"{}"},{"name":"issue_dt","type":"\"string\"","metadata":"{}"},{"name":"payment_amt","type":"\"string\"","metadata":"{}"},{"name":"loan_amnt","type":"\"string\"","metadata":"{}"},{"name":"addr_state","type":"\"string\"","metadata":"{}"},{"name":"term","type":"\"string\"","metadata":"{}"},{"name":"ownership","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
summaryclient_idloan_statusint_rateissue_dtpayment_amtloan_amntaddr_statetermownership
count666666666
meannullnullnullnull1333.33333333333334666.666666666667nullnullnull
stddevnullnullnullnull1032.79555898864442250.925735484551nullnullnull
minA123Default1.3%Aug-20200.02000.0DE16 monthsMORTGAGE
25%nullnullnullnull0.02000.0nullnullnull
50%nullnullnullnull2000.05000.0nullnullnull
75%nullnullnullnull2000.07000.0nullnullnull
maxZ111Paid5.0%Oct-20202000.07000.0WA36 monthsRENT
"]}}],"execution_count":0},{"cell_type":"code","source":["# Select only the columns needed\nloan_stats = df.select(\"loan_status\", \"int_rate\", \"issue_dt\", \"payment_amt\", \"loan_amnt\", \"addr_state\", \"term\", \"ownership\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1816f040-5c36-44f3-9104-3b481d8f52cf"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["### Delta backed Feature Store table\n* Stores expensive features that are computed once and used across use cases\n* Used at both training and inferencing time to prevent drift\n* Allows for feature trend comparison\n* Allows for rollback to a prior version"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"8cb5f0cb-3356-4819-b4bf-6e5a503a63ce"}}},{"cell_type":"code","source":["from pyspark.sql.functions import *\n\nprint(\"------------------------------------------------------------------------------------------------\")\nprint(\"Create bad loan label, this will include charged off, defaulted, and late repayments on loans...\")\nloan_stats = loan_stats.filter(loan_stats.loan_status.isin([\"Default\", \"Not-Paid\", \"Paid\"]))\\\n .withColumn(\"bad_loan\", (~(loan_stats.loan_status == \"Paid\")).cast(\"string\"))\n\nprint(\"------------------------------------------------------------------------------------------------\")\nprint(\"Turning string interest rate and revoling util columns into numeric columns...\")\nloan_stats = loan_stats.withColumn('int_rate', regexp_replace('int_rate', '%', '').cast('float')) \\\n .withColumn('issue_year', substring(loan_stats.issue_dt, 5, 4).cast('double') ) "],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f7090ad6-2052-4a66-a6d3-b41884a6cef2"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
------------------------------------------------------------------------------------------------\nCreate bad loan label, this will include charged off, defaulted, and late repayments on loans...\n------------------------------------------------------------------------------------------------\nTurning string interest rate and revoling util columns into numeric columns...\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
------------------------------------------------------------------------------------------------\nCreate bad loan label, this will include charged off, defaulted, and late repayments on loans...\n------------------------------------------------------------------------------------------------\nTurning string interest rate and revoling util columns into numeric columns...\n
"]}}],"execution_count":0},{"cell_type":"code","source":["# Save table as Delta Lake\nloan_stats.write.format(\"delta\").mode(\"overwrite\").save(DELTALAKE_SILVER_PATH)\n\n# Re-read as Delta Lake\nloan_stats = spark.read.format(\"delta\").load(DELTALAKE_SILVER_PATH)\n\n# Review data\ndisplay(loan_stats)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"9e68cd9b-ba77-4a24-9199-0320b467ece0"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["Paid",1.3,"Oct-2020",2000.0,2000.0,"MA","36 months","MORTGAGE","false",2020.0],["Not-Paid",10.1,"Aug-2020",0.0,7000.0,"WA","24 months","MORTGAGE","true",2020.0],["Not-Paid",10.5,"Dec-2022",0.0,7000.0,"WA","24 months","MORTGAGE","true",2022.0],["Paid",13.3,"Feb-2020",2000.0,2000.0,"MA","36 months","MORTGAGE","false",2020.0],["Default",5.0,"Oct-2020",2000.0,5000.0,"DE","16 months","RENT","true",2020.0],["Default",15.0,"Nov-2022",2000.0,5000.0,"DE","16 months","RENT","true",2022.0]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"loan_status","type":"\"string\"","metadata":"{}"},{"name":"int_rate","type":"\"float\"","metadata":"{}"},{"name":"issue_dt","type":"\"string\"","metadata":"{}"},{"name":"payment_amt","type":"\"float\"","metadata":"{}"},{"name":"loan_amnt","type":"\"float\"","metadata":"{}"},{"name":"addr_state","type":"\"string\"","metadata":"{}"},{"name":"term","type":"\"string\"","metadata":"{}"},{"name":"ownership","type":"\"string\"","metadata":"{}"},{"name":"bad_loan","type":"\"string\"","metadata":"{}"},{"name":"issue_year","type":"\"double\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
loan_statusint_rateissue_dtpayment_amtloan_amntaddr_statetermownershipbad_loanissue_year
Paid1.3Oct-20202000.02000.0MA36 monthsMORTGAGEfalse2020.0
Not-Paid10.1Aug-20200.07000.0WA24 monthsMORTGAGEtrue2020.0
Not-Paid10.5Dec-20220.07000.0WA24 monthsMORTGAGEtrue2022.0
Paid13.3Feb-20202000.02000.0MA36 monthsMORTGAGEfalse2020.0
Default5.0Oct-20202000.05000.0DE16 monthsRENTtrue2020.0
Default15.0Nov-20222000.05000.0DE16 monthsRENTtrue2022.0
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## 2. Model Preparation\n* Training data/version\n* Model Metrics\n* Drift Thresholds"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"cd1cfcfe-4a7a-4606-bb6f-27b23a96e850"}}},{"cell_type":"markdown","source":["### Consistent dataset for all ML Experiments to ensure fair comparison\n* Use Delta Time Travel capabilities instead of making multiple versions of the data\n* Log the data version used for each training run\n * MLFlow is a good option to track model management: https://mlflow.org/"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"84a72af0-935d-43b9-acd4-bf41ada658b7"}}},{"cell_type":"code","source":[""],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f87537d6-a04d-4340-8cf5-20a3b8d3db4c"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["### Building a ML Pipeline with Delta\n* Multi-hop\n* Combine real-time streaming data with historical data in the same Delta table to be used for training\n* Combine structured, semi-structured and un-structurd data into same Delta table for training\n* Data Cleansing, transformations on Delta table using Spark APIs\n* Schema Enforcement and Schemaa Evolution"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"da7bf9cb-9a6b-4203-9e2e-501ce2860e93"}}},{"cell_type":"code","source":["spark.sql(\"DROP TABLE IF EXISTS loan_stats\")\nspark.sql(\"CREATE TABLE loan_stats USING DELTA LOCATION '\" + DELTALAKE_SILVER_PATH + \"'\")"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"586bb183-23d3-435d-8cb5-341981f4a008"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
Out[105]: DataFrame[]
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
Out[105]: DataFrame[]
"]}}],"execution_count":0},{"cell_type":"code","source":["# Add the mergeSchema option\nloan_stats.write.option(\"mergeSchema\",\"true\").format(\"delta\").mode(\"overwrite\").save(DELTALAKE_SILVER_PATH)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"066473b4-338c-465c-aa9d-85a8781ffdae"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"code","source":["myY = \"bad_loan\"\ncategoricals = [\"term\", \"ownership\", \"addr_state\"]\nnumerics = [\"loan_amnt\",\"payment_amt\"]\nmyX = categoricals + numerics\n\nloan_stats2 = loan_stats.select(myX + [myY, \"int_rate\", \"issue_year\"])\ntrain = loan_stats2.filter(loan_stats2.issue_year <= 2020).cache()\nvalid = loan_stats2.filter(loan_stats2.issue_year > 2020).cache()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2d65a508-4c5d-4764-9422-b4508f979067"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
"]}}],"execution_count":0},{"cell_type":"code","source":["from pyspark.ml import Pipeline\nfrom pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder\nfrom pyspark.ml.feature import StandardScaler, Imputer\nfrom pyspark.ml.classification import LogisticRegression\nfrom pyspark.ml.evaluation import BinaryClassificationEvaluator\nfrom pyspark.ml.tuning import CrossValidator, ParamGridBuilder\n\n## Current possible ways to handle categoricals in string indexer is 'error', 'keep', and 'skip'\nindexers = map(lambda c: StringIndexer(inputCol=c, outputCol=c+\"_idx\", handleInvalid = 'keep'), categoricals)\nohes = map(lambda c: OneHotEncoder(inputCol=c + \"_idx\", outputCol=c+\"_class\"),categoricals)\nimputers = Imputer(inputCols = numerics, outputCols = numerics)\n\n# Establish features columns\nfeatureCols = list(map(lambda c: c+\"_class\", categoricals)) + numerics\n\n# Build the stage for the ML pipeline\n# Build the stage for the ML pipeline\nmodel_matrix_stages = list(indexers) + list(ohes) + [imputers] + \\\n [VectorAssembler(inputCols=featureCols, outputCol=\"features\"), StringIndexer(inputCol=\"bad_loan\", outputCol=\"label\")]\n\n# Apply StandardScaler to create scaledFeatures\nscaler = StandardScaler(inputCol=\"features\",\n outputCol=\"scaledFeatures\",\n withStd=True,\n withMean=True)\n\n# Use logistic regression \nlr = LogisticRegression(maxIter=10, elasticNetParam=0.5, featuresCol = \"scaledFeatures\")\n\n# Build our ML pipeline\npipeline = Pipeline(stages=model_matrix_stages+[scaler]+[lr])\n\n# Build the parameter grid for model tuning\nparamGrid = ParamGridBuilder() \\\n .addGrid(lr.regParam, [0.1, 0.01]) \\\n .build()\n\n# Execute CrossValidator for model tuning\ncrossval = CrossValidator(estimator=pipeline,\n estimatorParamMaps=paramGrid,\n evaluator=BinaryClassificationEvaluator(),\n numFolds=5)\n\n# Train the tuned model and establish our best model\ncvModel = crossval.fit(train)\nglm_model = cvModel.bestModel\n\n# Return ROC\nlr_summary = glm_model.stages[len(glm_model.stages)-1].summary\ndisplay(lr_summary.roc)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"0de84382-82f8-4e80-b307-9f873b407ab4"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[[0.0,0.0],[0.0,1.0],[0.3333333333333333,1.0],[1.0,1.0],[1.0,1.0]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"FPR","type":"\"double\"","metadata":"{}"},{"name":"TPR","type":"\"double\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
FPRTPR
0.00.0
0.01.0
0.33333333333333331.0
1.01.0
1.01.0
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## 3. Model Serving\n* Testing and comparison against different architecture types\n* Deployment/Serving\n* Inferencing"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"a621a61c-f618-4522-a3af-1181f3e4e0af"}}},{"cell_type":"code","source":["display(glm_model.transform(valid))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3702195a-edbf-420a-940d-f37233686fb9"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["16 months","RENT","DE",5000.0,2000.0,"true",15.0,2022.0,2.0,1.0,2.0,{"vectorType":"sparse","length":3,"indices":[2],"values":[1.0]},{"vectorType":"sparse","length":2,"indices":[1],"values":[1.0]},{"vectorType":"sparse","length":3,"indices":[2],"values":[1.0]},{"vectorType":"sparse","length":10,"indices":[2,4,7,8,9],"values":[1.0,1.0,1.0,5000.0,2000.0]},0.0,{"vectorType":"dense","length":10,"values":[-0.7302967433402215,-0.7302967433402215,1.788854381999832,-1.788854381999832,1.788854381999832,-0.7302967433402215,-0.7302967433402215,1.788854381999832,0.15936381457791912,0.7302967433402214]},{"vectorType":"dense","length":2,"values":[2.0477153332452165,-2.0477153332452165]},{"vectorType":"dense","length":2,"values":[0.8857165622211622,0.11428343777883776]},0.0]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"term","type":"\"string\"","metadata":"{}"},{"name":"ownership","type":"\"string\"","metadata":"{}"},{"name":"addr_state","type":"\"string\"","metadata":"{}"},{"name":"loan_amnt","type":"\"float\"","metadata":"{}"},{"name":"payment_amt","type":"\"float\"","metadata":"{}"},{"name":"bad_loan","type":"\"string\"","metadata":"{}"},{"name":"int_rate","type":"\"float\"","metadata":"{}"},{"name":"issue_year","type":"\"double\"","metadata":"{}"},{"name":"term_idx","type":"\"double\"","metadata":"{\"ml_attr\":{\"vals\":[\"24 months\",\"36 months\",\"16 months\",\"__unknown\"],\"type\":\"nominal\",\"name\":\"term_idx\"}}"},{"name":"ownership_idx","type":"\"double\"","metadata":"{\"ml_attr\":{\"vals\":[\"MORTGAGE\",\"RENT\",\"__unknown\"],\"type\":\"nominal\",\"name\":\"ownership_idx\"}}"},{"name":"addr_state_idx","type":"\"double\"","metadata":"{\"ml_attr\":{\"vals\":[\"MA\",\"WA\",\"DE\",\"__unknown\"],\"type\":\"nominal\",\"name\":\"addr_state_idx\"}}"},{"name":"term_class","type":"{\"type\":\"udt\",\"class\":\"org.apache.spark.ml.linalg.VectorUDT\",\"pyClass\":\"pyspark.ml.linalg.VectorUDT\",\"sqlType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"type\",\"type\":\"byte\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"indices\",\"type\":{\"type\":\"array\",\"elementType\":\"integer\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}},{\"name\":\"values\",\"type\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}}","metadata":"{\"ml_attr\":{\"attrs\":{\"binary\":[{\"idx\":0,\"name\":\"24 months\"},{\"idx\":1,\"name\":\"36 months\"},{\"idx\":2,\"name\":\"16 months\"}]},\"num_attrs\":3}}"},{"name":"ownership_class","type":"{\"type\":\"udt\",\"class\":\"org.apache.spark.ml.linalg.VectorUDT\",\"pyClass\":\"pyspark.ml.linalg.VectorUDT\",\"sqlType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"type\",\"type\":\"byte\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"indices\",\"type\":{\"type\":\"array\",\"elementType\":\"integer\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}},{\"name\":\"values\",\"type\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}}","metadata":"{\"ml_attr\":{\"attrs\":{\"binary\":[{\"idx\":0,\"name\":\"MORTGAGE\"},{\"idx\":1,\"name\":\"RENT\"}]},\"num_attrs\":2}}"},{"name":"addr_state_class","type":"{\"type\":\"udt\",\"class\":\"org.apache.spark.ml.linalg.VectorUDT\",\"pyClass\":\"pyspark.ml.linalg.VectorUDT\",\"sqlType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"type\",\"type\":\"byte\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"indices\",\"type\":{\"type\":\"array\",\"elementType\":\"integer\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}},{\"name\":\"values\",\"type\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}}","metadata":"{\"ml_attr\":{\"attrs\":{\"binary\":[{\"idx\":0,\"name\":\"MA\"},{\"idx\":1,\"name\":\"WA\"},{\"idx\":2,\"name\":\"DE\"}]},\"num_attrs\":3}}"},{"name":"features","type":"{\"type\":\"udt\",\"class\":\"org.apache.spark.ml.linalg.VectorUDT\",\"pyClass\":\"pyspark.ml.linalg.VectorUDT\",\"sqlType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"type\",\"type\":\"byte\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"indices\",\"type\":{\"type\":\"array\",\"elementType\":\"integer\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}},{\"name\":\"values\",\"type\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}}","metadata":"{\"ml_attr\":{\"attrs\":{\"numeric\":[{\"idx\":8,\"name\":\"loan_amnt\"},{\"idx\":9,\"name\":\"payment_amt\"}],\"binary\":[{\"idx\":0,\"name\":\"term_class_24 months\"},{\"idx\":1,\"name\":\"term_class_36 months\"},{\"idx\":2,\"name\":\"term_class_16 months\"},{\"idx\":3,\"name\":\"ownership_class_MORTGAGE\"},{\"idx\":4,\"name\":\"ownership_class_RENT\"},{\"idx\":5,\"name\":\"addr_state_class_MA\"},{\"idx\":6,\"name\":\"addr_state_class_WA\"},{\"idx\":7,\"name\":\"addr_state_class_DE\"}]},\"num_attrs\":10}}"},{"name":"label","type":"\"double\"","metadata":"{\"ml_attr\":{\"vals\":[\"true\",\"false\"],\"type\":\"nominal\",\"name\":\"label\"}}"},{"name":"scaledFeatures","type":"{\"type\":\"udt\",\"class\":\"org.apache.spark.ml.linalg.VectorUDT\",\"pyClass\":\"pyspark.ml.linalg.VectorUDT\",\"sqlType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"type\",\"type\":\"byte\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"indices\",\"type\":{\"type\":\"array\",\"elementType\":\"integer\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}},{\"name\":\"values\",\"type\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}}","metadata":"{\"ml_attr\":{\"num_attrs\":10}}"},{"name":"rawPrediction","type":"{\"type\":\"udt\",\"class\":\"org.apache.spark.ml.linalg.VectorUDT\",\"pyClass\":\"pyspark.ml.linalg.VectorUDT\",\"sqlType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"type\",\"type\":\"byte\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"indices\",\"type\":{\"type\":\"array\",\"elementType\":\"integer\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}},{\"name\":\"values\",\"type\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}}","metadata":"{\"ml_attr\":{\"num_attrs\":2}}"},{"name":"probability","type":"{\"type\":\"udt\",\"class\":\"org.apache.spark.ml.linalg.VectorUDT\",\"pyClass\":\"pyspark.ml.linalg.VectorUDT\",\"sqlType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"type\",\"type\":\"byte\",\"nullable\":false,\"metadata\":{}},{\"name\":\"size\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"indices\",\"type\":{\"type\":\"array\",\"elementType\":\"integer\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}},{\"name\":\"values\",\"type\":{\"type\":\"array\",\"elementType\":\"double\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}}","metadata":"{\"ml_attr\":{\"num_attrs\":2}}"},{"name":"prediction","type":"\"double\"","metadata":"{\"ml_attr\":{\"type\":\"nominal\",\"num_vals\":2}}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
termownershipaddr_stateloan_amntpayment_amtbad_loanint_rateissue_yearterm_idxownership_idxaddr_state_idxterm_classownership_classaddr_state_classfeatureslabelscaledFeaturesrawPredictionprobabilityprediction
16 monthsRENTDE5000.02000.0true15.02022.02.01.02.0Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))Map(vectorType -> sparse, length -> 2, indices -> List(1), values -> List(1.0))Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))Map(vectorType -> sparse, length -> 10, indices -> List(2, 4, 7, 8, 9), values -> List(1.0, 1.0, 1.0, 5000.0, 2000.0))0.0Map(vectorType -> dense, length -> 10, values -> List(-0.7302967433402215, -0.7302967433402215, 1.788854381999832, -1.788854381999832, 1.788854381999832, -0.7302967433402215, -0.7302967433402215, 1.788854381999832, 0.15936381457791912, 0.7302967433402214))Map(vectorType -> dense, length -> 2, values -> List(2.0477153332452165, -2.0477153332452165))Map(vectorType -> dense, length -> 2, values -> List(0.8857165622211622, 0.11428343777883776))0.0
"]}}],"execution_count":0},{"cell_type":"code","source":["from pyspark.mllib.evaluation import BinaryClassificationMetrics\nfrom pyspark.ml.linalg import Vectors\n\ndef extract(row):\n return tuple(row.probability.toArray().tolist()) + (row.label,) + (row.prediction,)\n\ndef score(model,data):\n pred = model.transform(data).select( \"probability\", \"label\", \"prediction\")\n pred = pred.rdd.map(extract).toDF([\"p0\", \"p1\", \"label\", \"prediction\"])\n return pred \n\ndef auc(pred):\n metric = BinaryClassificationMetrics(pred.select(\"p1\", \"label\").rdd)\n return metric.areaUnderROC\n\nglm_train = score(glm_model, train)\nglm_valid = score(glm_model, valid)\n\nglm_train.createOrReplaceTempView(\"glm_train\")\nglm_valid.createOrReplaceTempView(\"glm_valid\")\n\nprint (\"GLM Training AUC:\" + str(auc(glm_train)))\nprint (\"GLM Validation AUC :\" + str(auc(glm_valid)))"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"8aa5e96c-06a9-4302-8b5c-6b0a329d3da3"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"
GLM Training AUC:1.0\nGLM Validation AUC :0.0\n
","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["\n
GLM Training AUC:1.0\nGLM Validation AUC :0.0\n
"]}}],"execution_count":0},{"cell_type":"markdown","source":["## 4. Model Monitoring\n* Drift Detection\n * Data/Feature\n * Model"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"8359aa8f-ea8b-45bb-b376-f363cf4ea788"}}},{"cell_type":"code","source":["df0 = spark.sql('SELECT * FROM loan_raw_data VERSION AS OF 0')\ndisplay(df0.summary())"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"25a68911-f6c4-4194-9dd3-3850359327e9"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["count","6","6","6","6","6","6","6","6","6"],["mean",null,null,null,null,"1333.3333333333333","4666.666666666667",null,null,null],["stddev",null,null,null,null,"1032.7955589886444","2250.925735484551",null,null,null],["min","A123","Default","1.3%","Aug-2020","0.0","2000.0","DE","16 months","MORTGAGE"],["25%",null,null,null,null,"0.0","2000.0",null,null,null],["50%",null,null,null,null,"2000.0","5000.0",null,null,null],["75%",null,null,null,null,"2000.0","7000.0",null,null,null],["max","Z111","Paid","5.0%","Oct-2020","2000.0","7000.0","WA","36 months","RENT"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"summary","type":"\"string\"","metadata":"{}"},{"name":"client_id","type":"\"string\"","metadata":"{}"},{"name":"loan_status","type":"\"string\"","metadata":"{}"},{"name":"int_rate","type":"\"string\"","metadata":"{}"},{"name":"issue_dt","type":"\"string\"","metadata":"{}"},{"name":"payment_amt","type":"\"string\"","metadata":"{}"},{"name":"loan_amnt","type":"\"string\"","metadata":"{}"},{"name":"addr_state","type":"\"string\"","metadata":"{}"},{"name":"term","type":"\"string\"","metadata":"{}"},{"name":"ownership","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
summaryclient_idloan_statusint_rateissue_dtpayment_amtloan_amntaddr_statetermownership
count666666666
meannullnullnullnull1333.33333333333334666.666666666667nullnullnull
stddevnullnullnullnull1032.79555898864442250.925735484551nullnullnull
minA123Default1.3%Aug-20200.02000.0DE16 monthsMORTGAGE
25%nullnullnullnull0.02000.0nullnullnull
50%nullnullnullnull2000.05000.0nullnullnull
75%nullnullnullnull2000.07000.0nullnullnull
maxZ111Paid5.0%Oct-20202000.07000.0WA36 monthsRENT
"]}}],"execution_count":0},{"cell_type":"code","source":["df1 = spark.sql('SELECT * FROM loan_raw_data VERSION AS OF 1')\ndisplay(df1.summary())"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"36002bbb-fa11-41bc-9321-dacc0bcae55f"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"overflow":false,"datasetInfos":[],"data":[["count","10","10","10","10","10","10","10","10","10"],["mean",null,null,null,null,"2000.0","4400.0",null,null,null],["stddev",null,null,null,null,"2357.0226039551585","2221.1108331943574",null,null,null],["min","A123","Default","1.3%","Aug-2020","0.0","2000.0","DE","12 months","MORTGAGE"],["25%",null,null,null,null,"0.0","2000.0",null,null,null],["50%",null,null,null,null,"2000.0","5000.0",null,null,null],["75%",null,null,null,null,"2000.0","7000.0",null,null,null],["max","Z111","Paid","5.0%","Oct-2021","8000.0","7000.0","WA","36 months","RENT"]],"plotOptions":{"displayType":"table","customPlotOptions":{},"pivotColumns":null,"pivotAggregation":null,"xColumns":null,"yColumns":null},"columnCustomDisplayInfos":{},"aggType":"","isJsonSchema":true,"removedWidgets":[],"aggSchema":[],"schema":[{"name":"summary","type":"\"string\"","metadata":"{}"},{"name":"client_id","type":"\"string\"","metadata":"{}"},{"name":"loan_status","type":"\"string\"","metadata":"{}"},{"name":"int_rate","type":"\"string\"","metadata":"{}"},{"name":"issue_dt","type":"\"string\"","metadata":"{}"},{"name":"payment_amt","type":"\"string\"","metadata":"{}"},{"name":"loan_amnt","type":"\"string\"","metadata":"{}"},{"name":"addr_state","type":"\"string\"","metadata":"{}"},{"name":"term","type":"\"string\"","metadata":"{}"},{"name":"ownership","type":"\"string\"","metadata":"{}"}],"aggError":"","aggData":[],"addedWidgets":{},"metadata":{},"dbfsResultPath":null,"type":"table","aggOverflow":false,"aggSeriesLimitReached":false,"arguments":{}}},"output_type":"display_data","data":{"text/html":["
summaryclient_idloan_statusint_rateissue_dtpayment_amtloan_amntaddr_statetermownership
count101010101010101010
meannullnullnullnull2000.04400.0nullnullnull
stddevnullnullnullnull2357.02260395515852221.1108331943574nullnullnull
minA123Default1.3%Aug-20200.02000.0DE12 monthsMORTGAGE
25%nullnullnullnull0.02000.0nullnullnull
50%nullnullnullnull2000.05000.0nullnullnull
75%nullnullnullnull2000.07000.0nullnullnull
maxZ111Paid5.0%Oct-20218000.07000.0WA36 monthsRENT
"]}}],"execution_count":0},{"cell_type":"code","source":[""],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"d81761a7-0285-44f0-a680-345b9254eda9"}},"outputs":[],"execution_count":0}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"Ch-09-DeltaForReproducibleMachineLearning","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2,"mostRecentlyExecutedCommandWithImplicitDF":{"commandId":844573653885235,"dataframes":["_sqldf"]}},"language":"python","widgets":{},"notebookOrigID":4384948881437005}},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Chapter09/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter09/README.md -------------------------------------------------------------------------------- /Chapter10/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter10/README.md -------------------------------------------------------------------------------- /Chapter11/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter11/README.md -------------------------------------------------------------------------------- /Chapter12/Ch-12-DeltaPerformance.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["# Utility function to generate data\n* Requirement: pip install faker"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"4e40241d-5097-438b-a9a8-fa44cd3d7ebf"}}},{"cell_type":"code","source":["%pip install faker"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"28e3cdaa-0439-4bf4-bb36-bea648a6dc81"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["from faker import Faker\nfrom faker.providers import BaseProvider\n\nimport random\n\ncols = [\"COMP_CODE\", \"GL_ACCOUNT\", \"FISC_YEAR\", \"BALANCE\", \"CURRENCY\", \"CURRENCY_ISO\", \"CODE\", \"MESSAGE\"]\n# Create a customer provider for generating random salaries and ages.\nclass CustomProvider(BaseProvider):\n def COMP_CODE(self):\n comp_code_range = range(1000, 1006)\n return random.choice(comp_code_range)\n \n def GL_ACCOUNT(self):\n gl_account_range = range(1000, 9999)\n return '000000' + str(random.choice(gl_account_range))\n \n def FISC_YEAR(self):\n year_range = range(2010, 2019)\n return random.choice(year_range)\n\n def BALANCE(self):\n balance_range = range(-10, 1000000)\n return random.choice(balance_range)\n\nfaker = Faker()\nfaker.add_provider(CustomProvider)\n\n\n# Generate data for 4 columns: name, age, job and salary.\ndef gen_data_v1(num: int) -> list:\n return [(faker.COMP_CODE(), faker.GL_ACCOUNT(), int(faker.FISC_YEAR()), float(faker.BALANCE()), 'USD', '$', 'FN028', '..notes..') for _ in range(num)]"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"d5cdeca6-50a8-4502-ba8b-b06c8192bf56"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["# Clean prior run data files"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"d7c86f4d-eafb-478b-9599-c504870e317e"}}},{"cell_type":"code","source":["dbutils.fs.rm('/tmp/ch-12/', True)\n\n# Drop & recreate database\nspark.sql(\"DROP DATABASE IF EXISTS ch_12 CASCADE\")\nspark.sql(\"CREATE DATABASE ch_12 \")\nspark.sql(\"USE ch_12\")\n\n# Configure Path\nDELTALAKE_PATH = \"/tmp/ch-12/data\"\n\n# Remove table if it exists\ndbutils.fs.rm(DELTALAKE_PATH, recurse=True)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5f9eb3c6-874b-4277-98bd-ec6477177f45"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Create a Delta table"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5ebca747-8d43-4b45-994e-ea7458d1e89a"}}},{"cell_type":"code","source":["df_0 = spark.createDataFrame(gen_data_v1(1000)).toDF(*cols)\ndf_0.write.format('delta').partitionBy('FISC_YEAR').save(DELTALAKE_PATH)\n\ns_sql = \"CREATE TABLE IF NOT EXISTS perf_test USING delta LOCATION '\" + DELTALAKE_PATH+ \"'\"\nspark.sql(s_sql)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bceefe64-9844-4c59-9f14-7991dfda2c9c"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Simulate new data coming in"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b3ba73fe-9682-4146-9337-c20c5ea74847"}}},{"cell_type":"code","source":["for i in range(5):\n df = spark.createDataFrame(gen_data_v1(1000)).toDF(*cols)\n df.write.format('delta').mode('append').save(DELTALAKE_PATH)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"164bfd97-f223-4ff1-ac8f-8660dfc3115e"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["# Tune Table Properties & spark settings"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"cd5570a8-107f-4bbb-aeae-b8bc6603b1cc"}}},{"cell_type":"markdown","source":["## Optimize Writes\n* combines multiple small files to reduce the number of disk I/O operations"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"6b5a5610-4013-4a4f-9db3-9ba75231c670"}}},{"cell_type":"code","source":["%sql\nALTER TABLE perf_test SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3fcbb29d-c5b9-4bac-ae81-8e0a862c93eb"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Randomize File Prefixes to avoid hotspots\n * ALTER TABLE SET TBLPROPERTIES 'delta.randomizeFilePrefixes' = 'true'"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5e666076-7a69-4f41-893e-ac34186b3157"}}},{"cell_type":"code","source":["%sql\nALTER TABLE perf_test SET TBLPROPERTIES ('delta.randomizeFilePrefixes' = 'true');"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"834de500-33f8-4b39-994e-207c6218d32e"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Other specialized settings (specific to Databricks Runtime)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bd07948f-4a46-4141-a150-34b758d92dea"}}},{"cell_type":"markdown","source":["### Dynamic File Pruning\nSET spark.databricks.optimizer.dynamicFilePruning = true;\n* Useful for non-partitioned tables, or for joins on non-partitioned columns.
"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1e9cd102-2e28-4f3d-bcdd-a003df50f67e"}}},{"cell_type":"markdown","source":["#### Max file size on disk\nSET spark.databricks.delta.optimize.maxFileSize = 1610612736;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"06c09fd6-d784-412b-bc90-adbf3140225c"}}},{"cell_type":"markdown","source":["#### Join\nSET spark.databricks.optimizer.dynamicFilePruning = true;\n* Number of files of the Delta table on the probe side of the join required to trigger dynamic file pruning\n* minimum table size on the probe side of the join required to trigger (DFP) \n* default is 10G
"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"3ebb767b-e7e6-497f-85ae-f159f79996bd"}}},{"cell_type":"markdown","source":["### IO Caching (Delta Cache)\nSET spark.databricks.io.cache.enabled = true;
\nSET spark.databricks.io.cache.maxDiskUsage = <> ;
\nSET spark.databricks.io.cache.maxMetaDataCache = <> ;
\nSET spark.databricks.io.cache.compression.enabled = true;
\nCACHE SELECT * FROM perf_test;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"685d454a-17e5-40b4-bf25-a542337d66cd"}}},{"cell_type":"markdown","source":["### Optimize Join performance (Range & Skew Joins using hints)\nSET spark.databricks.optimizer.rangeJoin.binSize=5;
\n \nSELECT /*+ RANGE_JOIN(points, 10) */ *
\nFROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f4e82161-4fa9-4cd5-bcfa-3e262d3bc39e"}}},{"cell_type":"markdown","source":["### Enable Low Shuffle Merge\nSET spark.databricks.delta.merge.enableLowShuffle = true;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"49fd51cd-18ae-404f-ab7f-92051d23a9b6"}}},{"cell_type":"markdown","source":["# Optimize (file management)\n* OPTIMIZE [WHERE ] ZORDER BY ([, …]) \n* Combine small filess into larger one on disk\n* Aids file skipping\n* Bin packing is idempotent meaning 2nd run without any new data does not have any impact on data layout"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"4432093a-f65f-4d14-8558-2da5befc362e"}}},{"cell_type":"code","source":["%sql\n-- optimize entire table\nOPTIMIZE perf_test;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"d4856358-9e6a-4d2c-805d-021967b1290d"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["%sql\n-- optimize only subset of table ex. recent data\nOPTIMIZE perf_test WHERE FISC_YEAR >= 2015;"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"41083ae2-ef59-4b76-ac5a-af08381ee1b6"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["# ZORDER\n* Best applied on increment data \n* it is not idempotent"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1a237d16-9db1-4db3-85e8-4d5f64f3489d"}}},{"cell_type":"code","source":["%sql\nOPTIMIZE perf_test\nWHERE FISC_YEAR >= 2015 \nZORDER BY (Comp_Code);"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"80aa7707-1b72-4c80-9bd5-8bc3941027da"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["# Bloom Filter\n* Databricks specific feature"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"c7277d38-eb6a-4998-bc88-69c1db91ac89"}}},{"cell_type":"code","source":["%sql\n-- Enable the Bloom filter index capability \nSET spark.databricks.io.skipping.bloomFilter.enabled = true; \n\nCREATE BLOOMFILTER INDEX \nON TABLE perf_test\nFOR COLUMNS(balance OPTIONS (fpp=0.1, numItems=50000000));"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"35486962-6fdd-43e4-8905-c2dbb588e463"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["# Use Delta APIs"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"7c2bed55-72a6-43ba-ad70-5aac79ff4baf"}}},{"cell_type":"code","source":["from delta.tables import * \n\ndeltaTable = DeltaTable.forPath(spark, DELTALAKE_PATH) "],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"86adb60a-9484-4e25-8ebf-1a727ec5c0b4"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["deltaTable."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"410e6a57-6dca-417b-9409-2641e9e4db79"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Optimize"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"722def5b-986f-4b11-a1b8-dc203245160f"}}},{"cell_type":"code","source":["deltaTable.optimize()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"8e81321d-ff5d-4530-a7c8-ca155b580cc6"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["deltaTable.optimize().where(\"FISC_YEAR='2011'\").executeCompaction()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b89f425c-3ea5-4d4c-95cd-bf2187b42eba"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Vacuum"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"74ebec66-2c5a-4b48-a49f-cc6b65ec7eae"}}},{"cell_type":"code","source":["# versions older than the default retention period\ndeltaTable.vacuum() "],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"6731821b-7921-4815-92a6-337bd3d26bf2"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["# not required by versions more than 100 hours old\ndeltaTable.vacuum(100) "],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5125f06e-a0e5-4f39-8abe-955bebb5baf2"}},"outputs":[],"execution_count":0}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"Ch-12-DeltaPerformance","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2,"mostRecentlyExecutedCommandWithImplicitDF":{"commandId":844573653885291,"dataframes":["_sqldf"]}},"language":"python","widgets":{},"notebookOrigID":844573653885265}},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Chapter12/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter12/README.md -------------------------------------------------------------------------------- /Chapter13/Ch-13-MappingYourDeltaJourney.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":["# Clean prior run data files"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bdec79ed-4c9d-4e65-9167-db5668b7ea9b"}}},{"cell_type":"code","source":["dbutils.fs.rm('/tmp/ch-13/', True)\n\n# Drop & recreate database\nspark.sql(\"DROP DATABASE IF EXISTS ch_13 CASCADE\")\nspark.sql(\"CREATE DATABASE ch_13 \")\nspark.sql(\"USE ch_13\")\n\n# Configure Path\nDELTALAKE_PATH = \"/tmp/ch-13/data\"\n\n# Remove table if it exists\ndbutils.fs.rm(DELTALAKE_PATH, recurse=True)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f3c0b859-d57b-46d3-8fdf-116b20670537"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["# Migration to Delta Operational Simplification\n* No ned to Refresh Table as Delta always return the most up-to-date information\n* No ned to run Table repair commands such as MSCK or babysit partition creation/deletion\n* Because listing large number of files in a directory is often slower than reading the list of files from the transaction log, you can use gneric WHERE clause instead of trying to optimize by loading single partitions explicitly\n * Use spark.read.delta(\"/data\").where(\"partitionKey = 'value'\")\n * as opposed to spark.read.format(\"parquet\").load(\"/data/partitionKey=value\")\n* The delta transactional log is the source of truth for the state of data\n * Do not manually manipulate the data/log files as you can invalidate the data\n* Use the language, services, connectors, or database of your choice with Delta Lake and Delta Sharing. \n * https://delta.io/integrations/"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bfa20dbb-3642-4b4e-bc35-6913a757bbbb"}}},{"cell_type":"markdown","source":["# Do you have the buy in for your Data Initiative?\n* Is the Business Use Case and value proposition clearly articulated?\n* Have businesss requirements been captured, prioritized and agreed upon? (Functional & Non-Functional)\n* Are stakeholders and champions vessted with a timeline and budget?"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"37a6e57d-b808-484d-9490-7274796d27c5"}}},{"cell_type":"markdown","source":["# Developing your data product/service for production\n* Have businesss requirements been mapped to clear technical Requirements (Functional & Non-Functional)\n* Access to Data?\n * Identify sensitive data sets\n * Ascrtain user priveleges\n* Are folks available and trained to work on the big data initiative?\n * Identify enablement and training needs\n* For a multi-tenancy scenario, have all access considerations been accounted for?\n * Understanding the collaboration and IIsolation needs of data teams"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b4d3b9e8-2f0c-4c22-a668-93be9e712b08"}}},{"cell_type":"markdown","source":["# Data Migration Plays\n* Use Ventor tools to automate bulk of Migrating from other data systems such as Hadoop, Oracle, Netezza, Teradata\n * Data storage\n * Metadata storage\n * Code migration around compatible libraries and APIs\n * Data processing and transformations\n * Security\n * Orchestration of jobs and workflows\n* Fix workloads that cannot be automated\n* Establish performance benchmarks before and after\n* Tune the expensive workloads\n* Run workloads in parallel before switching off older systems"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"e98edf7f-515e-4b9c-a58b-d683a5556c6a"}}},{"cell_type":"markdown","source":["* Convert a Parquet table to Delta\n * CONVERT TO DELTA ``\n* Convert files to Delta format and create a table using that data\n * CONVERT TO DELTA parquet.<`/data-path/`>\n * CREATE TABLE `` USING DELTA LOCATION <’/data-path/’>\n* Convert a non-Parquet format such as ORC to Parquet and then to Delta\n * Read dataframe as orc and save as parquet\n * CREATE TABLE `` USING PARQUET LOCATION <’/data-path/’>\n * CONVERT TO DELTA ``\n* Generate a manifest file that can be read by other processing engines\n * GENERATE symlink_format_manifest FOR TABLE ``"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"9233cfc0-7150-4703-a4d8-ee5fadb7ed42"}}},{"cell_type":"markdown","source":["## Parquet to Delta\n* Conversion in place\n* Creates a Delta Lake transaction log that tracks all files in provided directory, automatically infers the schema\n* Collects statistics to improve query performance on the converted Delta table. \n* If table name is provided, the metastore is also updated \n* Run Additional OPTIMIZE/ZORDER for better performance on Delta\n* Caution\n * Avoid changes to the data files during the conversion process. \n * If multiple external tables share the same underlying Parquet directory, all of them should be converted\n * Turning off stats collection using NO STATISTICS will hasten the conversion process\n * If table has partitions, use the PARTITIONED BY clause on appropriate column/s"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"4fa4cfba-646a-439e-81e8-10f0c8a71921"}}},{"cell_type":"code","source":["#simulate a parquet data path and parquet table\ncolumns = [\"State\",\"Name\", \"Age\"]\ndata = [(\"TX\",\"Jack\", 25), (\"NV\",\"Jane\",66), (\"CO\",\"Bill\",79),(\"CA\",\"Tom\",53), (\"WY\",\"Shawn\",45)]\nage_df = spark.sparkContext.parallelize(data).toDF(columns)\nage_location = DELTALAKE_PATH+'/demographic'\nage_df.write.format('parquet').save(age_location)\n\ns_sql = \"CREATE TABLE IF NOT EXISTS demographic USING parquet LOCATION '\" + age_location + \"'\"\nspark.sql(s_sql)\nspark.sql('DESCRIBE EXTENDED demographic')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"81f0d086-b555-4881-ac32-76e13a4a1de8"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["df = spark.sql('DESCRIBE EXTENDED demographic')\ndf.filter(df.col_name.like('Provider')).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"92f007ee-5487-4993-89dc-ddf3e6332e39"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["# Convert a Parquet table to Delta\nspark.sql('CONVERT TO DELTA demographic')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"f3f22778-549a-45ad-8052-68c489502e80"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["df = spark.sql('DESCRIBE EXTENDED demographic')\ndf.filter(df.col_name.like('Provider')).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"bded19f8-68af-48a5-b5c4-55f6807643e6"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Non-Parquet, Eg. ORC to Delta\n* first convert to parquet, then to delta"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"5a931578-b0be-48b8-b9f3-be759e880570"}}},{"cell_type":"code","source":["age_location_orc = DELTALAKE_PATH+'/demographic_orc'\nage_df.write.format('orc').save(age_location_orc)\n\ns_sql = \"CREATE TABLE IF NOT EXISTS demographic_orc USING ORC LOCATION '\" + age_location_orc + \"'\"\nspark.sql(s_sql)\n\ndf_orc = spark.sql('DESCRIBE EXTENDED demographic_orc')\ndf_orc.filter(df_orc.col_name.like('Provider')).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b75745d3-b8c4-4bf0-876a-8dfeaab42a85"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["orc_df = spark.sql(\"SELECT * FROM demographic_orc\")\nage_location_orc_pq = DELTALAKE_PATH+'/demographic_orc_pq'\norc_df.write.format('parquet').save(age_location_orc_pq)\ns_sql = \"CREATE TABLE IF NOT EXISTS demographic_orc_pq USING parquet LOCATION '\" + age_location_orc_pq + \"'\"\nspark.sql(s_sql)\n\ndf_orc_pq = spark.sql('DESCRIBE EXTENDED demographic_orc_pq')\ndf_orc_pq.filter(df_orc_pq.col_name.like('Provider')).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2b70a5f6-03d2-4d66-ad2e-f47ec862a1f1"}},"outputs":[],"execution_count":0},{"cell_type":"code","source":["spark.sql('CONVERT TO DELTA demographic_orc_pq')\ndf_orc_pq_delta = spark.sql('DESCRIBE EXTENDED demographic_orc_pq')\ndf_orc_pq_delta.filter(df_orc_pq_delta.col_name.like('Provider')).show()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"24120f92-358d-41bb-a850-a1a31d648dd3"}},"outputs":[],"execution_count":0},{"cell_type":"markdown","source":["## Undo Conversion Operation\n* VACUUM delta.`` RETAIN 0 HOURS\n* Delete the /_delta_log directory."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"ea68ce5b-271c-441a-8827-0c6b09e34ea8"}}},{"cell_type":"markdown","source":["# Capacity Planning\n* Understand Consumption patterns\n * Number of total users & concurrent users\n * Benchmark known and reepresentative queries\n* Understand Data Volumes & Processing needs \n * per day/per month/per year\n * Establish a yarly projection\n * Extend to multi-year with buffer for growth of use cases \n* Benchmarking to ascertain Cluster type and sizing\n* How many environments ?\n * Dev/Staging/Prod ?\n * Planning for Disaster Recovery"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"eee92f3c-09b0-4188-bde8-ee7403639538"}}},{"cell_type":"markdown","source":["# Data Democratization\n* Via Policy & Process\n* Identify needs for Delta Data Sharing"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"595defa6-0149-4571-8f6a-c3f394a04b6c"}}},{"cell_type":"markdown","source":["# Managing & Monitoring\n* Audit Logs\n* Cluster Logs\n* Spark Metrics\n* Sytem Metrics\n* Custom Logging"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1c2c3fc0-6c80-49ff-acfe-0e9255dd466a"}}},{"cell_type":"markdown","source":["# Establish COE\n* Responsible for\n * Infrastructure, on-prem/cloud strategy\n * Centralized Governance and Security\n * Approving architecture blueprints\n * Enablement/Training\n * Reporting on usage and chargeback\n * Automation"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"167fb4a8-419c-4915-8ad6-d0b00bfc6a21"}}}],"metadata":{"application/vnd.databricks.v1+notebook":{"notebookName":"Ch-13-MappingYourDeltaJourney","dashboards":[],"notebookMetadata":{"pythonIndentUnit":2,"mostRecentlyExecutedCommandWithImplicitDF":{"commandId":1267259906084475,"dataframes":["_sqldf"]}},"language":"python","widgets":{},"notebookOrigID":844573653885267}},"nbformat":4,"nbformat_minor":0} 2 | -------------------------------------------------------------------------------- /Chapter13/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PacktPublishing/Simplifying-Data-Engineering-and-Analytics-with-Delta/5a427fe53b2372030161a57e3485b44757a90149/Chapter13/README.md -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Packt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ### [Packt Conference : Put Generative AI to work on Oct 11-13 (Virtual)](https://packt.link/JGIEY) 3 | 4 |

[![Packt Conference](https://hub.packtpub.com/wp-content/uploads/2023/08/put-generative-ai-to-work-packt.png)](https://packt.link/JGIEY)

5 | 3 Days, 20+ AI Experts, 25+ Workshops and Power Talks 6 | 7 | Code: USD75OFF 8 | 9 | 10 | 11 | 12 | # Simplifying Data Engineering and Analytics with Delta 13 | 14 | Simplifying Data Engineering and Analytics with Delta 15 | 16 | This is the code repository for [Simplifying Data Engineering and Analytics with Delta](https://www.packtpub.com/product/simplifying-data-engineering-and-analytics-with-delta/9781801814867?utm_source=github&utm_medium=repository&utm_campaign=9781801814867), published by Packt. 17 | 18 | **Create analytics-ready data that fuels artificial intelligence and business intelligence** 19 | 20 | ## What is this book about? 21 | Delta helps you generate reliable insights at scale and simplifies architecture around data pipelines, allowing you to focus primarily on refining the use cases being worked on. This is especially important when you consider that existing architecture is frequently reused for new use cases 22 | 23 | This book covers the following exciting features: 24 | * Explore the key challenges of traditional data lakes 25 | * Appreciate the unique features of Delta that come out of the box 26 | * Address reliability, performance, and governance concerns using Delta 27 | * Analyze the open data format for an extensible and pluggable architecture 28 | * Handle multiple use cases to support BI, AI, streaming, and data discovery 29 | * Discover how common data and machine learning design patterns are executed on Delta 30 | * Build and deploy data and machine learning pipelines at scale using Delta 31 | 32 | If you feel this book is for you, get your [copy](https://www.amazon.com/dp/B09NC5XJ6D) today! 33 | 34 | https://www.packtpub.com/ 36 | 37 | 38 | ## Instructions and Navigations 39 | All of the code is organized into folders. 40 | 41 | The code will look like the following: 42 | ``` 43 | SELECT COUNT(*) FROM some _ parquet _ table 44 | ``` 45 | 46 | **Following is what you need for this book:** 47 | Data engineers, data scientists, ML practitioners, BI analysts, or anyone in the data domain working with big data will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book. 48 | 49 | With the following software and hardware list you can run all code files present in the book (Chapter 1-13). 50 | 51 | ### Software and Hardware List 52 | 53 | 54 | Basic knowledge of SQL, Python programming, and Spark is required to get the most 55 | out of this book. Delta is open source and can be run both on-prem and in the cloud. 56 | Because of the rise in cloud data platforms, a lot of the descriptions and examples are in 57 | the context of cloud storage. 58 | Use the following GitHub link for the Delta Lake documentation and quickstart guide 59 | to help you set up your environment and become familiar with the necessary APIs: 60 | https://github.com/delta-io/delta. 61 | Databricks is the original creator of Delta, which was open sourced to the Linux 62 | Foundation and is supported by a large user community. Examples in this book cover 63 | some Databricks-specific features to provide a complete view of features and capabilities. 64 | Newer features continue to be ported from Databricks to open source Delta. Please refer 65 | to the proposed roadmap for the feature migration details: https://github.com/ 66 | delta-io/delta/issues/920. 67 | 68 | 69 | We also provide a PDF file that has color images of the screenshots/diagrams used in this book. [Click here to download it](https://packt.link/UI11F). 70 | 71 | 72 | ### Related products 73 | * Data Engineering with AWS [[Packt]](https://www.packtpub.com/product/data-engineering-with-aws/9781800560413?utm_source=github&utm_medium=repository&utm_campaign=9781800560413) [[Amazon]](https://www.amazon.com/dp/B09C2MN5DV) 74 | 75 | * Data Engineering with Apache Spark, Delta Lake, and Lakehouse [[Packt]](https://www.packtpub.com/product/data-engineering-with-apache-spark-delta-lake-and-lakehouse/9781801077743?utm_source=github&utm_medium=repository&utm_campaign=9781801077743) [[Amazon]](https://www.amazon.com/dp/B098X63L4V) 76 | 77 | ## Get to Know the Author 78 | **Anindita Mahapatra** 79 | is a lead solutions architect at Databricks in the data and AI space helping clients across all industry verticals reap value from their data infrastructure investments. She teaches a data engineering and analytics course at Harvard University as part of their extension school program. She has extensive big data and Hadoop consulting experience from Think Big/Teradata, prior to which she was managing the development of algorithmic app discovery and promotion for both Nokia and Microsoft stores. She holds a master’s degree in liberal arts and management from Harvard Extension School, a master’s in computer science from Boston University, and a bachelor’s in computer science from BITS Pilani, India. 80 | ### Download a free PDF 81 | 82 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.
83 |

https://packt.link/free-ebook/9781801814867

--------------------------------------------------------------------------------