├── PYSPARK_TESTING
    └── PYT01 - Unit Testing.py
├── PySpark_ETL
    ├── PS00-Introduction.py
    ├── PS01-Read Files.py
    ├── PS02-Schema Handling.py
    ├── PS03-Creating Dataframes.py
    ├── PS04-Basic Transformation.py
    ├── PS05-Handling JSON.py
    ├── PS06-JOINS.py
    ├── PS07-Grouping & Aggregation.py
    ├── PS08-Ordering Data.py
    ├── PS09-String Functions.py
    ├── PS10-Date & Time Functions.py
    ├── PS11-Partitioning & Repartitioning.py
    ├── PS12-Missing Value Handling.py
    ├── PS13-Deduplication.py
    ├── PS14-Data Profiling using PySpark.py
    ├── PS15-Data Caching.py
    ├── PS16-User Defined Functions.py
    ├── PS17-Write Data.py
    └── Z01- Case Study Sales Order Analysis.py
├── README.md
├── SETUP
    ├── _clean_up.py
    ├── _initial_setup.py
    ├── _pyspark_clean_up.py
    ├── _pyspark_init_setup.py
    ├── _pyspark_setup_files.py
    ├── _setup_database.py
    └── _setup_demo_table.py
└── SQL Refresher
    ├── PS000-INTRODUCTION.py
    ├── SR000-Introduction.py
    ├── SR001-Basic CRUD.py
    ├── SR002-Select & Filtering.py
    ├── SR003-JOINS.py
    ├── SR004-Order & Grouping.py
    ├── SR005-Sub Queries.py
    ├── SR006-Views & Temp Views.py
    ├── SR007-Common Table Expressions.py
    ├── SR008 - EXCEPT, UNION, UNION ALL, INTERSECTION.py
    ├── SR009-External Tables.py
    ├── SR010-Drop database & tables.py
    ├── SR011-Check Table & Database Details.py
    └── SR012-Versioning, Time Travel & Optimization.py


/PYSPARK_TESTING/PYT01 - Unit Testing.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC 
  4 | # MAGIC ### What is unit testing?
  5 | # MAGIC UNIT TESTING is a type of software testing where individual units or components of a software are tested. The purpose is to validate that each unit of the software code performs as expected. Unit Testing is done during the development (coding phase) of an application by the developers.
  6 | # MAGIC 
  7 | # MAGIC unit testing
  8 | # MAGIC By
  9 | # MAGIC TechTarget Contributor
 10 | # MAGIC Unit testing is a software development process in which the smallest testable parts of an application, called units, are individually and independently scrutinized for proper operation. This testing methodology is done during the development process by the software developers and sometimes QA staff.  The main objective of unit testing is to isolate written code to test and determine if it works as intended.
 11 | # MAGIC 
 12 | # MAGIC Unit testing is an important step in the development process, because if done correctly, it can help detect early flaws in code which may be more difficult to find in later testing stages.
 13 | # MAGIC 
 14 | # MAGIC Unit testing is a component of test-driven development (TDD), a pragmatic methodology that takes a meticulous approach to building a product by means of continual testing and revision. This testing method is also the first level of software testing, which is performed before other testing methods such as integration testing. Unit tests are typically isolated to ensure a unit does not rely on any external code or functions. Testing can be done manually but is often automated.
 15 | # MAGIC 
 16 | # MAGIC How unit tests work
 17 | # MAGIC A unit test typically comprises of three stages: plan, cases and scripting and the unit test itself. In the first step, the unit test is prepared and reviewed. The next step is for the test cases and scripts to be made, then the code is tested.
 18 | # MAGIC 
 19 | # MAGIC Test-driven development requires that developers first write failing unit tests. Then they write code and refactor the application until the test passes. TDD typically results in an explicit and predictable code base.
 20 | # MAGIC 
 21 | # MAGIC 
 22 | # MAGIC Each test case is tested independently in an isolated environment, as to ensure a lack of dependencies in the code. The software developer should code criteria to verify each test case, and a testing framework can be used to report any failed tests. Developers should not make a test for every line of code, as this may take up too much time. Developers should then create tests focusing on code which could affect the behavior of the software being developed.
 23 | # MAGIC 
 24 | # MAGIC Unit testing involves only those characteristics that are vital to the performance of the unit under test. This encourages developers to modify the source code without immediate concerns about how such changes might affect the functioning of other units or the program as a whole. Once all of the units in a program have been found to be working in the most efficient and error-free manner possible, larger components of the program can be evaluated by means of integration testing. Unit tests should be performed frequently, and can be done manually or can be automated.
 25 | # MAGIC 
 26 | # MAGIC ### Types of unit testing
 27 | # MAGIC * Unit tests can be performed manually or automated. Those employing a manual method may have an instinctual document made detailing each step in the process; however, automated testing is the more common method to unit tests. Automated approaches commonly use a testing framework to develop test cases. These frameworks are also set to flag and report any failed test cases while also providing a summary of test cases.
 28 | # MAGIC 
 29 | # MAGIC ### Advantages and disadvantages of unit testing
 30 | # MAGIC Advantages to unit testing include:                              
 31 | # MAGIC * The earlier a problem is identified, the fewer compound errors occur.
 32 | # MAGIC * Costs of fixing a problem early can quickly outweigh the cost of fixing it later.
 33 | # MAGIC * Debugging processes are made easier.
 34 | # MAGIC * Developers can quickly make changes to the code base.
 35 | # MAGIC * Developers can also re-use code, migrating it to new projects.
 36 | # MAGIC 
 37 | # MAGIC ### Concepts in an object-oriented way for Python Unittest
 38 | # MAGIC * Test fixture- the preparation necessary to carry out test(s) and related cleanup actions.
 39 | # MAGIC * Test case- the individual unit of testing.
 40 | # MAGIC * A Test suite- collection of test cases, test suites, or both.
 41 | # MAGIC * Test runner- component for organizing the execution of tests and for delivering the outcome to the user.
 42 | # MAGIC 
 43 | # MAGIC __*always start unittest function name with "test_".*__
 44 | 
 45 | # COMMAND ----------
 46 | 
 47 | #Creating Add function for addition
 48 | def add(a, b):
 49 |     return a + b
 50 | #Creating multi function for multiplication
 51 | def multi(a,b):
 52 |     return a*b
 53 | 
 54 | def subt(a,b):
 55 |     return a-b
 56 | 
 57 | # COMMAND ----------
 58 | 
 59 | import unittest 
 60 | 
 61 | #Creating Class for Unit Testing
 62 | class test_class(unittest.TestCase):
 63 |     def test_add(self):
 64 |           self.assertEqual(10, add(7, 3))
 65 |             
 66 |     def test_multi(self):
 67 |           self.assertEqual(25,multi(5,5))
 68 |             
 69 |     @unittest.skip("OBSELETE METHOD")
 70 |     def test_multi(self):
 71 |           self.assertEqual(25,multi(5,5))
 72 |     
 73 | # create a test suite  using loadTestsFromTestCase()
 74 | suite = unittest.TestLoader().loadTestsFromTestCase(test_class)
 75 | #Running test cases using Test Cases Suit.
 76 | p = (unittest.TextTestRunner(verbosity=2).run(suite))
 77 | 
 78 | # COMMAND ----------
 79 | 
 80 | # MAGIC %md
 81 | # MAGIC ### Demo
 82 | # MAGIC Let's perform unit testing with real dataset.
 83 | 
 84 | # COMMAND ----------
 85 | 
 86 | # MAGIC %md
 87 | # MAGIC ### Read Data
 88 | 
 89 | # COMMAND ----------
 90 | 
 91 | # MAGIC %run ../SETUP/_pyspark_init_setup
 92 | 
 93 | # COMMAND ----------
 94 | 
 95 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
 96 | df_od = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderdetails.csv")
 97 | df_st = spark.read.option("header", "true").csv("/FileStore/datasets/sales/salestarget.csv")
 98 | 
 99 | # COMMAND ----------
100 | 
101 | from pyspark.sql.functions import col, round
102 | 
103 | # COMMAND ----------
104 | 
105 | # MAGIC %md
106 | # MAGIC ### joining order list and order details.
107 | 
108 | # COMMAND ----------
109 | 
110 | df_join = df_ol\
111 |         .join(df_od, df_ol["Order ID"]==df_od["Order ID"], "inner")\
112 |         .withColumn("Profit", col("Profit").cast("decimal(10,2)"))\
113 |         .withColumn("Amount", col("Amount").cast("decimal(10,2)"))\
114 |         .select(df_ol["Order ID"], "Amount", "Profit", "Quantity", "Category", "State", "City")\
115 |         .limit(50)
116 | display(df_join)
117 | 
118 | # COMMAND ----------
119 | 
120 | # MAGIC %md
121 | # MAGIC ### Preparing smaller dataset for unit testing
122 | 
123 | # COMMAND ----------
124 | 
125 | df_ut = df_join.limit(50)  # defining dataframe to perform our unit test. We will tes our functions for a smaller dataset that is why we are only taking 50 records.
126 | 
127 | display(df_ut)
128 | 
129 | # COMMAND ----------
130 | 
131 | # MAGIC %md
132 | # MAGIC ### Functions to calculate total & average sales
133 | 
134 | # COMMAND ----------
135 | 
136 | # function calculates average profit for each state, round upto 2 decimal digits
137 | def state_avg_profit(df):
138 |     return df.groupBy("State").mean("Profit").withColumn("avg(Profit)", round(col("avg(Profit)"), 2)).orderBy("State")
139 | 
140 | # calculates total sales for each state
141 | def state_total_sales(df):
142 |     return df.groupBy("State").sum("Amount").withColumn("sum(Amount)", round(col("sum(Amount)"), 2)).orderBy("State")
143 | 
144 | # COMMAND ----------
145 | 
146 | # df_avg_profit consists average sale for each state. Calculating average & total sales for our testing dataframe
147 | df_avg_profit = state_avg_profit(df_ut)
148 | display(df_avg_profit)
149 | 
150 | df_total_sales = state_total_sales(df_ut)
151 | display(df_total_sales)
152 | 
153 | # COMMAND ----------
154 | 
155 | # MAGIC %md
156 | # MAGIC ###  Now our aim is to test our state_avg_profit() function.
157 | # MAGIC We will compare Gujarat's average sale & total sale. for testing purpose I have calculated Gujarat's average profit in an excel sheet(for out unit testing dataframe (50 records) ). It should be -225.6.
158 | 
159 | # COMMAND ----------
160 | 
161 | # Defining variable which include average & total sales of gujarat.
162 | 
163 | # COMMAND ----------
164 | 
165 | avg_sale_guj = float(df_avg_profit.where(col("State")=="Gujarat" ).collect()[0]["avg(Profit)"])
166 | total_sale_guj = float(df_total_sales.where(col("State")=="Gujarat" ).collect()[0]["sum(Amount)"])
167 | 
168 | # COMMAND ----------
169 | 
170 | # MAGIC %md
171 | # MAGIC ### Defining unit test  & performing 
172 | 
173 | # COMMAND ----------
174 | 
175 | import unittest 
176 | 
177 | #Creating Class for Unit Testing
178 | class sales_unit_class(unittest.TestCase):
179 |     
180 |     def test_avg_profit(self):
181 |         expected_avg_sales = -225.60 # This is the value which is expected we calculated using excel
182 |         self.assertEqual(expected_avg_sales, avg_sale_guj)
183 |     
184 |     def test_record_count(self):
185 |         # this is the expected sale, I have put wrong value to fail this test. Correct value should be 1782.0
186 |         #If you will change it to correct value, it will pass the unit test.
187 |         expected_total_sales = 1789.0 
188 |         self.assertEqual(expected_total_sales, total_sale_guj)
189 |     
190 | # create a test suite  for test_class using  loadTestsFromTestCase()
191 | suite = unittest.TestLoader().loadTestsFromTestCase(sales_unit_class)
192 | #Running test cases using Test Cases Suit..
193 | unittest.TextTestRunner(verbosity=2).run(suite)
194 | 
195 | # COMMAND ----------
196 | 
197 | # MAGIC %md
198 | # MAGIC ### Attention:
199 | # MAGIC __*This is for demo purpose that is why we are writing unit testing in the same notebook but it is always prefered to write unit testin in a separate notebook.*__
200 | 
201 | # COMMAND ----------
202 | 
203 | # MAGIC %run ../SETUP/_pyspark_clean_up
204 | 
205 | # COMMAND ----------
206 | 
207 | 
208 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS00-Introduction.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### PySpark ETL
 4 | # MAGIC In this tutorial we will learn basic ETL pipelines using pyspark. ETL stands for Extract, Transform & Load.
 5 | # MAGIC 
 6 | # MAGIC * Extract - Extracting data from a source to your storage. e.g csv to dataframe
 7 | # MAGIC * Tranform - applying transformation e.g. adding new column, rename column, change type, drop columns, derived columns, filter operations, joins etc.
 8 | # MAGIC * Load - Load back processed or transformed data to the sink/destination storage.
 9 | # MAGIC 
10 | # MAGIC All the code is available in github, feel free to pull or fork the repository:
11 | # MAGIC 
12 | # MAGIC __https://github.com/martandsingh/ApacheSpark__
13 | # MAGIC 
14 | # MAGIC follow me on linkedin for more updates:
15 | # MAGIC 
16 | # MAGIC __https://www.linkedin.com/in/martandsays/__
17 | # MAGIC 
18 | # MAGIC ![PYSPARK_ETL](https://raw.githubusercontent.com/martandsingh/images/master/etl_banner.gif)
19 | 
20 | # COMMAND ----------
21 | 
22 | 
23 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS01-Read Files.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### Read CSV file using PySpark
  4 | # MAGIC 
  5 | # MAGIC ### What is SparkContext?
  6 | # MAGIC A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. It is Main entry point for Spark functionality.
  7 | # MAGIC 
  8 | # MAGIC *Note: Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.*
  9 | # MAGIC 
 10 | # MAGIC ### What is SparkSession?
 11 | # MAGIC SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.
 12 | # MAGIC 
 13 | # MAGIC As a Spark developer, you create a SparkSession using the SparkSession.builder method (that gives you access to Builder API that you use to configure the session).
 14 | # MAGIC 
 15 | # MAGIC 
 16 | # MAGIC ### Spark Context Vs Spark Session
 17 | # MAGIC SparkSession vs SparkContext – Since earlier versions of Spark or Pyspark, SparkContext (JavaSparkContext for Java) is an entry point to Spark programming with RDD and to connect to Spark Cluster, Since Spark 2.0 SparkSession has been introduced and became an entry point to start programming with DataFrame and Dataset.
 18 | # MAGIC 
 19 | # MAGIC 
 20 | # MAGIC By default, Databricks notebook provides a spark context object named "spark". This is the prebuild context object, we can use it directly.
 21 | # MAGIC ![PYSPARK_CSV](https://raw.githubusercontent.com/martandsingh/images/master/pyspark-read-csv.png)
 22 | 
 23 | # COMMAND ----------
 24 | 
 25 | spark
 26 | 
 27 | # COMMAND ----------
 28 | 
 29 | # MAGIC %run ../SETUP/_pyspark_init_setup
 30 | 
 31 | # COMMAND ----------
 32 | 
 33 | # MAGIC %md
 34 | # MAGIC ### Read CSV file
 35 | # MAGIC We will read a CSV file from our __DBFS (Databricks File Storage)__. I have upload my CSV file to FileStore/tables/cancer.csv
 36 | # MAGIC You can find all the dataset used in these tutorials at https://github.com/martandsingh/datasets.
 37 | 
 38 | # COMMAND ----------
 39 | 
 40 | # Import CSV file. For csv file, by default delimiter is comma so no need to mention in case of comma separated values. But for other delmiter you have to mention it.
 41 | df_csv = spark \
 42 |     .read \
 43 |     .option("encoding", "UTF-8") \
 44 |     .option("delmiter", ",") \
 45 |     .option("header", "True") \
 46 |     .csv("/FileStore/datasets/cancer.csv")
 47 | 
 48 | # COMMAND ----------
 49 | 
 50 | # display function will visualize your dataset in a beautiful way. This is databricks notebook function which will not work in your spark scripts.
 51 | display(df_csv)
 52 | 
 53 | # COMMAND ----------
 54 | 
 55 | df_csv.show() # prints top 20 records. It does not return anyting.
 56 | 
 57 | # COMMAND ----------
 58 | 
 59 | df_csv.show(2) # showing only top 2 rows with header.
 60 | 
 61 | # COMMAND ----------
 62 | 
 63 | li = df_csv.take(2) # this will return a list containing 2 rows
 64 | print(li)
 65 | 
 66 | # COMMAND ----------
 67 | 
 68 | # you can access the list
 69 | print(len(li))
 70 | print(li[0]["State"])
 71 | 
 72 | # COMMAND ----------
 73 | 
 74 | all_record = df_csv.collect() # return whole dataset. Do not use it until you need it. If you have very big dataset, this will take a huge time & computation to finish.
 75 | #print(all_record) # I am commenting the command, you may uncomment and run, but make sure you do not have a very big dataset
 76 | 
 77 | # COMMAND ----------
 78 | 
 79 | # MAGIC %md
 80 | # MAGIC ### READ JSON
 81 | 
 82 | # COMMAND ----------
 83 | 
 84 | df_json_sales = spark.read.option("multiline", "true").json("/FileStore/datasets/unece.json")
 85 | 
 86 | # COMMAND ----------
 87 | 
 88 | display(df_json_sales)
 89 | 
 90 | # COMMAND ----------
 91 | 
 92 | # MAGIC %md
 93 | # MAGIC Now we just loaded a JSON file to our spark dataframe. Let's try one more...
 94 | 
 95 | # COMMAND ----------
 96 | 
 97 | df_json = spark.read.option("multiline", "true").json("/FileStore/datasets/used_cars_nested.json")
 98 | 
 99 | # COMMAND ----------
100 | 
101 | display(df_json)
102 | 
103 | # COMMAND ----------
104 | 
105 | # MAGIC %md
106 | # MAGIC ### Oopss....
107 | # MAGIC What kind of json is this? 
108 | # MAGIC 
109 | # MAGIC You must be thinking the same thing. But actually, this is the correct behaviour. If you will compare the structure of our Sales.json & used_cars_nested.json, you will see that later one is nested JSON with complicate structure & complicate means you have to do some more work to clean this json to convert it into a tabular form. In real life, you will not get a simple json like Sales.json, most of the time real life datasets are much complicate. 
110 | # MAGIC 
111 | # MAGIC *__So in our next notebook we will see how to deal with nested or complex json files.__*
112 | 
113 | # COMMAND ----------
114 | 
115 | # MAGIC %md
116 | # MAGIC ### Read Parquet File
117 | # MAGIC ### What is Parquet?
118 | # MAGIC Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC.
119 | # MAGIC #### Characteristics of Parquet
120 | # MAGIC 1. Free and open source file format.
121 | # MAGIC 1. Language agnostic.
122 | # MAGIC 1. Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries.
123 | # MAGIC 1. Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.
124 | # MAGIC 1. Highly efficient data compression and decompression.
125 | # MAGIC 1. Supports complex data types and advanced nested data structures.
126 | # MAGIC 
127 | # MAGIC #### Benefits of Parquet
128 | # MAGIC 1. Good for storing big data of any kind (structured data tables, images, videos, documents).
129 | # MAGIC 1. Saves on cloud storage space by using highly efficient column-wise compression, and flexible encoding schemes for columns with different data types.
130 | # MAGIC 1. Increased data throughput and performance using techniques like data skipping, whereby queries that fetch specific column values need not read the entire row of data.
131 | 
132 | # COMMAND ----------
133 | 
134 | df_par = spark.read.parquet("/FileStore/datasets/USED_CAR_PARQUET/")
135 | display(df_par)
136 | 
137 | # COMMAND ----------
138 | 
139 | # MAGIC %run ../SETUP/_pyspark_clean_up
140 | 
141 | # COMMAND ----------
142 | 
143 | # MAGIC %md
144 | # MAGIC In this notebook, we simply learnt how to read different format files. This was very basic notebook, so you may wondering is it worth creating a separate notebook for reading files? I would say "yes!!", as databricks is a vast technology growing frequently. So in future there me othere use cases which we can add in our "Read Files" notebook. So it is worth creating a notebook for it.
145 | # MAGIC 
146 | # MAGIC ### Assignment: 
147 | # MAGIC 1. Try loading JSON (simple & nested). We will learn more about nested json in our next notebook.
148 | # MAGIC 2. Try loading TSV(tab separated file) or any other flat file has a delimiter other than comma(',')
149 | 
150 | # COMMAND ----------
151 | 
152 | 
153 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS02-Schema Handling.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md 
  3 | # MAGIC ### Schema Handling
  4 | # MAGIC In this notebook, we will learn:
  5 | # MAGIC 1. What is inferSchema?
  6 | # MAGIC 1. How to check dataframe schema?
  7 | # MAGIC 1. How to define custom schema?
  8 | # MAGIC 
  9 | # MAGIC ### Whats is inferSchema?
 10 | # MAGIC Infer schema will automatically guess the data types for each field. If we set this option to TRUE, the API will read some sample records from the file to infer the schema. 
 11 | # MAGIC 
 12 | # MAGIC InferSchema option is __false by default__ that is why all the columns are string by default. By setting __inferSchema=true__, Spark will automatically go through the csv file and infer the schema of each column. This requires an extra pass over the file which will result in reading a file with __inferSchema set to true being slower__. But in return the dataframe will most likely have a correct schema given its input.
 13 | # MAGIC 
 14 | # MAGIC ### How to check dataframe schema?
 15 | # MAGIC To check dataframe schema you can simply use printSchema().
 16 | # MAGIC 
 17 | # MAGIC example: df.printSchema()
 18 | 
 19 | # COMMAND ----------
 20 | 
 21 | # MAGIC %run ../SETUP/_pyspark_init_setup
 22 | 
 23 | # COMMAND ----------
 24 | 
 25 | from pyspark.sql.types import StringType, IntegerType, StructField, StructType
 26 | 
 27 | # COMMAND ----------
 28 | 
 29 | # MAGIC %md
 30 | # MAGIC ### Method 1: Inferschema
 31 | 
 32 | # COMMAND ----------
 33 | 
 34 | df_infer = spark \
 35 |             .read \
 36 |             .option("header", "true") \
 37 |             .csv("/FileStore/datasets/cancer.csv")
 38 | 
 39 | df_infer.printSchema()
 40 | 
 41 | # You can see all the columns has string type as inferSchema is set false by default.
 42 | 
 43 | # COMMAND ----------
 44 | 
 45 | df_infer = spark \
 46 |             .read \
 47 |             .option("header", "true") \
 48 |             .option("inferSchema", "true")\
 49 |             .csv("/FileStore/datasets/cancer.csv")
 50 | 
 51 | df_infer.printSchema()
 52 | # So now you will see datatypes are different, schema will check all the data and will choose corrct type for each columns. Keep in mind this option is slower as spark has to traverse whole data to identity data type. So avoid this for big data sets
 53 | 
 54 | # COMMAND ----------
 55 | 
 56 | # MAGIC %md
 57 | # MAGIC ### Method 2: Custom Schema
 58 | 
 59 | # COMMAND ----------
 60 | 
 61 | # Let's create a test dataframe & a custom schema. We can you the csv file to create dataframe also but just for the demo I want to use smaller number of columns which will help you to understand better & save time also.
 62 | 
 63 | # Create python list of data
 64 | data = [("James","","Smith","36636","M",50,3000),
 65 |     ("Michael","Rose","","40288","M",43,4000),
 66 |     ("Robert","","Williams","42114","M",23,4000),
 67 |     ("Maria","Anne","Jones","39192","F",56,4000),
 68 |     ("Jen","Mary","Brown","33341","F",34,1500)
 69 |   ]
 70 | 
 71 | # defining custom schema. Your schema will be StructType of StructField array. then you can define any custom name for your column & type.
 72 | # StructField("middlename",StringType(),True) - this means our column name is middlename which is String type. The third True parameter determines whether the column can contain null values or not? True mean column allows null values.
 73 | 
 74 | schema = StructType([ 
 75 |     StructField("firstname",StringType(),True), 
 76 |     StructField("middlename",StringType(),True), 
 77 |     StructField("lastname",StringType(),True), 
 78 |     StructField("id", StringType(), True), 
 79 |     StructField("gender", StringType(), True), 
 80 |     StructField("age", IntegerType(), True), 
 81 |     StructField("salary", IntegerType(), True) 
 82 |   ])
 83 |  
 84 | 
 85 | 
 86 | # COMMAND ----------
 87 | 
 88 | df = spark.createDataFrame(data=data,schema=schema)
 89 | df.printSchema()
 90 | # so you can see our age & salary columns are integer type while all other have string type. This way of assigning schema is faster than inferSchema.
 91 | 
 92 | # COMMAND ----------
 93 | 
 94 | display(df)
 95 | 
 96 | # COMMAND ----------
 97 | 
 98 | # MAGIC %md 
 99 | # MAGIC Let's read one CSV with custom schema & header.
100 | 
101 | # COMMAND ----------
102 | 
103 | # Lets define a custom schema first
104 | custom_schema = StructType(
105 |     [StructField("customer_id", IntegerType(), False),
106 |     StructField("gender", StringType(), True),
107 |      StructField("age", IntegerType(), True),
108 |      StructField("annual_income_k_usd", IntegerType(), True),
109 |      StructField("score", IntegerType(), True),
110 |     ]
111 | )
112 | 
113 | # COMMAND ----------
114 | 
115 | # assigning custome schema
116 | df_mall = spark \
117 |         .read \
118 |         .schema(custom_schema) \
119 |         .option("header", "true") \
120 |         .csv("/FileStore/datasets/Mall_Customers.csv")
121 | display(df_mall)
122 | 
123 | # COMMAND ----------
124 | 
125 | # we can see the column names & types are correct. It is always recommended to check dataframe schema before performing other operations. It will be helpful to decide whether your dataframe needs any type casting. We will study about type casting in further notebooks. So do not worry if you do not understand any piece of code. This is just introductory notebook. There is a separate notebook to explain all the basic transformation using pyspark.
126 | 
127 | df_mall.printSchema()
128 | 
129 | # COMMAND ----------
130 | 
131 | # MAGIC %run ../SETUP/_pyspark_clean_up
132 | 
133 | # COMMAND ----------
134 | 
135 | 
136 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS03-Creating Dataframes.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### Dataframe
 4 | # MAGIC 
 5 | # MAGIC A distributed collection of data grouped into named columns.
 6 | # MAGIC A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.
 7 | # MAGIC 
 8 | # MAGIC __createDataFrame()__ and __toDF()__ methods are two different way’s to create DataFrame in spark. By using toDF() method, we don’t have the control over schema customization whereas in createDataFrame() method we have complete control over the schema customization. Use toDF() method only for local testing. But we can use createDataFrame() method for both local testings as well as for running the code in production.
 9 | # MAGIC 
10 | # MAGIC __toDF()__ is used to convert rdd to dataframe.
11 | # MAGIC 
12 | # MAGIC 
13 | # MAGIC ![DATAFRAME](https://raw.githubusercontent.com/martandsingh/images/master/dataframe.jpeg)
14 | 
15 | # COMMAND ----------
16 | 
17 | # MAGIC %run ../SETUP/_pyspark_init_setup
18 | 
19 | # COMMAND ----------
20 | 
21 | # MAGIC %md
22 | # MAGIC ### createDataFrame()
23 | 
24 | # COMMAND ----------
25 | 
26 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType
27 | 
28 | 
29 | # COMMAND ----------
30 | 
31 | # Dataframe using list of tuples. Here we can apply custom header
32 | schema = StructType( [\
33 |                  StructField("lang", StringType(),True),\
34 |                  StructField("user", IntegerType(),True)\
35 | ])
36 | data = [("English", 12413), ("Hindi", 455543)]
37 | df = spark.createDataFrame(data, schema)
38 | display(df)
39 | 
40 | # COMMAND ----------
41 | 
42 | # CreateDataFrame using dictionary list
43 | 
44 | data_tuple = [{"lang":"English", "user": 10000}, {"lang":"Spanish", "user": 12452}]
45 | df = spark.createDataFrame(data_tuple)
46 | display(df)
47 | 
48 | # COMMAND ----------
49 | 
50 | # MAGIC %md
51 | # MAGIC ### toDF() 
52 | # MAGIC This is used to convert rdd to dataframes.
53 | 
54 | # COMMAND ----------
55 | 
56 | #Lets create an rdd with a list
57 | data = [("Football", 34566), ("Cricket", 2536), ("Baseball", 1234)]
58 | rdd = sc.parallelize(data)
59 | df_rdd =rdd.toDF()
60 | display(df_rdd)
61 | 
62 | # COMMAND ----------
63 | 
64 | # MAGIC %md 
65 | # MAGIC ### using files
66 | # MAGIC This type of dataframes we are going to see in our whole course. We will create dataframe using CSv, JSON & parquet files. but for this particular notebook demo I will use CSV source.
67 | 
68 | # COMMAND ----------
69 | 
70 | df_csv = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
71 | display(df_csv)
72 | 
73 | # COMMAND ----------
74 | 
75 | # MAGIC %run ../SETUP/_pyspark_clean_up
76 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS04-Basic Transformation.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC 
  4 | # MAGIC ### Basic Data Transformation
  5 | # MAGIC In this notebook we will learn:
  6 | # MAGIC 1. Read parquet file
  7 | # MAGIC 1. Check row & columns
  8 | # MAGIC 1. Describe function
  9 | # MAGIC 1. Select columns
 10 | # MAGIC 1. Filter columns
 11 | # MAGIC 1. Case statement
 12 | # MAGIC 1. Add new column
 13 | # MAGIC 1. Rename column
 14 | # MAGIC 1. Drop column
 15 | # MAGIC 1. String comparison
 16 | # MAGIC 1. AND & OR condition
 17 | # MAGIC 1. EXPR
 18 | # MAGIC 
 19 | # MAGIC ![DATA_TRANSFORMATION](https://raw.githubusercontent.com/martandsingh/images/master/transformation.png)
 20 | 
 21 | # COMMAND ----------
 22 | 
 23 | # MAGIC %run ../SETUP/_pyspark_init_setup
 24 | 
 25 | # COMMAND ----------
 26 | 
 27 | # MAGIC %md
 28 | # MAGIC ### Read Parquet File
 29 | # MAGIC pyspark_init_setup will setup data files in your dbfs location including parquet file. We will use it to perform basic operations. 
 30 | 
 31 | # COMMAND ----------
 32 | 
 33 | df_raw = spark.read.parquet("/FileStore/datasets/USED_CAR_PARQUET/")
 34 | display(df_raw)
 35 | 
 36 | # COMMAND ----------
 37 | 
 38 | # MAGIC %md 
 39 | # MAGIC ### Print Schema
 40 | 
 41 | # COMMAND ----------
 42 | 
 43 | # Check dataframe schema
 44 | df_raw.printSchema()
 45 | 
 46 | # COMMAND ----------
 47 | 
 48 | # MAGIC %md
 49 | # MAGIC ### Describe function
 50 | # MAGIC As name suggests it will give you a quick summary of your dataframe e.g. count, mean, median and other information.  
 51 | 
 52 | # COMMAND ----------
 53 | 
 54 | df_describe = df_raw.describe() # returns a dataframe
 55 | display(df_describe)
 56 | 
 57 | # COMMAND ----------
 58 | 
 59 | # MAGIC %md 
 60 | # MAGIC ### Check Rows & Columns
 61 | 
 62 | # COMMAND ----------
 63 | 
 64 | # df.columns returns python list including all the columns
 65 | all_columns = df_raw.columns 
 66 | print(all_columns)
 67 | 
 68 | # COMMAND ----------
 69 | 
 70 | total_rows = df_raw.count()
 71 | print(total_rows)
 72 | 
 73 | # COMMAND ----------
 74 | 
 75 | # MAGIC %md
 76 | # MAGIC ### Select Columns
 77 | 
 78 | # COMMAND ----------
 79 | 
 80 | # Choose only required columns
 81 | display(df_raw.select("vehicle_type", "brand_name", "model", "price"))
 82 | 
 83 | # COMMAND ----------
 84 | 
 85 | # MAGIC %md 
 86 | # MAGIC ### col function
 87 | # MAGIC Returns a Column based on the given column name. We can use col("columnname") to select column
 88 | 
 89 | # COMMAND ----------
 90 | 
 91 | # include col function
 92 | from pyspark.sql.functions import col
 93 | 
 94 | # COMMAND ----------
 95 | 
 96 | # below query will return sae resultset as above
 97 | display(df_raw.select( col("vehicle_type"), col("brand_name"), col("model"), col("price") ))
 98 | 
 99 | # COMMAND ----------
100 | 
101 | # MAGIC %md 
102 | # MAGIC ### LIMIT
103 | # MAGIC limit function is used to limit number of rows. If you want to select top n records, you can use limit functions.
104 | 
105 | # COMMAND ----------
106 | 
107 | # select top 5 records
108 | df_limit = df_raw.select( col("vehicle_type"), col("brand_name"), col("model"), col("price") ).limit(5)
109 | display(df_limit)
110 | 
111 | # COMMAND ----------
112 | 
113 | # MAGIC %md
114 | # MAGIC ### Add New Column
115 | # MAGIC We can add new column using withColumn("{column-name}", {value})
116 | 
117 | # COMMAND ----------
118 | 
119 | # Lets create a new column full_name with brand_name + model name in capital letters & select only top 5 full_name & price column. 
120 | # withColumn is used to add new column.
121 | # import concat_ws functions. this function concat two string with the provided separator.
122 | from pyspark.sql.functions import concat_ws, upper
123 | 
124 | df_processed = df_raw \
125 |             .withColumn("full_name", concat_ws(' ', col("brand_name"), col("model")) ) 
126 |             
127 |             
128 | display(df_processed)
129 | # You can see the full_name column added to the dataframe (scroll right).
130 | 
131 | # COMMAND ----------
132 | 
133 | # MAGIC %md
134 | # MAGIC ### Rename Column
135 | # MAGIC You can rename your column with 
136 | 
137 | # COMMAND ----------
138 | 
139 | df_processed = df_processed \
140 |             .withColumnRenamed("full_name", "vehicle_name")
141 | 
142 | display(df_processed.limit(5))
143 | # so we can see our output has new column name.
144 | 
145 | # COMMAND ----------
146 | 
147 | # MAGIC %md 
148 | # MAGIC ### Drop Column(s)
149 | # MAGIC You can drop one or more column using drop()
150 | 
151 | # COMMAND ----------
152 | 
153 | display(df_raw.drop("description", "ad_title", "seller_location"))
154 | # so here we used drop to delete few columns. But there is one very interesting thing with spark dataframe. Spark dataframes are immutable. This means whenever you perform any transformation in your dataset, it creates a new dataset. This mean dropping those columns will generate a new dataframe, it will not delete those columns from the original dataframe(df_raw). let try displa(df_raw) in next cell.
155 | 
156 | # COMMAND ----------
157 | 
158 | display(df_raw.limit(4)) 
159 | # we can see those 4 columns are still there. This is because of immutabile characterstics of spark dataframe. This is one of the major differences between spark & pandas dataframe.
160 | # So everytime you perform any transformation, you have to create a new dataframe to persist those changes.
161 | 
162 | # COMMAND ----------
163 | 
164 | # thats why we will create a new dataframe 
165 | df_processed = df_raw.drop("description", "ad_title", "seller_location")
166 | df_processed.printSchema() # now this dataframe will not include dropped columns.
167 | 
168 | # COMMAND ----------
169 | 
170 | # MAGIC %md 
171 | # MAGIC ### Filter Data
172 | # MAGIC You can use filter() to filter you data based on specific condition. It is simillar to SQL WHERE keyword.
173 | 
174 | # COMMAND ----------
175 | 
176 | # Lets select all the cars with body_type as SUV. Only select brand_name, body_type, price, displacement
177 | df_SUV = df_processed \
178 |         .filter(col("body_type")== "SUV")\
179 |         .select("brand_name", "body_type", "price", "displacement")
180 | 
181 | display(df_SUV)
182 | 
183 | # COMMAND ----------
184 | 
185 | # MAGIC %md
186 | # MAGIC ### CASE Statement
187 | # MAGIC The SQL CASE Statement. The CASE statement goes through conditions and returns a value when the first condition is met (like an if-then-else statement).
188 | # MAGIC 
189 | # MAGIC Let's categorise our data. Create a new column called "power" based on displacement value.  
190 | # MAGIC 
191 | # MAGIC - > Displacement < 1500 (less than 1500) -> Low 
192 | # MAGIC 
193 | # MAGIC - > 1500 >= Displacement < 2500 -> Medium  
194 | # MAGIC 
195 | # MAGIC - > Displacement >= 2500 -> Strong
196 | 
197 | # COMMAND ----------
198 | 
199 | from pyspark.sql.functions import when, regexp_replace
200 | 
201 | # we are adding a new column "power" which is being calculated using displacement. displacement contains cc. So first we will use regexp_replace to replace text cc. i.e. 100cc -> 100, 800cc -> 800. Then we will use "when" to perform conditional check.
202 | df_power = df_processed \
203 |         .withColumn("power", 
204 |                     when(regexp_replace(df_raw.displacement, "cc", "") < 1500, "LOW")
205 |                     .when(( regexp_replace(df_raw.displacement, "cc", "") >= 1500) & (df_raw.displacement < 2500),  "MEDIUM")
206 |                     .otherwise("HIGH")
207 |                    ).select("vehicle_type", "brand_name", "displacement", "power")
208 | 
209 | display(df_power)
210 | 
211 | # COMMAND ----------
212 | 
213 | # MAGIC %md
214 | # MAGIC ### AND & OR
215 | # MAGIC These keyword comes handy when we are defining multiple filter condition.
216 | # MAGIC 
217 | # MAGIC 1. AND- represented by &. it means all the condition in the expression must be true. 
218 | # MAGIC 
219 | # MAGIC example: I need a White Suzuki car. It mean my car must be white & brand must be Suzuki. It has to satisfy both the criteria. I will not accept Black Suzuki or White Honda.
220 | # MAGIC 
221 | # MAGIC TRUE & TRUE = TRUE ( this is and condition in binary form. So both of the expression must be true)
222 | # MAGIC 
223 | # MAGIC 2. OR - represent by |. It means any one of multiple condition is true.
224 | # MAGIC 
225 | # MAGIC TRUE + FALSE = TRUE
226 | 
227 | # COMMAND ----------
228 | 
229 | # Let's see one more example. But this time we will use 2 filter conditions. Select all white suzuki cars. 
230 | df_suz_wh = df_processed.filter( (col("brand_name")=="Suzuki") & (col("color")=="White") )
231 | display(df_suz_wh)
232 | # we can see our resultset only include White Suzuki. 
233 | 
234 | # COMMAND ----------
235 | 
236 | # I want to buy a car, but I have one condition. It should be either Honda or Suzuki. let's write the query. Here we will use OR condition as I have two options Honda & Suzuki. Any one of them is acceptable. Keep in mind comparison is always case sensitive. Honda is different then honda. So col("brand_name") == "Honda" & col("brand_name") == "honda" will produce different resultset. Try this in your assignment.
237 | 
238 | df_car = df_processed.filter( (col("brand_name") == "Honda") | (col("brand_name") == "Suzuki" ) )
239 | display(df_car)
240 | 
241 | # COMMAND ----------
242 | 
243 | # MAGIC %md
244 | # MAGIC ## String comparison is case sensitive !!!
245 | # MAGIC As I mentioned earlier text based comparison is always case sensitive. Honda is different than honda. So what is the best possible way to compare string? 
246 | # MAGIC 
247 | # MAGIC There are multiple ways but which I prefer is using __UPPER OR LOWER__ case function while comparing. Let's see the using example. We will slightly change above query.
248 | 
249 | # COMMAND ----------
250 | 
251 | from pyspark.sql.functions import upper
252 | 
253 | # I added upper function & if you noticed, I changed the case of "Honda" to "honda". It is still giving me same result. The query will convert both side of comparison string to upper case and then compare. In this way you guarantee that both sides will always have same case. you can use lower function also for this.
254 | df_car = df_processed.filter( (upper(col("brand_name")) == "honda".upper() ) | ( upper(col("brand_name")) == "Suzuki".upper() ) )
255 | display(df_car)
256 | 
257 | # COMMAND ----------
258 | 
259 | # MAGIC %md
260 | # MAGIC ### EXPR - expression
261 | # MAGIC Using EXPR we can run SQL statements in pyspark statements.
262 | 
263 | # COMMAND ----------
264 | 
265 | # lets categorize our data using SQL expression.
266 | from pyspark.sql.functions import expr
267 | df_categorized_expr = df_processed \
268 |                        .withColumn("power", expr('''
269 |                            CASE WHEN displacement < 1500 THEN "LOW"
270 |                            WHEN displacement >= 1500 AND displacement < 2500 THEN "MEDIUM"
271 |                            ELSE "HIGH" END''' )) \
272 |                        .select("vehicle_type", "body_type", "brand_name", "displacement", "power")
273 | 
274 | display(df_categorized_expr)
275 | 
276 | # COMMAND ----------
277 | 
278 | # MAGIC %md
279 | # MAGIC ### Chaining
280 | # MAGIC In our examples we apply different kind of operations in different cell, but apache spark allows you to chain multiple statements.
281 | # MAGIC e.g. df.operation1.operation2.operation3.....
282 | # MAGIC 
283 | # MAGIC __Lets do following operations on our dataset:__
284 | # MAGIC 1. delete column description, manufacturer & adtitle
285 | # MAGIC 1. update column displacement, remove cc from values. e.g. 800cc -> 800
286 | # MAGIC 1. create a new column price_in_k = price/1000. it will proce price in 1000s.
287 | # MAGIC 1. select only body_type, brand_name, color, displacement, price_in_k
288 | 
289 | # COMMAND ----------
290 | 
291 | from pyspark.sql.functions import regexp_replace, col
292 | 
293 | df_processed = df_raw\
294 |     .withColumn("displacement", regexp_replace(col("displacement"), "cc", ""))\
295 |     .withColumn("price_in_k", col("price")/1000)\
296 |     .drop("description", "manufacturer", "adtitle")\
297 |     .select("body_type", "brand_name", "color", "displacement", "price_in_k")
298 | 
299 | display(df_processed)
300 | 
301 | # COMMAND ----------
302 | 
303 | # MAGIC %run ../SETUP/_pyspark_clean_up
304 | 
305 | # COMMAND ----------
306 | 
307 | # MAGIC %md
308 | # MAGIC ### Assignment 1
309 | # MAGIC 1. Add a new column price_usd to your dataframe df_raw. This column will save price of car in USD. price_usd = price * (0.0051);
310 | # MAGIC 2. Add one more column brand_name_upper, which will have brand_name in upper cases.
311 | # MAGIC 3. Select only brand_name_upper, color, price_usd columns to your final resultset.
312 | # MAGIC 
313 | # MAGIC ### Assignment 2
314 | # MAGIC 1. Select all the Grey Honda Hatchback cars.
315 | 
316 | # COMMAND ----------
317 | 
318 | 
319 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS05-Handling JSON.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### JSON Handling
  4 | # MAGIC 1. How to read simple & nested JSON. 
  5 | # MAGIC 2. How to create new columns using nested json
  6 | # MAGIC 
  7 | # MAGIC ![PYSPARK_JSON](https://raw.githubusercontent.com/martandsingh/images/master/json_pyspark.png)
  8 | 
  9 | # COMMAND ----------
 10 | 
 11 | # MAGIC %run ../SETUP/_pyspark_init_setup
 12 | 
 13 | # COMMAND ----------
 14 | 
 15 | df = spark \
 16 |     .read \
 17 |     .option("multiline", "true")\
 18 |     .json("/FileStore/datasets/used_cars_nested.json")
 19 | 
 20 | display(df)
 21 | 
 22 | # COMMAND ----------
 23 | 
 24 | # Imports
 25 | from pyspark.sql.functions import explode, col
 26 | 
 27 | # COMMAND ----------
 28 | 
 29 | df_exploded = df \
 30 |             .withColumn("usedCars", explode(df["usedCars"]))
 31 | 
 32 | display(df_exploded)
 33 | 
 34 | # COMMAND ----------
 35 | 
 36 | # Now we will read JSON values and add new columns, later we will delete usedCars(Raw json) column as we do not need it.
 37 | df_clean = df_exploded \
 38 |             .withColumn("vehicle_type", col("usedCars")["@type"])\
 39 |             .withColumn("body_type", col("usedCars")["bodyType"])\
 40 |             .withColumn("brand_name", col("usedCars")["brand"]["name"])\
 41 |             .withColumn("color", col("usedCars")["color"])\
 42 |             .withColumn("description", col("usedCars")["description"])\
 43 |             .withColumn("model", col("usedCars")["model"])\
 44 |             .withColumn("manufacturer", col("usedCars")["manufacturer"])\
 45 |             .withColumn("ad_title", col("usedCars")["name"])\
 46 |             .withColumn("currency", col("usedCars")["priceCurrency"])\
 47 |             .withColumn("seller_location", col("usedCars")["sellerLocation"])\
 48 |             .withColumn("displacement", col("usedCars")["vehicleEngine"]["engineDisplacement"])\
 49 |             .withColumn("transmission", col("usedCars")["vehicleTransmission"])\
 50 |             .withColumn("price", col("usedCars")["price"]) \
 51 |             .drop("usedCars")
 52 | display(df_clean)
 53 | 
 54 | # COMMAND ----------
 55 | 
 56 | # MAGIC %md
 57 | # MAGIC So now we have our clean dataframe df_clean. So we saw how explode function create a row for each element of an array. In our case, our array had struct items. So each row created had a struct type item (df_exploded), which we later used to create new columns (df_clean). 
 58 | # MAGIC 
 59 | # MAGIC ### Explode
 60 | # MAGIC The explode() method converts each element of the specified column(s) into a row. 
 61 | # MAGIC 
 62 | # MAGIC __Syntax:__
 63 | # MAGIC 
 64 | # MAGIC dataframe.explode(column, ignore_index)
 65 | # MAGIC 
 66 | # MAGIC But there may some cases where we have a string type column which consist of JSON string. how to deal with it? Obviously, you can convert it to struct type and then follow the same process we did earlier. Apart from this, there is one more cleaner way to achieve this.
 67 | # MAGIC 
 68 | # MAGIC We can use json_tuple, get_json_object functions.
 69 | # MAGIC 
 70 | # MAGIC __get_json_object()__: This is used to query json object inline.
 71 | # MAGIC 
 72 | # MAGIC __json_tuple()__: *We can use this if json has only one level of nesting*
 73 | # MAGIC 
 74 | # MAGIC Confused????? Let's try with an example.
 75 | 
 76 | # COMMAND ----------
 77 | 
 78 | from pyspark.sql.functions import get_json_object, json_tuple
 79 | 
 80 | # COMMAND ----------
 81 | 
 82 | # lets create a test dataframe, which contains JSON string. Range function creates a simple dataframe with given number of rows.
 83 | df_json_string = spark.range(1)\
 84 | .selectExpr("""
 85 | '{"Car" : {"Model" : ["i10", "i20", "Verna"], "Brand":"Hyundai" }}' as Cars
 86 | """)
 87 | 
 88 | display(df_json_string)
 89 | 
 90 | # COMMAND ----------
 91 | 
 92 | # MAGIC %md 
 93 | # MAGIC Our task is to create a new dataframe with 2 columns:
 94 | # MAGIC * Model - take only first item in array. Just for the sake of tutorial. It will tell you how to pick a specific item using json_tuple
 95 | # MAGIC * Brand
 96 | 
 97 | # COMMAND ----------
 98 | 
 99 | df_cars = df_json_string \
100 |         .withColumn("Brand", json_tuple(col("Cars"), "Car") )\
101 |         .withColumn("Model", get_json_object(col("Cars"), "$.Car.Model[1]"))
102 | 
103 | display(df_cars)
104 | 
105 | # COMMAND ----------
106 | 
107 | # MAGIC %run ../SETUP/_pyspark_clean_up
108 | 
109 | # COMMAND ----------
110 | 
111 | # MAGIC %md
112 | # MAGIC ### Assignment
113 | # MAGIC 1. Download https://github.com/martandsingh/datasets/blob/master/person_details.json &  try to read it using spark
114 | # MAGIC 2. Create a dataframe with column: name, age, cars(Array type), city, state, country
115 | 
116 | # COMMAND ----------
117 | 
118 | 
119 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS06-JOINS.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC 
  4 | # MAGIC ### What are the joins?
  5 | # MAGIC PySpark Join is used to combine two DataFrames and by chaining these you can join multiple DataFrames. In real life projects, you will be having more than one dataset or table (normalized data). To calculate or get final data sometime you may need to join different datasets.
  6 | # MAGIC 
  7 | # MAGIC There are many types of joins are available, but we will discuss only most common table join types:
  8 | # MAGIC 1. Inner join - returns record which are common in both the tables.
  9 | # MAGIC 1. Left outer join - common records + unmatched records from left table (left side of join clause)
 10 | # MAGIC 1. Right outer join - common records + unmatched records from right table
 11 | # MAGIC 1. Cross join - Cartesian product of two tables. It will create MxN records (M - no of records in left table, N - no of records in right side table).
 12 | # MAGIC 1. Full outer join - all the records from both the tables except the common records.
 13 | # MAGIC 
 14 | # MAGIC ![JOINS](https://raw.githubusercontent.com/martandsingh/images/master/joins.jpg)
 15 | 
 16 | # COMMAND ----------
 17 | 
 18 | # MAGIC %run ../SETUP/_pyspark_init_setup
 19 | 
 20 | # COMMAND ----------
 21 | 
 22 | # MAGIC %sql
 23 | # MAGIC -- Here we are creating SQL tables. Do not worry about the code. We have a separate tutorial for this SQL_Refresher. Check out our githb url: https://github.com/martandsingh/ApacheSpark
 24 | # MAGIC 
 25 | # MAGIC CREATE TABLE IF NOT EXISTS  T1
 26 | # MAGIC (
 27 | # MAGIC   id VARCHAR(10)
 28 | # MAGIC );
 29 | # MAGIC CREATE TABLE IF NOT EXISTS T2
 30 | # MAGIC (
 31 | # MAGIC   id VARCHAR(10)
 32 | # MAGIC );
 33 | # MAGIC CREATE TABLE IF NOT EXISTS T3
 34 | # MAGIC (
 35 | # MAGIC   id VARCHAR(10)
 36 | # MAGIC );
 37 | # MAGIC 
 38 | # MAGIC INSERT INTO T1 VALUES ('1'), ('2'), ('3'), ('4'), (NULL);
 39 | # MAGIC INSERT INTO T2 VALUES ('1'), ('2');
 40 | # MAGIC INSERT INTO T3 VALUES ( '3'), ('4'), ('5'), (NULL);
 41 | 
 42 | # COMMAND ----------
 43 | 
 44 | # convert tables to dataframe
 45 | df_1 = spark.sql("SELECT * FROM T1");
 46 | df_2 = spark.sql("SELECT * FROM T2");
 47 | df_3 = spark.sql("SELECT * FROM T3");
 48 | 
 49 | # COMMAND ----------
 50 | 
 51 | display(df_1)
 52 | display(df_2)
 53 | display(df_3)
 54 | 
 55 | # COMMAND ----------
 56 | 
 57 | # MAGIC %md
 58 | # MAGIC ### INNER JOIN
 59 | 
 60 | # COMMAND ----------
 61 | 
 62 | # INNER JOIN
 63 | # This join will give only matching records. It will return only the records which are present on both the table.
 64 | df_inner = df_1.join(df_2, df_1["id"]==df_2["id"], "inner")
 65 | display(df_inner)
 66 | 
 67 | # COMMAND ----------
 68 | 
 69 | # MAGIC %md
 70 | # MAGIC ### LEFT OUTER JOIN
 71 | 
 72 | # COMMAND ----------
 73 | 
 74 | # LEFT JOIN
 75 | # It will return only the records which are present on both the table and all non-matching records from left table.
 76 | df_left = df_1.join(df_2, df_1["id"]==df_2["id"], "left")
 77 | display(df_left)
 78 | 
 79 | # COMMAND ----------
 80 | 
 81 | # MAGIC %md
 82 | # MAGIC ### RIGHT OUTER JOIN
 83 | 
 84 | # COMMAND ----------
 85 | 
 86 | # RIGHT JOIN
 87 | # It will return only the records which are present on both the table and all non-matching records from right table.
 88 | df_right = df_1.join(df_3, df_1["id"]==df_3["id"], "right")
 89 | display(df_right)
 90 | 
 91 | # COMMAND ----------
 92 | 
 93 | # MAGIC %md
 94 | # MAGIC ### FULL OUTER JOIN
 95 | 
 96 | # COMMAND ----------
 97 | 
 98 | # FULL OUTER JOIN
 99 | df_full = df_1.join(df_3, df_1["id"]==df_3["id"], "full")
100 | display(df_full)
101 | 
102 | # COMMAND ----------
103 | 
104 | # MAGIC %md
105 | # MAGIC ### CROSS JOIN
106 | 
107 | # COMMAND ----------
108 | 
109 | # CROSS JOIN, it will return cartesian product of both the tables
110 | df_cross = df_1.crossJoin(df_3)
111 | display(df_cross)
112 | 
113 | # COMMAND ----------
114 | 
115 | # MAGIC %md
116 | # MAGIC ### Example
117 | # MAGIC 
118 | # MAGIC Let's see an example using our sales order dataset.
119 | 
120 | # COMMAND ----------
121 | 
122 | # We will use sales dataset. We have three tables: order list, order details & sales target. Let's load data first. Order list and details are linked using order id. Order details & Sales target are linked with category.
123 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
124 | df_od = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderdetails.csv")
125 | df_st = spark.read.option("header", "true").csv("/FileStore/datasets/sales/salestarget.csv")
126 | 
127 | # COMMAND ----------
128 | 
129 | display(df_ol.limit(3))
130 | display(df_od.limit(3))
131 | display(df_st.limit(3))
132 | 
133 | # COMMAND ----------
134 | 
135 | df_inner = df_ol \
136 |         .join(df_od, df_ol["Order Id"] == df_od["Order Id"], "inner")\
137 |         .select(df_ol["Order Id"], df_ol["Order Date"], df_od["Amount"], df_od["Profit"], df_od["Category"])
138 | display(df_inner)
139 | 
140 | # COMMAND ----------
141 | 
142 | # left_outer, left as same
143 | df_left_outer = df_ol \
144 |         .join(df_od, df_ol["Order Id"] == df_od["Order Id"], "left_outer")\
145 |         .select(df_ol["Order Id"], df_ol["Order Date"], df_od["Amount"], df_od["Profit"], df_od["Category"])
146 | display(df_left_outer)
147 | 
148 | # COMMAND ----------
149 | 
150 | # MAGIC %run ../SETUP/_pyspark_clean_up
151 | 
152 | # COMMAND ----------
153 | 
154 | 
155 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS07-Grouping & Aggregation.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### Group  & Aggregation
 4 | # MAGIC Similar to SQL GROUP BY clause, PySpark groupBy() function is used to collect the identical data into groups on DataFrame and perform aggregate functions on the grouped data.
 5 | # MAGIC 
 6 | # MAGIC PYSPARK AGG is an aggregate function that is functionality provided in PySpark that is used for operations. The aggregate operation operates on the data frame of a PySpark and generates the result for the same. It operates on a group of rows and the return value is then calculated back for every group.
 7 | # MAGIC 
 8 | # MAGIC ![GROUPING](https://raw.githubusercontent.com/martandsingh/images/master/grouping.png)
 9 | 
10 | # COMMAND ----------
11 | 
12 | # MAGIC %run ../SETUP/_pyspark_init_setup
13 | 
14 | # COMMAND ----------
15 | 
16 | df = spark.read.parquet('/FileStore/datasets/USED_CAR_PARQUET/')
17 | display(df)
18 | 
19 | # COMMAND ----------
20 | 
21 | from pyspark.sql.functions import col
22 | 
23 | # COMMAND ----------
24 | 
25 | # Group by body type. Query will writted body_type and total count for that particular body type
26 | df_type = df.groupBy("body_type").count().orderBy(col("count").desc())
27 | display(df_type)
28 | 
29 | # COMMAND ----------
30 | 
31 | # Grouping using multiple columns. Below query will group your data with body_type & brand_name. e.g. How many suzuki hatchback are there?
32 | display(df.groupBy("brand_name", "body_type" ).count().orderBy(col("count").desc()))
33 | 
34 | # COMMAND ----------
35 | 
36 | # Average price of each brand name. Get the average price for each brand name
37 | df_avgPrice = df.groupBy("brand_name").mean("price").orderBy("avg(price)")
38 | display(df_avgPrice)
39 | 
40 | # with this analysis we can see Daewoo, Suzuki is relatively cheaper car & MG, Haval, Kia are expensive cars
41 | 
42 | # COMMAND ----------
43 | 
44 | # Now lets check does body type affect the price of car, what kind of body types are cheaper or expensive.
45 | df_body_price = df.groupBy("brand_name", "body_type").mean("price").orderBy(col("avg(price)").desc())
46 | display(df_body_price)
47 | 
48 | # so toyota SUV are generally expensive, daewoo & suzuki sedan are cheaper
49 | 
50 | # COMMAND ----------
51 | 
52 | # MAGIC %md
53 | # MAGIC ### Agg function
54 | 
55 | # COMMAND ----------
56 | 
57 | df_agg = df.agg({"brand_name":"count", "body_type":"count", "price": "avg"})
58 | display(df_agg)
59 | 
60 | # COMMAND ----------
61 | 
62 | df_agg = df.agg({"brand_name":"count", "body_type":"count", "price": "avg"}) \
63 |         .withColumnRenamed("avg(price)", "avg_price") \
64 |         .withColumnRenamed("count(body_type)", "total_types")\
65 |         .withColumnRenamed("count(brand_name)", "total_brands")
66 | display(df_agg)
67 | 
68 | # COMMAND ----------
69 | 
70 | # MAGIC %run ../SETUP/_pyspark_clean_up
71 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS08-Ordering Data.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### Ordering Data
 4 | # MAGIC The __ORDER BY__ keyword is used to sort the records in ascending order by default. To sort the records in descending order, use the DESC keyword
 5 | 
 6 | # COMMAND ----------
 7 | 
 8 | # MAGIC %run ../SETUP/_pyspark_init_setup
 9 | 
10 | # COMMAND ----------
11 | 
12 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
13 | display(df_ol)
14 | 
15 | # COMMAND ----------
16 | 
17 | from pyspark.sql.functions import col
18 | 
19 | # COMMAND ----------
20 | 
21 | # Sort in ascending order
22 | df_cust_order = df_ol.filter(col("CustomerName").isNotNull()).orderBy(col("CustomerName"))
23 | display(df_cust_order)
24 | 
25 | # COMMAND ----------
26 | 
27 | # Sort in descending order
28 | df_cust_order = df_ol.filter(col("CustomerName").isNotNull()).orderBy(col("CustomerName").desc())
29 | display(df_cust_order)
30 | 
31 | # COMMAND ----------
32 | 
33 | # Sort with more than one column - ascending order
34 | df_sort = df_ol\
35 |         .filter(col("CustomerName").isNotNull())\
36 |         .orderBy([col("CustomerName"), col("State")], ascending=True)
37 | display(df_sort)
38 | 
39 | # COMMAND ----------
40 | 
41 | # Sort with more than one column - descending order
42 | df_sort = df_ol\
43 |         .filter(col("CustomerName").isNotNull())\
44 |         .orderBy([col("CustomerName"), col("State")], ascending=False)
45 | display(df_sort)
46 | 
47 | # COMMAND ----------
48 | 
49 | 
50 | # Sort in ascending order. If column has null value you can control where to show null values. by default null value will always be on top. 
51 | df_cust_order = df_ol.orderBy(col("CustomerName"))
52 | display(df_cust_order)
53 | 
54 | # COMMAND ----------
55 | 
56 | # You can choose where to place null values using  asc_nulls_last(place null at last), asc_nulls_first (place null at first)
57 | df_cust_order = df_ol.orderBy(col("CustomerName").asc_nulls_last())
58 | display(df_cust_order)
59 | 
60 | # COMMAND ----------
61 | 
62 | # MAGIC %run ../SETUP/_pyspark_clean_up
63 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS09-String Functions.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### String Functions
 4 | # MAGIC 
 5 | # MAGIC In this demo we will learn basic string functions which we use in our daily life projects.
 6 | 
 7 | # COMMAND ----------
 8 | 
 9 | # MAGIC %run ../SETUP/_pyspark_init_setup
10 | 
11 | # COMMAND ----------
12 | 
13 | df = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
14 | display(df)
15 | 
16 | # COMMAND ----------
17 | 
18 | # MAGIC %md
19 | # MAGIC * Length
20 | # MAGIC * Case change & Initcap
21 | # MAGIC * Replace
22 | # MAGIC * Substring
23 | # MAGIC * Concatenation (concat, concat_ws)
24 | # MAGIC * Right/Left padding
25 | # MAGIC * String split
26 | # MAGIC * Trim
27 | # MAGIC * Repeat
28 | 
29 | # COMMAND ----------
30 | 
31 | from pyspark.sql.functions import col, length, upper, lower, regexp_extract, regexp_replace, trim, repeat, substring, substring_index, concat_ws, concat, initcap, split, lpad, rpad
32 | 
33 | # COMMAND ----------
34 | 
35 | #Add new column cust_length,  the length of customer name 
36 | df_trans = df.withColumn("cust_length", length(col("CustomerName")) )
37 | display(df_trans)
38 | 
39 | # COMMAND ----------
40 | 
41 | #Add new column cust_upper & cust_lower which will contain customer name in upper & lower case respectively.
42 | df_trans = df_trans \
43 |         .withColumn("cust_upper", upper("CustomerName") )\
44 |         .withColumn("cust_lower", lower("CustomerName") )\
45 |         .withColumn("cust_initcap", initcap("cust_lower") )
46 | display(df_trans)
47 | 
48 | # COMMAND ----------
49 | 
50 | #Add new column hidden_name which include customer name with "a" replaced as *.
51 | df_trans = df_trans \
52 |         .withColumn("hidden_name", regexp_replace(col("CustomerName"), "a", "*" ))
53 | display(df_trans)
54 | 
55 | # COMMAND ----------
56 | 
57 | # get starting & last three characters of customer name
58 | df_trans = (df_trans
59 |         .withColumn("customer_name_start", substring(col("CustomerName"), 1, 3))
60 |         .withColumn("customer_name_last", substring(col("CustomerName"), -3, 2))
61 |        )
62 | display(df_trans)
63 | 
64 | # COMMAND ----------
65 | 
66 | # Extract year from order date using split functions
67 | df_trans = df_trans.withColumn("Year", split(col("Order Date"), "-")[2] ) 
68 | display(df_trans)
69 | 
70 | # COMMAND ----------
71 | 
72 | # Extract year from order date using split functions
73 | df_trans = df_trans.withColumn("repeat_date", repeat("Order Date", 2 ) )
74 | display(df_trans)
75 | 
76 | # COMMAND ----------
77 | 
78 | df_trans = df_trans.withColumn("trim_name", trim("CustomerName") )
79 | display(df_trans)
80 | 
81 | # COMMAND ----------
82 | 
83 | df_trans = df_trans\
84 |             .withColumn("lpad_name", lpad("CustomerName", 20, "$") )\
85 |             .withColumn("rpad_name", rpad("CustomerName", 20, "#") )
86 | display(df_trans)
87 | 
88 | # COMMAND ----------
89 | 
90 | # MAGIC %run ../SETUP/_pyspark_clean_up
91 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS10-Date & Time Functions.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### Date Time Functions
 4 | # MAGIC Date is an important data type when it comes to reporting. In most of the time, reports are arranged based on date, month, year or a particular amount of time. To answer all those questions we will see how can we use date & time in pyspark.
 5 | # MAGIC 
 6 | # MAGIC ![DATETIME](https://raw.githubusercontent.com/martandsingh/images/master/datetime.jpg)
 7 | 
 8 | # COMMAND ----------
 9 | 
10 | # MAGIC %run ../SETUP/_pyspark_init_setup
11 | 
12 | # COMMAND ----------
13 | 
14 | df = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
15 | display(df)
16 | 
17 | # COMMAND ----------
18 | 
19 | # MAGIC %md
20 | # MAGIC * Current data time
21 | # MAGIC * Convert string to date
22 | # MAGIC * Extract date part e.g. day, month, year, week
23 | # MAGIC * convert to/from unixtimestamp
24 | # MAGIC * Change date format
25 | # MAGIC * Date difference
26 | # MAGIC * Add/Subtract date
27 | 
28 | # COMMAND ----------
29 | 
30 | from pyspark.sql.functions import current_date, current_timestamp, to_date, col, dayofmonth, month, year, quarter, date_add, date_sub, date_trunc, add_months, months_between, datediff
31 | 
32 | # COMMAND ----------
33 | 
34 | df_trans = df \
35 |         .withColumn("current_date", current_date())\
36 |         .withColumn("current_timestamp", current_timestamp())
37 | 
38 | display(df_trans)
39 | 
40 | # COMMAND ----------
41 | 
42 | # We can see order date is string type.
43 | df.printSchema()
44 | 
45 | # COMMAND ----------
46 | 
47 | # Let's convert Order Date column to date
48 | df_trans = df_trans.withColumn("order_date", to_date(col("Order Date"), "dd-MM-yyyy"))
49 | display(df_trans)
50 | 
51 | # COMMAND ----------
52 | 
53 | df_trans.printSchema()
54 | 
55 | # COMMAND ----------
56 | 
57 | # Add new column Day, Month & Year from order_date column
58 | df_trans = df_trans\
59 |         .withColumn("Day", dayofmonth("order_date") )\
60 | .withColumn("Month", month("order_date") )\
61 | .withColumn("Year", year("order_date") )\
62 | .withColumn("Quarter", quarter("order_date") )
63 | 
64 | display(df_trans)
65 | 
66 | # COMMAND ----------
67 | 
68 | # Add and Subtract days in order day
69 | df_trans = df_trans\
70 |         .withColumn("order_next_10_days", date_add(col("order_date"), 10))\
71 |         .withColumn("order_prev_10_days", date_sub(col("order_date"), 10))\
72 |         .withColumn("order_add_months", add_months(col("order_date"), 2))\
73 |         .withColumn("date_dff", datediff(col("order_next_10_days"), col("order_date")) )
74 | 
75 | 
76 | display(df_trans)
77 | 
78 | # COMMAND ----------
79 | 
80 | # MAGIC %run ../SETUP/_pyspark_clean_up
81 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS11-Partitioning & Repartitioning.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC 
  4 | # MAGIC ### Partitioning
  5 | # MAGIC In the world of big data, partitioning is an extremely important concept. As name suggest, partitioning means dividing your data into smaller parts based on a partition key. You can also use multiple keys to partition your data.
  6 | # MAGIC 
  7 | # MAGIC we use partitionBy() to parition our data. Partition means when you choose a partition key, you data is divided into smaller parts based on that key & it will store your data into subfolders. example:
  8 | # MAGIC 
  9 | # MAGIC If you have 1 billion rows for 1000 users. Everytime when you use filter based on userid (WHERE userid = 'abc'), the executor will scan whole data. Let say you choose userid as partition key, it will divide or partition your data into 1000 sub folders(as we have 1000 unique users). Now the query (WHERE useri='abc') will scan only once folder which contains abc records.
 10 | # MAGIC 
 11 | # MAGIC ### How to choose a partition key?
 12 | # MAGIC 
 13 | # MAGIC Let's see a demo.
 14 | # MAGIC 
 15 | # MAGIC ![Partition](https://raw.githubusercontent.com/martandsingh/images/master/partitioning.png)
 16 | # MAGIC 
 17 | # MAGIC ### How to decide number of partitions in Spark?
 18 | # MAGIC In Spark, one should carefully choose the number of partitions depending on the cluster design and application requirements. The best technique to determine the number of spark partitions in an RDD is to multiply the number of cores in the cluster with the number of partitions.
 19 | # MAGIC 
 20 | # MAGIC ### How do I create a partition in Spark?
 21 | # MAGIC In Spark, you can create partitions in two ways -
 22 | # MAGIC 1. Repartition - used to increase and decrease the partitions. Results in more or less equal sized partitions. Since a full shuffle takes place, repartition is less performant than coalesce. Repartition always involves a shuffle.
 23 | # MAGIC 1. Coalesce - used to decrease the partition. It creates unequal partitions. Faster than repartition but query performance an be slower. Coalesce doesn’t involve a full shuffle.
 24 | # MAGIC 
 25 | # MAGIC By invoking partitionBy method on an RDD, you can provide an explicit partitioner,
 26 | 
 27 | # COMMAND ----------
 28 | 
 29 | # MAGIC %run ../SETUP/_pyspark_init_setup
 30 | 
 31 | # COMMAND ----------
 32 | 
 33 | from pyspark.sql.types import StructField, StructType, StringType, DecimalType, IntegerType
 34 | 
 35 | # COMMAND ----------
 36 | 
 37 | # We are using a game steam dataset.
 38 | custom_schema = StructType(
 39 | [
 40 |     StructField("gamer_id", IntegerType(), True),
 41 |     StructField("game", StringType(), True),
 42 |     StructField("behaviour", StringType(), True),
 43 |     StructField("play_hours", DecimalType(), True),
 44 |     StructField("rating", IntegerType(), True)
 45 | ])
 46 | df = spark.read.option("header", "true").schema(custom_schema).csv('/FileStore/datasets/steam-200k.csv')
 47 | display(df)
 48 | 
 49 | # COMMAND ----------
 50 | 
 51 | df.count()
 52 | 
 53 | # COMMAND ----------
 54 | 
 55 | df.select("game").distinct().count()
 56 | 
 57 | # COMMAND ----------
 58 | 
 59 | df.rdd.getNumPartitions()
 60 | 
 61 | # COMMAND ----------
 62 | 
 63 | # We are using a game steam dataset. It will partition your data into default values of partitions which is 3 in my case you can check using df.rdd.getNumPartitions(). This process will be quicker as we do not have any partition key so spark does not have to sort and partition data based on key.
 64 | 
 65 | df.write.mode("overwrite").parquet("/FileStore/output/gamelogs_unpart")
 66 | 
 67 | # COMMAND ----------
 68 | 
 69 | display(dbutils.fs.ls('/FileStore/output/gamelogs_unpart'))
 70 | 
 71 | # COMMAND ----------
 72 | 
 73 | # We are using a game steam dataset. This will create multiple folders based on game names. we have 5155 unique game, it will create 5155 folders. This process will take longer time to execute.
 74 | 
 75 | df.write.partitionBy("game").mode("overwrite").parquet("/FileStore/output/gamelogs_part")
 76 | 
 77 | # COMMAND ----------
 78 | 
 79 | df_files =  dbutils.fs.ls('/FileStore/output/gamelogs_part') 
 80 | type(df_files)
 81 | 
 82 | # COMMAND ----------
 83 | 
 84 | len(df_files) # So we can see we have 5156 (5155 for games, 1 for log)
 85 | 
 86 | # COMMAND ----------
 87 | 
 88 | # Lets read from our partition data
 89 | df_game = spark.read.parquet("/FileStore/output/gamelogs_part/")
 90 | display(df_game)
 91 | 
 92 | # COMMAND ----------
 93 | 
 94 | from pyspark.sql.functions import col
 95 | 
 96 | # COMMAND ----------
 97 | 
 98 | display(df_game.filter( (col("game") == "Dota 2") & (col("behaviour") == "purchase") & (col("play_hours") == 1 )  ))
 99 | 
100 | # COMMAND ----------
101 | 
102 | display(df.filter( (col("game") == "Dota 2") & (col("behaviour") == "purchase") & (col("play_hours") == 1 )  ))
103 | 
104 | # COMMAND ----------
105 | 
106 | # MAGIC %md 
107 | # MAGIC ### Repartition
108 | 
109 | # COMMAND ----------
110 | 
111 | df_game.rdd.getNumPartitions()
112 | 
113 | # COMMAND ----------
114 | 
115 | from pyspark.sql.functions import spark_partition_id
116 | 
117 | # COMMAND ----------
118 | 
119 | # Add a new column which include partition id. Count total number of record in each partition. It is recommended to have almost equal number of records in ech partition.
120 | 
121 | display( df_game.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().orderBy("count"))
122 | 
123 | 
124 | # COMMAND ----------
125 | 
126 | # repartition() is used to increase or decrease the RDD, DataFrame, Dataset partitions
127 | # If we want to reduce or increade the paritition size we can use it. 
128 | df_repart = df_game.repartition(40)
129 | display(df_repart.limit(10))
130 | 
131 | # COMMAND ----------
132 | 
133 | df_repart.rdd.getNumPartitions() #  40 partitions as we did
134 | 
135 | # COMMAND ----------
136 | 
137 | # we can also increase partition
138 | df_repart2 = df_game.repartition(80)
139 | display(df_repart2.limit(10))
140 | 
141 | # COMMAND ----------
142 | 
143 | # MAGIC %md
144 | # MAGIC ### coalesce()
145 | # MAGIC It is only used to reduce the partition. This is optimized or improved version of repartition() where the movement of the data across the partitions is lower using coalesce.
146 | 
147 | # COMMAND ----------
148 | 
149 | df_col2 = df_game.coalesce(10)
150 | display(df_col2)
151 | 
152 | # COMMAND ----------
153 | 
154 | df_col.rdd.getNumPartitions()#  10 partitions as we did
155 | 
156 | # COMMAND ----------
157 | 
158 | 
159 | 
160 | # COMMAND ----------
161 | 
162 | # if you will try to increase the partition using coalesce, it will throw error.
163 | df_col3 = df_col.coalesce(1000)
164 | display(df_col3)
165 | 
166 | # COMMAND ----------
167 | 
168 | df_col3.rdd.getNumPartitions() #  332 partitions which is the original value. we cannot increase it
169 | 
170 | # COMMAND ----------
171 | 
172 | # MAGIC %run ../SETUP/_pyspark_clean_up
173 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS12-Missing Value Handling.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### Handling Missing Values
  4 | # MAGIC In real life, data is not fine & curated. It is ambigous, dirty. It may contains null values or invalid values. for example. You have a hospital dataset, the age of patient is 1000. I am not sure about other planets but on earth it is impossible for a person to live 1000 years. So clearly it is a mistake. There can be case where firstname of patient is null. 
  5 | # MAGIC 
  6 | # MAGIC So you will see these kind of issues in raw data. These values can hamper your analysis. So we have to find and fix those values.
  7 | # MAGIC 
  8 | # MAGIC ![MISSING_VALUES](https://raw.githubusercontent.com/martandsingh/images/master/missing-values.png)
  9 | # MAGIC 
 10 | # MAGIC 
 11 | # MAGIC There are many ways to handle these values:
 12 | # MAGIC 
 13 | # MAGIC 1. Drop - We can drop the entire row if we find any null value. But as per my experience I do not prefer this method as sometime you delete important information or anamolies if you use this method.
 14 | # MAGIC 
 15 | # MAGIC 1. Fill - Here we try to fill our null values with some valid values. These values can be mean, mode, median or some other logic you define. The concept it to replace null values with most common or likely value. In this way you do not lose the entire row.
 16 | # MAGIC 
 17 | # MAGIC 1. Replace - Replace is more flexible option than fill. It can do the operation with fill() does but apart from that it is also helpful when you want to replace a string with another.
 18 | # MAGIC 
 19 | # MAGIC ### What is Imputation?
 20 | # MAGIC The process of preserving all cases by replacing missing data with an estimated value based on other available information is called imputation. Fill() & Replace() are used for imputation. 
 21 | # MAGIC 
 22 | # MAGIC *Let's not go deep in theory and see the action.
 23 | 
 24 | # COMMAND ----------
 25 | 
 26 | # MAGIC %run ../SETUP/_pyspark_init_setup
 27 | 
 28 | # COMMAND ----------
 29 | 
 30 | from pyspark.sql.functions import col
 31 | 
 32 | # COMMAND ----------
 33 | 
 34 | df = spark.read.option("header", "true").csv('/FileStore/datasets/missing_val_dataset.csv')
 35 | 
 36 | # COMMAND ----------
 37 | 
 38 | display(df)
 39 | 
 40 | # COMMAND ----------
 41 | 
 42 | df.printSchema()
 43 | 
 44 | # COMMAND ----------
 45 | 
 46 | # MAGIC %md
 47 | # MAGIC ## Check Null values count
 48 | 
 49 | # COMMAND ----------
 50 | 
 51 | 
 52 | #Lets check null values for one column. It will give you
 53 | null_count = df.filter(col("cloud").isNull()).count()  # count of all the null values
 54 | not_null_count = df.filter(col("vintage").isNotNull()).count() # count of not null values
 55 | print("Total null values: ", null_count)
 56 | print("Total not null values: ", not_null_count)
 57 | 
 58 | # COMMAND ----------
 59 | 
 60 | # You can get same result using sql expression
 61 | null_count_sql = df.filter("cloud IS NULL").count()  # count of all the null values
 62 | not_null_count_sql = df.filter("vintage IS NOT NULL").count() # count of not null values
 63 | print("Total null values: ", null_count_sql)
 64 | print("Total not null values: ", not_null_count_sql)
 65 | 
 66 | # COMMAND ----------
 67 | 
 68 |  # Check null count based on multiple columns
 69 | 
 70 | null_count_ml = df.filter(col("cloud").isNull() & col("vintage").isNull() ).count()  # count of all the null values
 71 | not_null_count_ml = df.filter(col("cloud").isNotNull() & col("cloud").isNotNull()  ).count() # count of not null values
 72 | print("Total null values: ", null_count_ml)
 73 | print("Total not null values: ", not_null_count_ml)
 74 | 
 75 | # COMMAND ----------
 76 | 
 77 | display(df.filter(col("cloud").isNull()))
 78 | 
 79 | # COMMAND ----------
 80 | 
 81 | df.count()
 82 | 
 83 | # COMMAND ----------
 84 | 
 85 | from pyspark.sql.functions import count, when, isnan, lower
 86 | 
 87 | # COMMAND ----------
 88 | 
 89 | # Here we can see we will not catch 'NA' values from vintage as NA is not a valid null character. To catch that we can 
 90 | df_null = df.select([count(when( isnan(c) | col(c).isNull() , c)).alias(c) for c in df.columns])
 91 | display(df_null)
 92 | 
 93 | # COMMAND ----------
 94 | 
 95 | # So to catch NA we added one more condition. Now in our below condition we are considering "NA" as invalid values.
 96 | df_null = df.select([count(when( isnan(c) | col(c).isNull() | (col(c) == "NA") , c)).alias(c) for c in df.columns])
 97 | display(df_null)
 98 | 
 99 | # COMMAND ----------
100 | 
101 | # MAGIC %md
102 | # MAGIC Fill(), DROP() & REPLACE() function will handle only system null values. So we should replace our "NA" values with system NULL. Below code is doing the same thing. It is replacn "NA" with None.
103 | 
104 | # COMMAND ----------
105 | 
106 | # Here we are replacing all the NA values to null as NA values are not valid missing value. So to process them we are changing them to system null values.
107 | df_trans = df.withColumn("vintage", when(col("vintage") == "NA", None)
108 |                         .otherwise(col("vintage")) )
109 | 
110 | display(df_trans)
111 | 
112 | # COMMAND ----------
113 | 
114 | # Now let's check null value counts after reaplcing "NA" with None.
115 | df_null = df_trans.select([count(when( isnan(c) | col(c).isNull()  , c)).alias(c) for c in df_trans.columns])
116 | display(df_null)
117 | 
118 | # COMMAND ----------
119 | 
120 | # MAGIC %md
121 | # MAGIC ### Handle Missing Values
122 | # MAGIC There are multiple way to deal with missing values.
123 | 
124 | # COMMAND ----------
125 | 
126 | # MAGIC %md
127 | # MAGIC ### Way1 : Drop
128 | # MAGIC 
129 | # MAGIC Drop rows with null values. 
130 | # MAGIC 
131 | # MAGIC It has 2 options __Any__ and __All__.
132 | # MAGIC 
133 | # MAGIC In "Any" row is deleted, if any one of all the columns has a null or invalid values. In contrary, "All" will delete row only when all the columns has null values. You can also use subset if the columns. 
134 | # MAGIC 
135 | # MAGIC *Let's check with examples...
136 | 
137 | # COMMAND ----------
138 | 
139 | # before dropping lets count the total number of rows.
140 | total_count=df_trans.count() # total count
141 | print(total_count)
142 | 
143 | # COMMAND ----------
144 | 
145 | # it will drop complete row if any one column has null value. It will not NA value rows from vintage column
146 | # Using any deleted 59 columns out of 61.
147 | dropna_any = df_trans.na.drop(how="any") 
148 | display(dropna_any)
149 | 
150 | # COMMAND ----------
151 | 
152 | # it will drop complete row if all the column has null value. We can see this statement did not delete any records as there are no records with all column null. It did not delete any as there are no rows with all null values.
153 | df.na.drop(how="all").count() 
154 | 
155 | # COMMAND ----------
156 | 
157 | 
158 | 
159 | # COMMAND ----------
160 | 
161 | # Take null value count in variable so that we can verify later with our drop() function.
162 | cld_all_null = (df_trans.filter("cloud IS NULL AND vintage IS NULL")).count()
163 | print(cld_all_null)
164 | 
165 | cld_any_null =  (df_trans.filter("cloud IS NULL OR vintage IS NULL")).count()
166 | print(cld_any_null)
167 | 
168 | # COMMAND ----------
169 | 
170 | # apply drop to few columns. This will delete rows where either cloud or vintage is null. This will delte 59 rows, you can confirm with variable cld_any_null
171 | df_trans.na.drop(how="any", subset=["cloud", "vintage"]).count() 
172 | # this count is after deleting rows. total_rows - cld_any_null
173 | 
174 | # COMMAND ----------
175 | 
176 | # apply drop to few columns. This will delete rows where cloud and vintage both columns are null. It will delete 8 rows as you can confirm with variable cld_all_null value.
177 | df_trans.na.drop(how="all", subset=["cloud", "vintage"]).count()
178 | # this count is after deleting rows. total_rows - cld_all_null
179 | 
180 | # COMMAND ----------
181 | 
182 | # MAGIC %md
183 | # MAGIC ### Way 2: Fill
184 | # MAGIC Now let's see how to fill null value. This is an imputation technique.
185 | 
186 | # COMMAND ----------
187 | 
188 | # Fill function is used to fill null values. Below code will replace all the null values in all the columns with NULL_VAL_REPLACE value. You can use any custom value. Just take care of column length and type.
189 | 
190 | display(df_trans.na.fill("NULL_VAL_REPLACE").select("cloud", "vintage"))
191 | 
192 | # COMMAND ----------
193 | 
194 | # You can use subset if you want to fill missing values only in few columns.
195 | display(df_trans.na.fill("MISSING_VALUE", subset=["cloud", "vintage"]))
196 | 
197 | # COMMAND ----------
198 | 
199 | # You can define new value for each column. Below code will replace all the null values in cloud with NEW_VALUE & vintage with 2022.
200 | 
201 | missing_values = {
202 |     "cloud": "NEW_VALUE",
203 |     "vintage": "2022"
204 | }
205 | display(df_trans.na.fill(missing_values))
206 | 
207 | # COMMAND ----------
208 | 
209 | # MAGIC %md
210 | # MAGIC ### Way 3: REPLACE
211 | 
212 | # COMMAND ----------
213 | 
214 | display(df_trans.na.replace("GloBI", "NEW_GLOBI_NAME"))
215 | 
216 | # COMMAND ----------
217 | 
218 | # MAGIC %run ../SETUP/_pyspark_clean_up
219 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS13-Deduplication.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### Deduplication
 4 | # MAGIC Duplicate rows are a big issue in the world of big data. It can not only affect you analysis but also takes extra storage which in result cost you more.
 5 | # MAGIC 
 6 | # MAGIC __Deduplication__ is the process of removing duplicate data from your dataset.
 7 | # MAGIC 
 8 | # MAGIC So it is important to find out & cure duplicate rows. There may be case where you want to drop duplicate data (but not always). 
 9 | # MAGIC 
10 | # MAGIC We will use:
11 | # MAGIC 1. Distinct - It gives you distinct resultset which mean all the rows are unique. There are no duplicate rows. 
12 | # MAGIC 1. DropDuplicates() - This function will help you to drop duplicate rows from your dataset.
13 | # MAGIC 
14 | # MAGIC ![DUPLICATE_DATA](https://raw.githubusercontent.com/martandsingh/images/master/duplicate.jpg)
15 | 
16 | # COMMAND ----------
17 | 
18 | # MAGIC %run ../SETUP/_pyspark_init_setup
19 | 
20 | # COMMAND ----------
21 | 
22 | from pyspark.sql.types import StructField, StructType, StringType, DecimalType, IntegerType
23 | 
24 | # COMMAND ----------
25 | 
26 | # We are using a game steam dataset.
27 | custom_schema = StructType(
28 | [
29 |     StructField("gamer_id", IntegerType(), True),
30 |     StructField("game", StringType(), True),
31 |     StructField("behaviour", StringType(), True),
32 |     StructField("play_hours", DecimalType(), True),
33 |     StructField("rating", IntegerType(), True)
34 | ])
35 | df = spark.read.option("header", "true").schema(custom_schema).csv('/FileStore/datasets/steam-200k.csv')
36 | display(df)
37 | 
38 | # COMMAND ----------
39 | 
40 | # MAGIC %md
41 | # MAGIC ### DISTINCT()
42 | 
43 | # COMMAND ----------
44 | 
45 | # Lets check if our recordset has any duplicate rows
46 | total_count = df.count()
47 | distinct_count = df.distinct().count()
48 | print('total: ', total_count)
49 | print('distinct: ', distinct_count)
50 | print('duplicate: ', total_count-distinct_count)
51 | # so you can see we have few duplicate records. Keep in mind these duplicate records are comparing whole row which mean there exist two or more rows which have exact same value for all the columns in the table.
52 | 
53 | # COMMAND ----------
54 | 
55 | # Lets find out duplicate values for specific column or multiple columns
56 | # lets find all the distinct game names. there are 5155 distinct games in our dataset.
57 | print(df.select("game").distinct().count())
58 | 
59 | 
60 | # COMMAND ----------
61 | 
62 | # Find distinct record based on game & behaviour.
63 | distinct_selected_column= df.select("game", "behaviour").distinct().count()
64 | print(distinct_selected_column)
65 | 
66 | # COMMAND ----------
67 | 
68 | # So we saw how we can find distinct records based on few columns and all the columns. Now let's see how to drop duplicates values from our dataframe.
69 | 
70 | # COMMAND ----------
71 | 
72 | # MAGIC %md
73 | # MAGIC ### DropDuplicates()
74 | 
75 | # COMMAND ----------
76 | 
77 | # you can match this count with our distinct_count variable.  
78 | df_distinct =  df.drop_duplicates()
79 | df_distinct.count()
80 | 
81 | # COMMAND ----------
82 | 
83 | # Drop duplicates on selected column. It will check the duplicate rows based on combination of the given columns. it should match our distinct_selected_column value.
84 | df_drop_selected = df.drop_duplicates(["game", "behaviour"])
85 | df_drop_selected.count()
86 | 
87 | # COMMAND ----------
88 | 
89 | # MAGIC %run ../SETUP/_pyspark_clean_up
90 | 
91 | # COMMAND ----------
92 | 
93 | 
94 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS14-Data Profiling using PySpark.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### What is Data Profiling?
  4 | # MAGIC Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. Data profiling produces critical insights into data that companies can then leverage to their advantage.
  5 | # MAGIC 
  6 | # MAGIC In this notebook, we will learn few methods to generate our data profile. There are many application available in the market which can help you with data profiling. My main motto of this notebook is to explain how can anyone perform data profiling without purchasing third-party softwares.
  7 | # MAGIC 
  8 | # MAGIC Also if you understand the correct concept, then you may design your own custom data quality checks which may not be available in any other softwares, As different organization has different business rules. No single software can cover all the requirement, so it is better to know some core concepts.
  9 | # MAGIC 
 10 | # MAGIC Data profling is a part of Data Quality Checks. If you want to know more about data quality, you can refer:
 11 | # MAGIC 
 12 | # MAGIC https://www.marketingevolution.com/marketing-essentials/data-quality
 13 | # MAGIC 
 14 | # MAGIC ![DQC](https://raw.githubusercontent.com/martandsingh/images/master/dqc.jpg)
 15 | 
 16 | # COMMAND ----------
 17 | 
 18 | # MAGIC %run ../SETUP/_pyspark_init_setup
 19 | 
 20 | # COMMAND ----------
 21 | 
 22 | # MAGIC %md
 23 | # MAGIC ### Create Delta Table
 24 | # MAGIC Let's create a delta table to store our data profiling. We can create table using SQL command. You can write SQL code in the notebook using %sql keyword.
 25 | 
 26 | # COMMAND ----------
 27 | 
 28 | # MAGIC %sql
 29 | # MAGIC -- Create a new database for DQC.
 30 | # MAGIC CREATE DATABASE IF NOT EXISTS DB_DQC;
 31 | # MAGIC USE DB_DQC;
 32 | 
 33 | # COMMAND ----------
 34 | 
 35 | # MAGIC %sql
 36 | # MAGIC --Lets create a delta table to store our profiling data
 37 | # MAGIC CREATE OR REPLACE TABLE data_profiling
 38 | # MAGIC (
 39 | # MAGIC   dqc_name VARCHAR(100),
 40 | # MAGIC   stage VARCHAR(20),
 41 | # MAGIC   db_name VARCHAR(50),
 42 | # MAGIC   table_name VARCHAR(50),
 43 | # MAGIC   column_name VARCHAR(50),
 44 | # MAGIC   dqc_value VARCHAR(50),
 45 | # MAGIC   query VARCHAR(2000),
 46 | # MAGIC   description VARCHAR(100),
 47 | # MAGIC   created_on TIMESTAMP
 48 | # MAGIC  );
 49 | 
 50 | # COMMAND ----------
 51 | 
 52 | df = spark.read.parquet('/FileStore/datasets/USED_CAR_PARQUET/')
 53 | display(df.limit(4))
 54 | 
 55 | # COMMAND ----------
 56 | 
 57 | from pyspark.sql.functions import col, length
 58 | from datetime import datetime
 59 | 
 60 | # COMMAND ----------
 61 | 
 62 | # get list of column & its type
 63 | def get_column_types(df_source):
 64 |     dict_types= {}
 65 |     for column in  df.schema.fields:
 66 |         dict_types[column.name] = str(column.dataType)
 67 |     return dict_types
 68 |            
 69 | # Return dictionary with missing columns & count
 70 | def get_null_check(df_source):
 71 |     list_cols = df_source.columns
 72 |     dict_null = {}
 73 |     for column in list_cols:
 74 |         null_count=(df.filter(col(column).isNull()).count())
 75 |         dict_null[column] = null_count
 76 |     return dict_null
 77 | 
 78 | # get min value only for numeric columns
 79 | def get_min_val(df_source):
 80 |     data = {}
 81 |     for column in df_source.columns:
 82 |         dtype = str(df.schema[column].dataType)
 83 |         #print(dtype)
 84 |         if dtype.lower() in ["longtype", "inttype", "decimaltype", "floattype"]:
 85 |             min_val = df_source.agg({column: "min"}).collect()[0]["min("+column+")"]
 86 |             data[column] = str(min_val)
 87 |         else:
 88 |             data[column]="NA"
 89 |     return data
 90 | 
 91 | # get average value only for numeric columns
 92 | def get_avg_val(df_source):
 93 |     data = {}
 94 |     for column in df_source.columns:
 95 |         dtype = str(df.schema[column].dataType)
 96 |         #print(dtype)
 97 |         if dtype.lower() in ["longtype", "inttype", "decimaltype", "floattype"]:
 98 |             avg_val = df_source.agg({column: "avg"}).collect()[0]["avg("+column+")"]
 99 |             data[column] = str(avg_val)
100 |         else:
101 |             data[column]="NA"
102 |     return data
103 | 
104 | 
105 | # get max value only for numeric columns
106 | def get_max_val(df_source):
107 |     data = {}
108 |     for column in df_source.columns:
109 |         dtype = str(df.schema[column].dataType)
110 |         #print(dtype)
111 |         if dtype.lower() in ["longtype", "inttype", "decimaltype", "floattype"]:
112 |             max_val = df_source.agg({column: "max"}).collect()[0]["max("+column+")"]
113 |             data[column] = str(max_val)
114 |         else:
115 |             data[column]="NA"
116 |     return data
117 | 
118 | # get max length of the value in a string type column
119 | def get_max_length_val(df_source):
120 |     data = {}
121 |     for column in df_source.columns:
122 |         dtype = str(df.schema[column].dataType)
123 |         #print(dtype)
124 |         if dtype.lower() in ["stringtype"]:
125 |             df_len = df.withColumn("length", length(col(column))).select(column, "length")
126 |             max_length = df_len.agg({"length":"max"}).collect()[0]["max(length)"]
127 |             
128 |             data[column] = str(max_length)
129 |         else:
130 |             data[column]="NA"
131 |     return data
132 | 
133 | 
134 | # get min length of the value in a string type column
135 | def get_min_length_val(df_source):
136 |     data = {}
137 |     for column in df_source.columns:
138 |         dtype = str(df.schema[column].dataType)
139 |         #print(dtype)
140 |         if dtype.lower() in ["stringtype"]:
141 |             df_len = df.withColumn("length", length(col(column))).select(column, "length")
142 |             min_length = df_len.agg({"length":"min"}).collect()[0]["min(length)"]
143 |             
144 |             data[column] = str(min_length)
145 |         else:
146 |             data[column]="NA"
147 |     return data
148 | 
149 | #get count of values matching for the given regex. In our case we are counting values with special characters.
150 | def special_character_check(df_source):
151 |     data = {}
152 |     for column in df_source.columns:
153 |         dtype = str(df.schema[column].dataType)
154 |         #print(dtype)
155 |         if dtype.lower() in ["stringtype"]:
156 |             special_character_count = df.filter(col(column).rlike("[()`~/\!@#$%^&*()']")).count()
157 |             data[column] = str(special_character_count)
158 |         else:
159 |             data[column]="NA"
160 |     return data
161 | 
162 | #Pass the required parameter and it will log DQC stats to delta table.
163 | def log_dqc_checks(dict_dqc, dqc_name, db_name, table_name, stage, query="", description=""):
164 |     data =[]
165 |     for key in dict_dqc:
166 |         dict_result = {
167 |           "dqc_name" : dqc_name,
168 |           "stage" : stage ,
169 |           "db_name" : db_name,
170 |           "table_name" :table_name,
171 |           "column_name": key,
172 |           "dqc_value" : dict_dqc[key],
173 |           "query" : query,
174 |           "description": description,
175 |           "created_on" : datetime.now()
176 |         }
177 |         data.append(dict_result)
178 |     df = spark.createDataFrame(data)
179 |     df_sort = df \
180 |     .select("dqc_name", "stage", "db_name", "table_name", "column_name", "dqc_value", "query", "description", "created_on")
181 |     df_sort.write.insertInto("DB_DQC.data_profiling", overwrite=False)
182 |     
183 | 
184 | # COMMAND ----------
185 | 
186 | def dqc_pipeline(df_source):
187 |     DB_NAME = "DB_DQC"
188 |     TABLE_NAME = "data_profiling"
189 |     STAGE = "PRE_ETL"
190 |     
191 |     print("Column type dqc in progress...")
192 |     dict_types = get_column_types(df_source)
193 |     log_dqc_checks(dict_types, "COLUMN_TYPE_DQC", DB_NAME, TABLE_NAME, STAGE)
194 |     print("Column type dqc completed.")
195 |     
196 |     # Log NULL CHECK DQC
197 |     print("Missing value dqc in progress...")
198 |     dict_null = get_null_check(df_source)
199 |     log_dqc_checks(dict_null, "NULL_COUNT_DQC", DB_NAME, TABLE_NAME, STAGE)
200 |     print("Missing value dqc completed.")
201 |     
202 |     print("Minimum value dqc in progress...")
203 |     dict_min = get_min_val(df_source)
204 |     log_dqc_checks(dict_min, "MIN_VAL_DQC", DB_NAME, TABLE_NAME, STAGE)
205 |     print("Minimum value dqc completed.")
206 |     
207 |     print("Maximum value dqc in progress...")
208 |     dict_max = get_max_val(df_source)
209 |     log_dqc_checks(dict_max, "MAX_VAL_DQC", DB_NAME, TABLE_NAME, STAGE)
210 |     print("Maximum value dqc completed.")
211 |     
212 |     print("Average value dqc in progress...")
213 |     dict_avg = get_avg_val(df_source)
214 |     log_dqc_checks(dict_max, "AVG_VAL_DQC", DB_NAME, TABLE_NAME, STAGE)
215 |     print("Average value dqc completed.")
216 |     
217 |     print("Maximum length dqc in progress...")
218 |     dict_max_length = get_max_length_val(df_source)
219 |     log_dqc_checks(dict_max_length, "MAX_LENGTH_DQC", DB_NAME, TABLE_NAME, STAGE)
220 |     print("Maximum length dqc completed.")
221 |     
222 |     print("Minimum length dqc in progress...")
223 |     dict_min_length = get_min_length_val(df_source)
224 |     log_dqc_checks(dict_min_length, "MIN_LENGTH_DQC", DB_NAME, TABLE_NAME, STAGE)
225 |     print("Minimum length dqc completed.")
226 |     
227 |     print("Special character dqc in progress...")
228 |     dict_special_character = special_character_check(df_source)
229 |     log_dqc_checks(dict_special_character, "SPECIAL_CHAR_DQC", DB_NAME, TABLE_NAME, STAGE)
230 |     print("Special character dqc completed.")
231 | 
232 | # COMMAND ----------
233 | 
234 | dqc_pipeline(df)
235 | 
236 | # COMMAND ----------
237 | 
238 | # MAGIC %sql
239 | # MAGIC SELECT * FROM data_profiling ORDER BY dqc_name
240 | 
241 | # COMMAND ----------
242 | 
243 | # MAGIC %sql
244 | # MAGIC SELECT dqc_name, COUNT(dqc_value) FROM data_profiling WHERE (dqc_value) > 1 GROUP BY dqc_name 
245 | 
246 | # COMMAND ----------
247 | 
248 | # MAGIC %md
249 | # MAGIC ### What we will do?
250 | # MAGIC We are trying to profile our data which mean we are trying to record table statistic so that we can compare it later with post transformation data.
251 | # MAGIC 
252 | # MAGIC e.g. You have a unprocessed table, it has 5 columns & 100000 records. So we will calculate few basic statistics like total row count, total number of null values for each column, total distinct count, duplicate rows, datatype etc.
253 | # MAGIC 
254 | # MAGIC ### How it is useful?
255 | # MAGIC Once you record these stats, then later we can calculate same statistic after cleaning data. In this way we can compare pre & post transformation stats to see how our data has changed during data pipeline.
256 | 
257 | # COMMAND ----------
258 | 
259 | # MAGIC %sql
260 | # MAGIC DROP DATABASE DB_DQC CASCADE;
261 | 
262 | # COMMAND ----------
263 | 
264 | # MAGIC %run ../SETUP/_pyspark_clean_up
265 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS15-Data Caching.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### Caching - Optimize spark performance
  4 | # MAGIC Data caching is a very important technique when it comes to optimize your spark performance. Sometimes we have to reuse our big dataframe multiple times. It is not always prefered to load them frequently. SO what else we can do?
  5 | # MAGIC 
  6 | # MAGIC We can create a local or cached copy to quickly retreive data from it. In this case our original dataframe will not be used. We will be using the cached copy. There may be the cases, new updates are there in original dataframe but you still have stale or older version of data cached. We have to take care of it also.
  7 | # MAGIC 
  8 | # MAGIC ### Databricks provides two types of caching:
  9 | # MAGIC 1. Spark Caching
 10 | # MAGIC 1. Delta Caching
 11 | # MAGIC 
 12 | # MAGIC ### Delta and Apache Spark caching
 13 | # MAGIC 
 14 | # MAGIC The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed.
 15 | # MAGIC 
 16 | # MAGIC The Delta cache works for all Parquet files and is not limited to Delta Lake format files. The Delta cache supports reading Parquet files in Amazon S3, DBFS, HDFS, Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2. It does not support other storage formats such as CSV, JSON, and ORC.
 17 | # MAGIC 
 18 | # MAGIC Here are the characteristics of each type:
 19 | # MAGIC 
 20 | # MAGIC * __Type of stored data__: The Delta cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries. The Spark cache can store the result of any subquery data and data stored in formats other than Parquet (such as CSV, JSON, and ORC).
 21 | # MAGIC 
 22 | # MAGIC * __Performance__: The data stored in the Delta cache can be read and operated on faster than the data in the Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
 23 | # MAGIC 
 24 | # MAGIC * __Automatic vs manual control__: When the Delta cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE SELECT command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache.
 25 | # MAGIC 
 26 | # MAGIC * __Disk vs memory-based__: The Delta cache is stored on the local disk, so that memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory.
 27 | # MAGIC 
 28 | # MAGIC *Attention: In this chapter  you may not able to see caching effect as our dataset is very small. To see a significant difference, your dataset must be big and must have complex processing. so this notebook is just to show you how we can use caching.*
 29 | # MAGIC 
 30 | # MAGIC __for more details visits:__
 31 | # MAGIC 
 32 | # MAGIC https://docs.databricks.com/delta/optimizations/delta-cache.html
 33 | 
 34 | # COMMAND ----------
 35 | 
 36 | # MAGIC %run ../SETUP/_pyspark_init_setup
 37 | 
 38 | # COMMAND ----------
 39 | 
 40 | # MAGIC %md
 41 | # MAGIC ### PySpark Caching
 42 | # MAGIC We have two methods to perform pyspark caching:
 43 | # MAGIC 1. cache()
 44 | # MAGIC 1. persist()
 45 | # MAGIC 
 46 | # MAGIC Both caching and persisting are used to save the Spark RDD, Dataframe, and Dataset’s. But, the difference is, RDD cache() method default saves it to memory (MEMORY_ONLY) whereas persist() method is used to store it to the user-defined storage level.
 47 | # MAGIC 
 48 | # MAGIC __Storage class__
 49 | # MAGIC 
 50 | # MAGIC class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
 51 | # MAGIC 
 52 | # MAGIC Storage level in persist():
 53 | # MAGIC 
 54 | # MAGIC Now, to decide the storage of RDD, there are different storage levels, which are given below −
 55 | # MAGIC 
 56 | # MAGIC * DISK_ONLY = StorageLevel(True, False, False, False, 1)
 57 | # MAGIC * DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
 58 | # MAGIC * MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)
 59 | # MAGIC * MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
 60 | # MAGIC * MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)
 61 | # MAGIC * MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
 62 | # MAGIC * MEMORY_ONLY = StorageLevel(False, True, False, False, 1)
 63 | # MAGIC * MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
 64 | # MAGIC * MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)
 65 | # MAGIC * MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
 66 | # MAGIC * OFF_HEAP = StorageLevel(True, True, True, False, 1)
 67 | 
 68 | # COMMAND ----------
 69 | 
 70 | from pyspark.sql.types import StructField, StructType, StringType, DecimalType, IntegerType
 71 | 
 72 | # COMMAND ----------
 73 | 
 74 | # We are using a game steam dataset.
 75 | custom_schema = StructType(
 76 | [
 77 |     StructField("gamer_id", IntegerType(), True),
 78 |     StructField("game", StringType(), True),
 79 |     StructField("behaviour", StringType(), True),
 80 |     StructField("play_hours", DecimalType(), True),
 81 |     StructField("rating", IntegerType(), True)
 82 | ])
 83 | df = spark.read.option("header", "true").schema(custom_schema).csv('/FileStore/datasets/steam-200k.csv')
 84 | display(df)
 85 | 
 86 | # COMMAND ----------
 87 | 
 88 | from pyspark.sql.functions import col
 89 | 
 90 | # COMMAND ----------
 91 | 
 92 | df_pre_cache =  df\
 93 |             .groupBy("game", "behaviour")\
 94 |             .mean("play_hours")
 95 | 
 96 | 
 97 | 
 98 | # COMMAND ----------
 99 | 
100 | display(df_pre_cache)
101 | 
102 | # COMMAND ----------
103 | 
104 | df_post_cache = df_pre_cache.cache()
105 | 
106 | # COMMAND ----------
107 | 
108 | display(df_post_cache)
109 | 
110 | # COMMAND ----------
111 | 
112 | # What will happen if we make changes in original dataset? Will it automaticlly update the cached copy. Let's drop a column in original datset and then compare both.
113 | 
114 | # step1: cache the original copy
115 | df_cach = df.cache()
116 | df = df.drop("gamer_id")
117 | 
118 | 
119 | # COMMAND ----------
120 | 
121 | display(df.limit(2))
122 | display(df_cach.limit(2))
123 | 
124 | # so you can see cached data does not gets updated. So always keep this in mind. If you have any changes in original dataframe then you have to delete the cached copy and create a new cache. Do not forget to delete the previous one as it will take extra storage. Always use caching for the dataset which does not gets updated frequently.
125 | 
126 | # COMMAND ----------
127 | 
128 | # MAGIC %md
129 | # MAGIC ### Persist()
130 | # MAGIC Persist is like cache but in this case you can define custom storage level.
131 | 
132 | # COMMAND ----------
133 | 
134 | from pyspark.storagelevel import StorageLevel
135 | df_persist = df.persist( StorageLevel.MEMORY_AND_DISK_2)
136 | 
137 | # COMMAND ----------
138 | 
139 | display(df_persist)
140 | 
141 | # COMMAND ----------
142 | 
143 | # MAGIC %md
144 | # MAGIC ### Delta Caching
145 | # MAGIC Delta caching is stored as local file on worker node. The Delta cache automatically detects when data files are created or deleted and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data.
146 | # MAGIC 
147 | # MAGIC The Delta cache automatically detects files that have been modified or overwritten after being cached. Any stale entries are automatically invalidated and evicted from the cache.
148 | # MAGIC 
149 | # MAGIC You have to enable delta caching using:
150 | # MAGIC 
151 | # MAGIC spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")
152 | 
153 | # COMMAND ----------
154 | 
155 | spark.conf.get("spark.databricks.io.cache.enabled") # it is default in my case. let's enable this
156 | 
157 | # COMMAND ----------
158 | 
159 | spark.conf.set("spark.databricks.io.cache.enabled", "true")
160 | 
161 | # COMMAND ----------
162 | 
163 | # Lets create a delta table and see how delta caching works
164 | df.write.format("delta").mode("overwrite").saveAsTable("default.gamestats")
165 | 
166 | # COMMAND ----------
167 | 
168 | # MAGIC %sql -- we have our table
169 | # MAGIC SELECT * FROM default.gamestats
170 | # MAGIC WHERE play_hours > 10
171 | 
172 | # COMMAND ----------
173 | 
174 | # MAGIC %sql
175 | # MAGIC -- lets cache above query using delta caching
176 | # MAGIC CACHE
177 | # MAGIC SELECT * FROM default.gamestats
178 | # MAGIC WHERE play_hours > 10
179 | 
180 | # COMMAND ----------
181 | 
182 | # MAGIC %sql
183 | # MAGIC -- After caching this query will take lesser time.
184 | # MAGIC SELECT * FROM default.gamestats
185 | # MAGIC WHERE play_hours > 10
186 | 
187 | # COMMAND ----------
188 | 
189 | # MAGIC %sql
190 | # MAGIC -- now lets make some changes in our table and then rerun the same query. Will cache be able to catch new changes?
191 | # MAGIC UPDATE default.gamestats
192 | # MAGIC SET rating = CASE WHEN play_hours <5 THEN 2.5 
193 | # MAGIC WHEN play_hours >=5 AND play_hours<10 THEN 3.5 
194 | # MAGIC ELSE 4.8 END
195 | 
196 | # COMMAND ----------
197 | 
198 | # MAGIC %sql
199 | # MAGIC -- so now we have updated rating column. Let;s see whether these new changes will be available in our cached copy?
200 | # MAGIC SELECT * FROM default.gamestats LIMIT 10;
201 | # MAGIC -- Yes, we can see updated changes in our cached query result.
202 | 
203 | # COMMAND ----------
204 | 
205 | # MAGIC %sql
206 | # MAGIC SELECT * FROM default.gamestats
207 | # MAGIC WHERE play_hours > 10
208 | 
209 | # COMMAND ----------
210 | 
211 | # MAGIC %run ../SETUP/_pyspark_clean_up
212 | 
213 | # COMMAND ----------
214 | 
215 | 
216 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS16-User Defined Functions.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### User defined functions (UDF)
 4 | # MAGIC UDF will allow us to apply the functions directly in the dataframes and SQL databases in python, without making them registering individually. It can also help us to create new columns to our dataframe, by applying a function via UDF to the dataframe column(s), hence it will extend our functionality of dataframe. It can be created using the udf() method.
 5 | 
 6 | # COMMAND ----------
 7 | 
 8 | # MAGIC %run ../SETUP/_pyspark_init_setup
 9 | 
10 | # COMMAND ----------
11 | 
12 | df = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
13 | display(df)
14 | 
15 | # COMMAND ----------
16 | 
17 | # MAGIC %md
18 | # MAGIC Let's create a UDF to extract order number from order id column. for example: In B-25601, 25601 is the order id.
19 | 
20 | # COMMAND ----------
21 | 
22 | # Step1: first create a python function
23 | def extract_order_no(order_id):
24 |     if order_id != None and '-' in order_id:
25 |         return order_id.split("-")[1]
26 |     else:
27 |         return 'NA'
28 | 
29 | 
30 | # COMMAND ----------
31 | 
32 | print(extract_order_no("B-25601"))
33 | print(extract_order_no("25601"))
34 | 
35 | # COMMAND ----------
36 | 
37 | # Step 2: convert python function to udf
38 | from pyspark.sql.functions import udf, col
39 | from pyspark.sql.types import StringType
40 | extract_order_no = udf(extract_order_no)
41 | 
42 | # COMMAND ----------
43 | 
44 | df_trans = df.withColumn("order_number", extract_order_no(col("Order ID")) ) 
45 | display(df_trans)
46 | 
47 | # COMMAND ----------
48 | 
49 | # MAGIC %run ../SETUP/_pyspark_clean_up
50 | 


--------------------------------------------------------------------------------
/PySpark_ETL/PS17-Write Data.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### Write DataFrame
  4 | # MAGIC This is one of the most important but easy topic in ETL pipelines. write data frame is "L - Load" in ETL. After you transform your data, you need t write it to db or some location (datalake, dbfs). We can use write function to do so.
  5 | # MAGIC 
  6 | # MAGIC 
  7 | # MAGIC df.write.mode("overwrite/append/ignore").csv("/path/file/")
  8 | # MAGIC 
  9 | # MAGIC __We have three write modes:__
 10 | # MAGIC * __append__: Append content of the dataframe to existing data or table.
 11 | # MAGIC * __overwrite__: Overwrite existing data with the content of dataframe.
 12 | # MAGIC * __ignore__: Ignore current write operation if data / table already exists without any error
 13 | # MAGIC * __error or errorifexists__: (default case): Throw an exception if data already exists.
 14 | 
 15 | # COMMAND ----------
 16 | 
 17 | # MAGIC %run ../SETUP/_pyspark_init_setup
 18 | 
 19 | # COMMAND ----------
 20 | 
 21 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
 22 | df_od = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderdetails.csv")
 23 | 
 24 | display(df_ol.limit(3))
 25 | display(df_od.limit(3))
 26 | 
 27 | # COMMAND ----------
 28 | 
 29 | # MAGIC %md
 30 | # MAGIC ### Problem
 31 | # MAGIC Let's say our business user wants us to get all the orders which are above $500 & category is Clothing in Maharashtra state.
 32 | 
 33 | # COMMAND ----------
 34 | 
 35 | 
 36 | df_mah = df_ol\
 37 |             .join(df_od, df_ol["Order ID"] == df_od["Order ID"], "inner")\
 38 |             .filter((df_od["Category"] == "Clothing") & (df_od["Amount"] > 500) & (df_ol["State"]=="Maharashtra"))\
 39 |             .withColumn("Amount", df_od["Amount"].cast("decimal(10, 2)"))\
 40 |             .withColumn("order_id", df_ol["Order ID"])\
 41 |             .select("order_id", "State", "City", "Category", "Amount")
 42 | display(df_mah)
 43 | 
 44 | # COMMAND ----------
 45 | 
 46 | df_mah.printSchema() # so you can see we have rename column & Amount type is changed to decimal(10, 2)
 47 | 
 48 | # COMMAND ----------
 49 | 
 50 | # MAGIC %md
 51 | # MAGIC ### Write as parquet files
 52 | 
 53 | # COMMAND ----------
 54 | 
 55 | # So now after doing this analysis & transformation we need to write this result to somewhere. For the sake of demo I will use delta lake to write the output.
 56 | 
 57 | # Write parquet files
 58 | df_mah.write.mode("overwrite").parquet("/FileStore/output/ClothingSalesMah_par/")
 59 | 
 60 | # COMMAND ----------
 61 | 
 62 | # MAGIC %md
 63 | # MAGIC ### Write as CSV
 64 | 
 65 | # COMMAND ----------
 66 | 
 67 | 
 68 | # Write CSV files
 69 | df_mah.write.mode("overwrite").csv("/FileStore/output/ClothingSalesMah_csv/")
 70 | 
 71 | # COMMAND ----------
 72 | 
 73 | # MAGIC %md
 74 | # MAGIC ### Write as JSON
 75 | 
 76 | # COMMAND ----------
 77 | 
 78 | # Write JSON files
 79 | df_mah.write.mode("overwrite").json("/FileStore/output/ClothingSalesMah_json/")
 80 | 
 81 | # COMMAND ----------
 82 | 
 83 | # MAGIC %md
 84 | # MAGIC ### Write as delta table (Managed)
 85 | 
 86 | # COMMAND ----------
 87 | 
 88 | # Write delta lake table. This will create a delta table in default db (you can choose any db). The table name is OrderSalesMah.
 89 | #df.write.format("delta").saveAsTable("default.people10m")
 90 | df_mah.write.format("delta").saveAsTable("default.OrderSalesMah")
 91 | 
 92 | # COMMAND ----------
 93 | 
 94 | # MAGIC %sql
 95 | # MAGIC -- you can query above table
 96 | # MAGIC SELECT * FROM default.OrderSalesMah LIMIT 10;
 97 | 
 98 | # COMMAND ----------
 99 | 
100 | # MAGIC %sql
101 | # MAGIC -- You can observer type, location and owner. It tells us that this is a managed delta table.
102 | # MAGIC DESCRIBE EXTENDED OrderSalesMah
103 | 
104 | # COMMAND ----------
105 | 
106 | # MAGIC %md
107 | # MAGIC ### Write to delta lake
108 | 
109 | # COMMAND ----------
110 | 
111 | ## Write to delta lake. It will save by default in parquet
112 | df_mah.write.format("delta").mode("overwrite").save("/FileStore/output/ClothingSalesMah_delta/")
113 | 
114 | # COMMAND ----------
115 | 
116 | # MAGIC %run ../SETUP/_pyspark_clean_up
117 | 


--------------------------------------------------------------------------------
/PySpark_ETL/Z01- Case Study Sales Order Analysis.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### Case Study - Sales Order Analysis
  4 | # MAGIC 
  5 | # MAGIC ![SALES_ORDER](https://raw.githubusercontent.com/martandsingh/images/master/case_study_1.jpg)
  6 | # MAGIC 
  7 | # MAGIC We have Order-Sales dataset. It includes three dataset:
  8 | # MAGIC 
  9 | # MAGIC __Order List - it contains the list of all the order with amount, city, state.__
 10 | # MAGIC 
 11 | # MAGIC Order List
 12 | # MAGIC * Order ID: string (nullable = true)
 13 | # MAGIC * Order Date: string (nullable = true)
 14 | # MAGIC * CustomerName: string (nullable = true)
 15 | # MAGIC * State: string (nullable = true)
 16 | # MAGIC * City: string (nullable = true)
 17 | # MAGIC 
 18 | # MAGIC 
 19 | # MAGIC __Order Details - detail of the order. Order list has 1-to-many relationship with this dataset.__
 20 | # MAGIC 
 21 | # MAGIC Order Details
 22 | # MAGIC * Order ID: string (nullable = true)
 23 | # MAGIC * Amount: string (nullable = true)
 24 | # MAGIC * Profit: string (nullable = true)
 25 | # MAGIC * Quantity: string (nullable = true)
 26 | # MAGIC * Category: string (nullable = true)
 27 | # MAGIC * Sub-Category: string (nullable = true)
 28 | # MAGIC 
 29 | # MAGIC 
 30 | # MAGIC __Sales Target - This contains the monthly sales target of product category.__
 31 | # MAGIC 
 32 | # MAGIC Sales Target
 33 | # MAGIC * Month of Order Date: string (nullable = true)
 34 | # MAGIC * Category: string (nullable = true)
 35 | # MAGIC * Target: string (nullable = true)
 36 | # MAGIC  
 37 | # MAGIC  
 38 | # MAGIC __We will try to answer following question asked by our business user:__
 39 | # MAGIC 1. Top 10 most selling categories & sub-categories (based on number of orders).
 40 | # MAGIC 1. Which order has the highest & lowest profit.
 41 | # MAGIC 1. Top 10 states & cities with highest total bill amount
 42 | # MAGIC 1. In which month & year we received most number of orders with total amount (show top 10).
 43 | # MAGIC 1. Which category fullfiled the month target. Add one extra column "IsTargetCompleted" with values Yes or No.
 44 | 
 45 | # COMMAND ----------
 46 | 
 47 | # MAGIC %md
 48 | # MAGIC ### Load datasets
 49 | 
 50 | # COMMAND ----------
 51 | 
 52 | # MAGIC %run ../SETUP/_pyspark_init_setup
 53 | 
 54 | # COMMAND ----------
 55 | 
 56 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv")
 57 | df_od = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderdetails.csv")
 58 | df_st = spark.read.option("header", "true").csv("/FileStore/datasets/sales/salestarget.csv")
 59 | 
 60 | # COMMAND ----------
 61 | 
 62 | # MAGIC %md
 63 | # MAGIC #### Check Schema
 64 | 
 65 | # COMMAND ----------
 66 | 
 67 | df_ol.printSchema()
 68 | df_od.printSchema()
 69 | df_st.printSchema()
 70 | 
 71 | # COMMAND ----------
 72 | 
 73 | # MAGIC %md
 74 | # MAGIC ####  Top 10 most selling categories & sub-categories.
 75 | 
 76 | # COMMAND ----------
 77 | 
 78 | from pyspark.sql.functions import col, lit
 79 | 
 80 | # COMMAND ----------
 81 | 
 82 | df_most_selling_cat = df_od \
 83 |         .groupBy("category", "Sub-Category")\
 84 |         .count()\
 85 |         .withColumnRenamed("count", "total_records")\
 86 |         .orderBy(col("total_records").desc())\
 87 |         .limit(10)
 88 | display(df_most_selling_cat)
 89 | 
 90 | # COMMAND ----------
 91 | 
 92 | # MAGIC %md
 93 | # MAGIC ####  Which order has the highest & lowest profit.
 94 | 
 95 | # COMMAND ----------
 96 | 
 97 | df_order_profit= df_od\
 98 |             .withColumn("Profit_Numeric", col('Profit').cast("decimal") )\
 99 |             .groupBy("Order ID")\
100 |             .sum("Profit_Numeric").withColumnRenamed("sum(Profit_Numeric)", "total_profit")\
101 | 
102 | lowest = df_order_profit\
103 |         .withColumn("Type", lit("Lowest"))\
104 |         .orderBy("total_profit")\
105 |         .limit(1)\
106 |       
107 | 
108 | highest= df_order_profit\
109 |          .withColumn("Type", lit("Highest"))\
110 |          .orderBy(col("total_profit").desc())\
111 |          .limit(1)\
112 |          
113 | 
114 | df_profit_stats =lowest.union(highest)
115 | display(df_profit_stats)
116 | 
117 | # COMMAND ----------
118 | 
119 | # MAGIC %md
120 | # MAGIC #### Top 10 states & cities with highest total bill amount
121 | 
122 | # COMMAND ----------
123 | 
124 | df_high_city = df_ol\
125 |             .join(df_od, df_ol["Order ID"] == df_od["Order ID"], "inner")\
126 |             .selectExpr("State", "City", "CAST(Amount AS Decimal) AS amount_decimal")\
127 |             .groupBy("State", "City")\
128 |             .sum("amount_decimal")\
129 |             .withColumnRenamed("sum(amount_decimal)", "total_amount")\
130 |             .orderBy(col("total_amount").desc())\
131 |             .limit(10)
132 | 
133 | 
134 | display(df_high_city)
135 | 
136 | # COMMAND ----------
137 | 
138 | # MAGIC %md
139 | # MAGIC #### In which month & year we received most number of orders with total amount (show top 10)
140 | 
141 | # COMMAND ----------
142 | 
143 | # to do this first we have to add a new columne which contain order date in date format
144 | df_date =  df_ol\
145 |             .join(df_od, df_ol["Order ID"] == df_od["Order ID"], "inner")\
146 |             .select(df_ol["Order ID"], "Order Date", "Amount")
147 | display(df_date)
148 | 
149 | # COMMAND ----------
150 | 
151 | df_date.printSchema()
152 | 
153 | # COMMAND ----------
154 | 
155 | from pyspark.sql.functions import to_date, month, year, date_format
156 | 
157 | # COMMAND ----------
158 | 
159 | df_year_month = df_date\
160 |         .withColumn("order_date", to_date("Order Date", "dd-MM-yyyy"))\
161 |         .withColumn("order_month", date_format("order_date", "MMM"))\
162 |         .withColumn("order_year", year("order_date"))\
163 |         .groupBy("order_year", "order_month")\
164 |         .agg({"Amount":"sum", "Order ID":"count"})\
165 |         .withColumnRenamed("count(Order ID)", "order_count")\
166 |         .withColumnRenamed("sum(Amount)", "total_amount")\
167 |         .orderBy(col("order_count").desc(), col("total_amount").desc())\
168 |         .limit(10)
169 | 
170 | 
171 | display(df_year_month)
172 | 
173 | # COMMAND ----------
174 | 
175 | # MAGIC %md
176 | # MAGIC #### Which category fullfiled the month target. Add one extra column "IsTargetCompleted" with values Yes or No.
177 | 
178 | # COMMAND ----------
179 | 
180 | from pyspark.sql.functions import concat_ws, substring, when
181 | 
182 | # COMMAND ----------
183 | 
184 | df_order_details  = df_ol\
185 |                 .join(df_od, df_ol["Order ID"]==df_od["Order ID"], "inner")\
186 |                 .select(df_ol["Order ID"], "Order Date", "Amount", "Category")\
187 |                 .withColumn("order_date", to_date("Order Date", "dd-MM-yyyy"))\
188 |                 .withColumn("target_month"\
189 |                             , concat_ws("-", date_format("order_date", "MMM"), substring(year("order_date"), 3, 2) ) )\
190 |                 .withColumn("amount_decimal", col("Amount").cast("decimal"))\
191 |                 .groupBy("target_month", "Category")\
192 |                 .sum("amount_decimal")\
193 |                 .withColumnRenamed("sum(amount_decimal)", "total_month_sales_amount")
194 | 
195 | df_final_target = df_order_details\
196 |             .join(df_st\
197 |                   , (df_order_details["target_month"]==df_st["Month of Order Date"]) &\
198 |                   (df_order_details["Category"]==df_st["Category"]), "inner")\
199 |             .select(\
200 |                     df_order_details["Category"]\
201 |                     , "target_month"\
202 |                     , "total_month_sales_amount"\
203 |                     , "Target")\
204 |             .withColumn("TargetAcheived", when(col("Target") < col("total_month_sales_amount"), "No" )\
205 |                        .otherwise("Yes"))
206 | 
207 | 
208 | display(df_final_target)
209 | 
210 | # COMMAND ----------
211 | 
212 | # you can see how many category achevied the targets
213 | display(df_final_target.groupBy("TargetAcheived").count())
214 | 
215 | # COMMAND ----------
216 | 
217 | # you can see how many category achevied the targets with Category name
218 | display(df_final_target.groupBy("Category", "TargetAcheived").count().orderBy("Category"))
219 | 
220 | # COMMAND ----------
221 | 
222 | # MAGIC %run ../SETUP/_pyspark_clean_up
223 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ## Data Engineering Using Azure Databricks
 2 | 
 3 | ### Introduction
 4 | 
 5 | This course include multiple sections. We are mainly focusing on Databricks Data Engineer certification exam. We have following tutorials:
 6 | 1. Spark SQL ETL
 7 | 2. Pyspark ETL
 8 | 
 9 | ### DATASETS
10 | All the datasets used in the tutorials are available at: https://github.com/martandsingh/datasets
11 | 
12 | ### HOW TO USE?
13 | follow below article to learn how to clone this repository to your databricks workspace.
14 | 
15 | https://www.linkedin.com/pulse/databricks-clone-github-repo-martand-singh/
16 | 
17 | ### Spark SQL
18 | This course is the first installment of databricks data engineering course. In this course you will learn basic SQL concept which include:
19 | 1. Create, Select, Update, Delete tables
20 | 1. Create database
21 | 1. Filtering data
22 | 1. Group by & aggregation
23 | 1. Ordering
24 | 1. [SQL joins](https://www.scaler.com/topics/sql/joins-in-sql/)
25 | 1. Common table expression (CTE)
26 | 1. External tables
27 | 1. [Sub queries](https://www.geeksforgeeks.org/sql-subquery/)
28 | 1. Views & temp views
29 | 1. UNION, INTERSECT, EXCEPT keywords
30 | 1. Versioning, time travel & optimization
31 | 
32 | ### PySpark ETL
33 | This course will teach you how to perform ETL pipelines using pyspark. ETL stands for Extract, Load & Transformation. We will see how to load data from various sources & process it and finally will load the process data to our destination.
34 | 
35 | This course includes:
36 | 1. Read files
37 | 2. Schema handling
38 | 3. Handling JSON files
39 | 4. Write files
40 | 5. Basic transformations
41 | 6. partitioning
42 | 7. caching
43 | 8. joins
44 | 9. missing value handling
45 | 10. Data profiling
46 | 11. date time functions
47 | 12. string function
48 | 13. deduplication 
49 | 14. grouping & aggregation
50 | 15. User defined functions
51 | 16. Ordering data
52 | 17. Case study - sales order analysis
53 | 
54 | 
55 | 
56 | you can download all the notebook from our 
57 | 
58 | github repo: https://github.com/martandsingh/ApacheSpark
59 | 
60 | facebook: https://www.facebook.com/codemakerz
61 | 
62 | email: martandsays@gmail.com
63 | 
64 | ### SETUP folder
65 | you will see initial_setup & clean_up notebooks called in every notebooks. It is mandatory to run both the scripts in defined order. initial script will create all the mandatory tables & database for the demo. After you finish your notebook, execute clean up notebook, it will clean all the db objects.
66 | 
67 | pyspark_init_setup - this notebook will copy dataset from my github repo to dbfs. It will also generate used car parquet dataset. All the datasets will be avalable at
68 | 
69 | **/FileStore/datasets**
70 | 
71 | 
72 | ![d5859667-databricks-logo](https://user-images.githubusercontent.com/32331579/174993501-dc93102a-ec36-4607-a3dc-ab67a54a341b.png)
73 | 


--------------------------------------------------------------------------------
/SETUP/_clean_up.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | print('Cleaning up all the database & tables....')
 3 | 
 4 | # COMMAND ----------
 5 | 
 6 | # MAGIC %sql
 7 | # MAGIC DROP DATABASE IF EXISTS DB_DEMO CASCADE;
 8 | 
 9 | # COMMAND ----------
10 | 
11 | print('Tables & database deleted sucessfully.')
12 | 


--------------------------------------------------------------------------------
/SETUP/_initial_setup.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %run ./_setup_database
3 | 
4 | # COMMAND ----------
5 | 
6 | # MAGIC %run ./_setup_demo_table
7 | 


--------------------------------------------------------------------------------
/SETUP/_pyspark_clean_up.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | dbutils.fs.rm('/FileStore/datasets/', True)
 3 | 
 4 | # COMMAND ----------
 5 | 
 6 | 
 7 | 
 8 | # COMMAND ----------
 9 | 
10 | 
11 | 


--------------------------------------------------------------------------------
/SETUP/_pyspark_init_setup.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | #spark.conf.set("da", self.username)
 3 | 
 4 | # Defining datafile paths
 5 | CANCER_FILE_NAME='cancer.csv'
 6 | UNECE_FILE_NAME='unece.json'
 7 | USED_CAR_FILE_NAME='used_cars_nested.json'
 8 | MALL_CUSTOMER_FILE_NAME = 'Mall_Customers.csv'
 9 | DBFS_DATASET_LOCATION = '/FileStore/datasets/'
10 | CANCER_CSV_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + CANCER_FILE_NAME # CSV
11 | UNECE_JSON_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + UNECE_FILE_NAME # simple JSON
12 | USED_CAR_JSON_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + USED_CAR_FILE_NAME # complex JSON
13 | MALL_CUSTOMER_PATH='https://raw.githubusercontent.com/martandsingh/datasets/master/' + MALL_CUSTOMER_FILE_NAME
14 | DBFS_PARQUET_FILE = '/FileStore/datasets/USED_CAR_PARQUET/'
15 | HOUSE_PRICE_FILE = 'missing_val_dataset.csv'
16 | HOUSE_PRICE_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + HOUSE_PRICE_FILE
17 | GAME_STREAM_FILE = 'steam-200k.csv'
18 | GAME_STREAM_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + GAME_STREAM_FILE
19 | ORDER_DETAIL_FILE = 'orderdetails.csv'
20 | ORDER_DETAIL_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/Sales-Order/'+ORDER_DETAIL_FILE
21 | ORDER_LIST_FILE = 'orderlist.csv'
22 | ORDER_LIST_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/Sales-Order/' + ORDER_LIST_FILE
23 | SALES_TARGET_FILE = 'salestarget.csv'
24 | SALES_TARGET_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/Sales-Order/'+SALES_TARGET_FILE
25 | DA = {
26 |     "ORDER_DETAIL_FILE": ORDER_DETAIL_FILE,
27 |     "ORDER_DETAIL_PATH": ORDER_DETAIL_PATH,
28 |     "ORDER_LIST_FILE":ORDER_LIST_FILE,
29 |     "ORDER_LIST_PATH":ORDER_LIST_PATH,
30 |     "SALES_TARGET_FILE": SALES_TARGET_FILE,
31 |     "SALES_TARGET_PATH": SALES_TARGET_PATH,
32 |     "CANCER_FILE_NAME": CANCER_FILE_NAME,
33 |     "UNECE_FILE_NAME": UNECE_FILE_NAME,
34 |     "USED_CAR_FILE_NAME": USED_CAR_FILE_NAME,
35 |     "CANCER_CSV_PATH": CANCER_CSV_PATH,
36 |     "UNECE_JSON_PATH": UNECE_JSON_PATH,
37 |     "USED_CAR_JSON_PATH": USED_CAR_JSON_PATH,
38 |     "DBFS_DATASET_LOCATION": DBFS_DATASET_LOCATION,
39 |     "DBFS_PARQUET_FILE": DBFS_PARQUET_FILE,
40 |     "MALL_CUSTOMER_FILE_NAME": MALL_CUSTOMER_FILE_NAME,
41 |     "MALL_CUSTOMER_PATH": MALL_CUSTOMER_PATH,
42 |     "HOUSE_PRICE_FILE": HOUSE_PRICE_FILE,
43 |     "HOUSE_PRICE_PATH": HOUSE_PRICE_PATH,
44 |     "GAME_STREAM_FILE": GAME_STREAM_FILE,
45 |     "GAME_STREAM_PATH": GAME_STREAM_PATH
46 | }
47 | 
48 | 
49 | 
50 | # COMMAND ----------
51 | 
52 | print('Loading data files...')
53 | 
54 | # COMMAND ----------
55 | 
56 | dbutils.notebook.run(path="../SETUP/_pyspark_setup_files", timeout_seconds=60, arguments= DA)
57 | 
58 | # COMMAND ----------
59 | 
60 | print('Data files loaded.')
61 | 


--------------------------------------------------------------------------------
/SETUP/_pyspark_setup_files.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### INTIAL SETUP
 4 | # MAGIC This will copy files from github repository to your DBFS. 
 5 | # MAGIC 
 6 | # MAGIC Repo: https://github.com/martandsingh/datasets
 7 | # MAGIC 
 8 | # MAGIC You can customize your DBFS location by changin DBFS_DATASET_LOCATION variable.
 9 | 
10 | # COMMAND ----------
11 | 
12 | 
13 | 
14 | # COMMAND ----------
15 | 
16 | dbutils.fs.mkdirs('/FileStore/datasets')
17 | 
18 | # COMMAND ----------
19 | 
20 | cancer_file= dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("CANCER_FILE_NAME")
21 | unece_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("UNECE_FILE_NAME")
22 | used_car_file=dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("USED_CAR_FILE_NAME")
23 | mall_customer_file=dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("MALL_CUSTOMER_FILE_NAME")
24 | house_price_file=dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("HOUSE_PRICE_FILE")
25 | game_stream_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("GAME_STREAM_FILE")
26 | order_list_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+'/sales/'+dbutils.widgets.get("ORDER_LIST_FILE")
27 | order_details_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+'/sales/'+dbutils.widgets.get("ORDER_DETAIL_FILE")
28 | sales_target_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+'/sales/'+dbutils.widgets.get("SALES_TARGET_FILE")
29 | print(cancer_file)
30 | print(unece_file)
31 | print(used_car_file)
32 | 
33 | 
34 | # COMMAND ----------
35 | 
36 | 
37 | 
38 | # COMMAND ----------
39 | 
40 | dbutils.fs.cp(dbutils.widgets.get("CANCER_CSV_PATH"), cancer_file)
41 | 
42 | dbutils.fs.cp(dbutils.widgets.get("UNECE_JSON_PATH"),  unece_file)
43 | 
44 | dbutils.fs.cp(dbutils.widgets.get("USED_CAR_JSON_PATH"), used_car_file)
45 | 
46 | dbutils.fs.cp(dbutils.widgets.get("MALL_CUSTOMER_PATH"), mall_customer_file)
47 | 
48 | dbutils.fs.cp(dbutils.widgets.get("HOUSE_PRICE_PATH"), house_price_file)
49 | 
50 | dbutils.fs.cp(dbutils.widgets.get("GAME_STREAM_PATH"), game_stream_file)
51 | 
52 | dbutils.fs.cp(dbutils.widgets.get("ORDER_LIST_PATH"), order_list_file)
53 | 
54 | dbutils.fs.cp(dbutils.widgets.get("ORDER_DETAIL_PATH"), order_details_file)
55 | 
56 | dbutils.fs.cp(dbutils.widgets.get("SALES_TARGET_PATH"), sales_target_file)
57 | 
58 | 
59 | # COMMAND ----------
60 | 
61 | from pyspark.sql.functions import explode, col
62 | 
63 | # COMMAND ----------
64 | 
65 | parquet_path = dbutils.widgets.get("DBFS_PARQUET_FILE")
66 | print("Writing parquet file to "+ parquet_path)
67 | df = spark \
68 |     .read \
69 |     .option("multiline", "true")\
70 |     .json(used_car_file)
71 | 
72 | df_exploded = df \
73 |             .withColumn("usedCars", explode(df["usedCars"]))
74 | 
75 | df_clean = df_exploded \
76 |             .withColumn("vehicle_type", col("usedCars")["@type"])\
77 |             .withColumn("body_type", col("usedCars")["bodyType"])\
78 |             .withColumn("brand_name", col("usedCars")["brand"]["name"])\
79 |             .withColumn("color", col("usedCars")["color"])\
80 |             .withColumn("description", col("usedCars")["description"])\
81 |             .withColumn("model", col("usedCars")["model"])\
82 |             .withColumn("manufacturer", col("usedCars")["manufacturer"])\
83 |             .withColumn("ad_title", col("usedCars")["name"])\
84 |             .withColumn("currency", col("usedCars")["priceCurrency"])\
85 |             .withColumn("seller_location", col("usedCars")["sellerLocation"])\
86 |             .withColumn("displacement", col("usedCars")["vehicleEngine"]["engineDisplacement"])\
87 |             .withColumn("transmission", col("usedCars")["vehicleTransmission"])\
88 |             .withColumn("price", col("usedCars")["price"]) \
89 |             .drop("usedCars")
90 | df_clean.write.mode("overwrite").parquet(parquet_path)
91 | print("Parquet file is read.")
92 | 
93 | # COMMAND ----------
94 | 
95 | print('File loaded to DBFS ' + dbutils.widgets.get("DBFS_DATASET_LOCATION"))
96 | 


--------------------------------------------------------------------------------
/SETUP/_setup_database.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | print('Creating database DB_DEMO...')
 3 | 
 4 | # COMMAND ----------
 5 | 
 6 | # MAGIC %sql
 7 | # MAGIC CREATE DATABASE IF NOT EXISTS DB_DEMO;
 8 | # MAGIC USE DB_DEMO;
 9 | 
10 | # COMMAND ----------
11 | 
12 | print('Database DB_DEMO created successfully.')
13 | 


--------------------------------------------------------------------------------
/SETUP/_setup_demo_table.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %sql
  3 | # MAGIC CREATE TABLE IF NOT EXISTS club (
  4 | # MAGIC   club_id VARCHAR(10),
  5 | # MAGIC   club_name VARCHAR(50)
  6 | # MAGIC );
  7 | # MAGIC 
  8 | # MAGIC 
  9 | # MAGIC CREATE TABLE IF NOT EXISTS department (
 10 | # MAGIC   dept_id VARCHAR(10),
 11 | # MAGIC   dept_name VARCHAR(50)
 12 | # MAGIC );
 13 | # MAGIC 
 14 | # MAGIC CREATE TABLE IF NOT EXISTS employee
 15 | # MAGIC (
 16 | # MAGIC   empcode VARCHAR(10),
 17 | # MAGIC   firstname VARCHAR(50),
 18 | # MAGIC   lastname VARCHAR(50),
 19 | # MAGIC   dept_id VARCHAR(10),
 20 | # MAGIC   club_id VARCHAR(10)
 21 | # MAGIC );
 22 | # MAGIC 
 23 | # MAGIC CREATE TABLE meal
 24 | # MAGIC (
 25 | # MAGIC   meal_id VARCHAR(10),
 26 | # MAGIC   meal_name VARCHAR(50)
 27 | # MAGIC );
 28 | # MAGIC CREATE TABLE drink
 29 | # MAGIC (
 30 | # MAGIC   drink_id VARCHAR(10),
 31 | # MAGIC   drink_name VARCHAR(50)
 32 | # MAGIC );
 33 | # MAGIC 
 34 | # MAGIC CREATE TABLE emp_salary
 35 | # MAGIC (
 36 | # MAGIC   empcode VARCHAR(10),
 37 | # MAGIC   basic_salary DECIMAL(10, 2),
 38 | # MAGIC   transport DECIMAL(10, 2),
 39 | # MAGIC   accomodation DECIMAL(10, 2),
 40 | # MAGIC   food DECIMAL(10, 2),
 41 | # MAGIC   extra DECIMAL(10, 2)
 42 | # MAGIC );
 43 | # MAGIC CREATE TABLE supplier_india
 44 | # MAGIC (
 45 | # MAGIC   supp_id VARCHAR(10),
 46 | # MAGIC   supp_name VARCHAR(50),
 47 | # MAGIC   city VARCHAR(50)
 48 | # MAGIC );
 49 | # MAGIC CREATE TABLE supplier_nepal
 50 | # MAGIC (
 51 | # MAGIC   supp_id VARCHAR(10),
 52 | # MAGIC   supp_name VARCHAR(50),
 53 | # MAGIC   city VARCHAR(50)
 54 | # MAGIC );
 55 | 
 56 | # COMMAND ----------
 57 | 
 58 | # MAGIC %python
 59 | # MAGIC print('Preparing tables...')
 60 | # MAGIC print('Success: table club created')
 61 | # MAGIC print('Success: table department created')
 62 | # MAGIC print('Success: table employee created')
 63 | # MAGIC print('Success: table meal created')
 64 | # MAGIC print('Success: table drink created')
 65 | # MAGIC print('Success: table emp_salary created')
 66 | # MAGIC print('Success: table supplier_india created')
 67 | # MAGIC print('Success: table supplier_nepal created')
 68 | 
 69 | # COMMAND ----------
 70 | 
 71 | # MAGIC %sql
 72 | # MAGIC TRUNCATE TABLE club;
 73 | # MAGIC TRUNCATE TABLE department;
 74 | # MAGIC TRUNCATE TABLE employee;
 75 | # MAGIC TRUNCATE TABLE meal;
 76 | # MAGIC TRUNCATE TABLE drink;
 77 | # MAGIC TRUNCATE TABLE emp_salary;
 78 | # MAGIC TRUNCATE TABLE supplier_india;
 79 | # MAGIC TRUNCATE TABLE supplier_nepal;
 80 | 
 81 | # COMMAND ----------
 82 | 
 83 | # MAGIC %python
 84 | # MAGIC print('Success: table club truncated')
 85 | # MAGIC print('Success: table department truncated')
 86 | # MAGIC print('Success: table employee truncated')
 87 | # MAGIC print('Success: table meal truncated')
 88 | # MAGIC print('Success: table drink truncated')
 89 | # MAGIC print('Success: table emp_salary truncated')
 90 | # MAGIC print('Success: table supplier_india truncated')
 91 | # MAGIC print('Success: table supplier_nepal truncated')
 92 | 
 93 | # COMMAND ----------
 94 | 
 95 | # MAGIC %sql
 96 | # MAGIC INSERT INTO club
 97 | # MAGIC (club_id, club_name)
 98 | # MAGIC VALUES
 99 | # MAGIC ('C1', 'Cricket'),
100 | # MAGIC ('C2', 'Football'),
101 | # MAGIC ('C3', 'Golf'),
102 | # MAGIC ('C4', 'Wildlife & Nature'),
103 | # MAGIC ('C5', 'Photography'),
104 | # MAGIC ('C6', 'Art & Music');
105 | # MAGIC 
106 | # MAGIC INSERT INTO department
107 | # MAGIC (dept_id, dept_name)
108 | # MAGIC VALUES
109 | # MAGIC ('DEP001', 'IT'),
110 | # MAGIC ('DEP002', 'Marketing'),
111 | # MAGIC ('DEP003', 'Finance'),
112 | # MAGIC ('DEP004', 'BI'),
113 | # MAGIC ('DEP005', 'Admin'),
114 | # MAGIC ('DEP006', 'HR');
115 | # MAGIC 
116 | # MAGIC INSERT INTO employee
117 | # MAGIC (empcode, firstname, lastname, dept_id, club_id)
118 | # MAGIC VALUES
119 | # MAGIC ('EMP001', 'Albert', 'Einstein', 'DEP001', 'C1'),
120 | # MAGIC ('EMP002', 'Isaac', 'Newton', 'DEP001', 'C1'),
121 | # MAGIC ('EMP003', 'Elvis', 'Bose', 'DEP001', 'C2'),
122 | # MAGIC ('EMP004', 'Jose', 'Baldwin', 'DEP001', 'C3'),
123 | # MAGIC ('EMP005', 'Christian', 'Baldwin', 'DEP002', 'C1'),
124 | # MAGIC ('EMP006', 'Stephenie', 'Margarete', 'DEP002', 'C3'),
125 | # MAGIC ('EMP007', 'P.K', 'Chand', 'DEP003', 'C1'),
126 | # MAGIC ('EMP008', 'Eric', 'Clapton', 'DEP004', 'C6'),
127 | # MAGIC ('EMP009', 'Eric', 'Jhonson', 'DEP001', 'C9'),
128 | # MAGIC ('EMP010', 'Martand', 'Singh', 'DEP010', 'C3'),
129 | # MAGIC ('EMP011', 'Rajiv', 'Singh', 'DEP0010', 'C31'),
130 | # MAGIC ('EMP012', 'Jose', 'Peter', 'DEP0011', 'C1');
131 | # MAGIC 
132 | # MAGIC INSERT INTO meal
133 | # MAGIC (meal_id, meal_name)
134 | # MAGIC VALUES
135 | # MAGIC ('M001', 'Pizza'),
136 | # MAGIC ('M002', 'Burger'),
137 | # MAGIC ('M003', 'Sandwich'),
138 | # MAGIC ('M004', 'Pasta');
139 | # MAGIC 
140 | # MAGIC INSERT INTO drink
141 | # MAGIC (drink_id, drink_name)
142 | # MAGIC VALUES
143 | # MAGIC ('D001', 'Coke'),
144 | # MAGIC ('D002', 'Pepsi'),
145 | # MAGIC ('D003', 'Beer'),
146 | # MAGIC ('D004', 'Water');
147 | # MAGIC 
148 | # MAGIC INSERT INTO emp_salary
149 | # MAGIC (empcode,  basic_salary, transport, accomodation, food , extra)
150 | # MAGIC VALUES
151 | # MAGIC ('EMP001', 25000.99, 2000, 3000.99, 4500, 4500),
152 | # MAGIC ('EMP002', 35000.99, 1200, 3000.99, 3500, 4500),
153 | # MAGIC ('EMP003', 45000.99, 2500.99, 3000, 3560, 4500),
154 | # MAGIC ('EMP004', 15000.99, 2670.50, 3500, 3580, 7500),
155 | # MAGIC ('EMP005', 25600.99, 2120.50, 3000, 3589, 6500),
156 | # MAGIC ('EMP006', 67000.99, 2760, 4000, 3590, 5500),
157 | # MAGIC ('EMP007', 89000.99, 2000, 4000, 3511, 5500);
158 | # MAGIC 
159 | # MAGIC INSERT INTO supplier_india
160 | # MAGIC (supp_id, supp_name, city)
161 | # MAGIC VALUES
162 | # MAGIC ('SI001', 'Martand Singh', 'New Delhi'),
163 | # MAGIC ('SI002', 'Gaurav Chandawani', 'Mumbai'),
164 | # MAGIC ('SI003', 'Shweta Gupta', 'U.P'),
165 | # MAGIC ('SI004', 'Naresh Chawla', 'Punjab');
166 | # MAGIC 
167 | # MAGIC INSERT INTO supplier_nepal
168 | # MAGIC (supp_id, supp_name, city)
169 | # MAGIC VALUES
170 | # MAGIC ('SN001', 'Himal Gurung', 'Kathmandu'),
171 | # MAGIC ('SN002', 'Naina Shah', 'Pokhara'),
172 | # MAGIC ('SN003', 'Vicky Magar', 'Gorkha'),
173 | # MAGIC ('SN004', 'Martand Singh', 'Surkhet'),
174 | # MAGIC ('SN005', 'Gaurav Chandawani', 'Thankot'),
175 | # MAGIC ('SN006', 'Barkha Tiwari', 'Butwal');
176 | 
177 | # COMMAND ----------
178 | 
179 | 
180 | 
181 | # COMMAND ----------
182 | 
183 | print('Success: demo tables are ready.')
184 | 


--------------------------------------------------------------------------------
/SQL Refresher/PS000-INTRODUCTION.py:
--------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | 
3 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR000-Introduction.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### Introduction
 4 | # MAGIC This course is the first installment of databricks data engineering course. In this course you will learn basic SQL concept which include:
 5 | # MAGIC 1. Create, Select, Update, Delete tables
 6 | # MAGIC 1. Create database
 7 | # MAGIC 1. Filtering data
 8 | # MAGIC 1. Group by & aggregation
 9 | # MAGIC 1. Ordering
10 | # MAGIC 1. SQL joins
11 | # MAGIC 1. Common table expression (CTE)
12 | # MAGIC 1. External tables
13 | # MAGIC 1.  Sub queries
14 | # MAGIC 1. Views & temp views
15 | # MAGIC 1. UNION, INTERSECT, EXCEPT keywords
16 | # MAGIC 
17 | # MAGIC you can download all the notebook from our 
18 | # MAGIC 
19 | # MAGIC github repo: https://github.com/martandsingh/ApacheSpark
20 | # MAGIC 
21 | # MAGIC facebook: https://www.facebook.com/codemakerz
22 | # MAGIC 
23 | # MAGIC email: martandsays@gmail.com
24 | # MAGIC 
25 | # MAGIC ### SETUP folder
26 | # MAGIC you will see initial_setup & clean_up notebooks called in every notebooks. It is mandatory to run both the scripts in defined order. initial script will create all the mandatory tables & database for the demo. After you finish your notebook, execute clean up notebook, it will clean all the db objects.
27 | # MAGIC 
28 | # MAGIC ![SQL](https://raw.githubusercontent.com/martandsingh/images/master/sql.png)
29 | 
30 | # COMMAND ----------
31 | 
32 | # MAGIC %
33 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR001-Basic CRUD.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### What is CRUD?
 4 | # MAGIC CRUD stands for CREATE/RETRIEVE/UPDATE/DELETE. In this demo we will see how can we create a table and perform basic operations on it.
 5 | # MAGIC 
 6 | # MAGIC ![CRUD](https://raw.githubusercontent.com/martandsingh/images/master/crud.png)
 7 | 
 8 | # COMMAND ----------
 9 | 
10 | # MAGIC %sql
11 | # MAGIC -- CREATE A TABLE. Below command will create student table with three columns. IF NOT EXISTS clause will create table only if it does not exists, if you have table then it will ignore the statement.
12 | # MAGIC CREATE  TABLE IF NOT EXISTS students(
13 | # MAGIC   student_id VARCHAR(10),
14 | # MAGIC   student_name VARCHAR(50),
15 | # MAGIC   course VARCHAR(50)
16 | # MAGIC );
17 | 
18 | # COMMAND ----------
19 | 
20 | # MAGIC %sql
21 | # MAGIC -- check table details. it will return you table information like column name, datatypes
22 | # MAGIC DESCRIBE students
23 | 
24 | # COMMAND ----------
25 | 
26 | # MAGIC %sql
27 | # MAGIC -- Let's add some data.
28 | # MAGIC INSERT INTO students
29 | # MAGIC (student_id, student_name, course)
30 | # MAGIC VALUES
31 | # MAGIC ('ST001', 'ABC', 'BBA'),
32 | # MAGIC ('ST002', 'XYZ', 'MBA'),
33 | # MAGIC ('ST003', 'PQR', 'BCA');
34 | # MAGIC --above statement will insert three records to student tables
35 | 
36 | # COMMAND ----------
37 | 
38 | # MAGIC %sql
39 | # MAGIC -- query students table
40 | # MAGIC SELECT * FROM students;
41 | 
42 | # COMMAND ----------
43 | 
44 | # MAGIC %sql
45 | # MAGIC -- now lets update cours for student_id ST003. The student wants to change his/her course in the middle of semester. Now we have to update course in the student table.
46 | # MAGIC UPDATE students
47 | # MAGIC SET course = 'B.Tech'
48 | # MAGIC WHERE student_id = 'ST003'
49 | 
50 | # COMMAND ----------
51 | 
52 | # MAGIC %sql
53 | # MAGIC SELECT * FROM students
54 | # MAGIC -- course changed.
55 | 
56 | # COMMAND ----------
57 | 
58 | # MAGIC %sql
59 | # MAGIC -- lets delete ST003 as he was not happy with the college, he decided to move to another college. So we have to remove his name from the table.
60 | # MAGIC DELETE FROM students WHERE student_id = 'ST003'
61 | 
62 | # COMMAND ----------
63 | 
64 | # MAGIC %sql
65 | # MAGIC SELECT * FROM students
66 | 
67 | # COMMAND ----------
68 | 
69 | # This was a very basic CRUD demo to give you a quick start. You will more detailed queries in further demos. So keep going... You can do it.
70 | 
71 | # COMMAND ----------
72 | 
73 | 
74 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR002-Select & Filtering.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### What is filtering?
 4 | # MAGIC Selecting a subset of data based on some business logic is called filtering.
 5 | # MAGIC e.g. You have data for multiple countries, then you may want to select data only for one particular country or city or both.
 6 | # MAGIC 
 7 | # MAGIC ![Filtering](https://raw.githubusercontent.com/martandsingh/images/master/filtering.png)
 8 | 
 9 | # COMMAND ----------
10 | 
11 | # MAGIC %run ../SETUP/_initial_setup
12 | 
13 | # COMMAND ----------
14 | 
15 | # MAGIC %sql 
16 | # MAGIC -- * astrix will select all the columns and rows. As a big data engineer, you should avoid this because in real life scenario your table will have billions of records which you do not want to fetch frequently.
17 | # MAGIC SELECT * FROM club;
18 | 
19 | # COMMAND ----------
20 | 
21 | # MAGIC %sql
22 | # MAGIC SELECT * FROM department;
23 | 
24 | # COMMAND ----------
25 | 
26 | # MAGIC %sql
27 | # MAGIC SELECT * FROM employee;
28 | 
29 | # COMMAND ----------
30 | 
31 | # MAGIC %sql -- Projection: select only few columns. this is a prefferd practice. Only select the columns which you need. It will optimize your query.
32 | # MAGIC SELECT
33 | # MAGIC   firstname,
34 | # MAGIC   lastname,
35 | # MAGIC   dept_id
36 | # MAGIC FROM
37 | # MAGIC   employee;
38 | 
39 | # COMMAND ----------
40 | 
41 | # MAGIC %sql 
42 | # MAGIC -- SELECT top 5 records. 
43 | # MAGIC SELECT
44 | # MAGIC   firstname,
45 | # MAGIC   lastname,
46 | # MAGIC   dept_id
47 | # MAGIC FROM
48 | # MAGIC   employee
49 | # MAGIC LIMIT
50 | # MAGIC   5;
51 | 
52 | # COMMAND ----------
53 | 
54 | # MAGIC %sql 
55 | # MAGIC -- apply filters using WHERE keyword choose all the employee of department DEP001
56 | # MAGIC SELECT
57 | # MAGIC   *
58 | # MAGIC FROM
59 | # MAGIC   employee
60 | # MAGIC WHERE
61 | # MAGIC   dept_id = 'DEP001'
62 | 
63 | # COMMAND ----------
64 | 
65 | # MAGIC %sql 
66 | # MAGIC -- apply filters using WHERE keyword choose all the employee of department DEP001 & club C1
67 | # MAGIC SELECT
68 | # MAGIC   *
69 | # MAGIC FROM
70 | # MAGIC   employee
71 | # MAGIC WHERE
72 | # MAGIC   dept_id = 'DEP001'
73 | # MAGIC   AND club_id = 'C1'
74 | 
75 | # COMMAND ----------
76 | 
77 | # MAGIC %sql
78 | # MAGIC -- Find all the employees from club C1, C2 & C3
79 | # MAGIC SELECT * FROM employee
80 | # MAGIC WHERE club_id IN ('C1', 'C2', 'C3')
81 | 
82 | # COMMAND ----------
83 | 
84 | # MAGIC %sql
85 | # MAGIC -- Find all the employees which are not in club C1, C2 & C3
86 | # MAGIC SELECT * FROM employee
87 | # MAGIC WHERE club_id NOT IN ('C1', 'C2', 'C3')
88 | 
89 | # COMMAND ----------
90 | 
91 | # MAGIC %run ../SETUP/_clean_up
92 | 
93 | # COMMAND ----------
94 | 
95 | 
96 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR003-JOINS.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### What are Joins?
  4 | # MAGIC Joins are used to combine two or more tables based on one or more column. This is used to select data from multiple table.
  5 | # MAGIC 
  6 | # MAGIC ### Types of join
  7 | # MAGIC There are 4 major kind of joins:
  8 | # MAGIC 1. Inner join
  9 | # MAGIC 1. Left outer join
 10 | # MAGIC 1. Right outer join
 11 | # MAGIC 1. Full outer join
 12 | # MAGIC 1. Cross join
 13 | # MAGIC ![SQL_JOIN](https://raw.githubusercontent.com/martandsingh/images/master/join_demo.png)
 14 | 
 15 | # COMMAND ----------
 16 | 
 17 | # MAGIC %run ../SETUP/_initial_setup
 18 | 
 19 | # COMMAND ----------
 20 | 
 21 | # MAGIC %md
 22 | # MAGIC ### Database & table details
 23 | # MAGIC _initial_setup command will run our setup notebook where we are creating DB_DEMO database & 3 table (employee, department, club). employee table include employee details including department id(dept_id) & club id (club_id) which are the Foreign key related with department & club table respectively. Below entity diagram shows the relation between tables.
 24 | # MAGIC 
 25 | # MAGIC 
 26 | # MAGIC ![my_test_image](https://raw.githubusercontent.com/martandsingh/images/master/entity_diag.png)
 27 | 
 28 | # COMMAND ----------
 29 | 
 30 | # MAGIC %md 
 31 | 
 32 | # COMMAND ----------
 33 | 
 34 | # MAGIC %md
 35 | # MAGIC #### INNER JOIN
 36 | # MAGIC Returns rows that have matching values in both the table (LEFT table, RIGHT table). Left table is the one mentioned before JOIN clause & RIGHT table is the one mentioned after JOIN clause.
 37 | # MAGIC 
 38 | # MAGIC Syntax:
 39 | # MAGIC 
 40 | # MAGIC SELECT A.col, B.col
 41 | # MAGIC 
 42 | # MAGIC FROM {LEFT_TABLE} A
 43 | # MAGIC 
 44 | # MAGIC INNER {JOIN RIGHT_TABLE} B
 45 | # MAGIC 
 46 | # MAGIC ON A.{col} = B.{col}
 47 | 
 48 | # COMMAND ----------
 49 | 
 50 | # MAGIC %sql 
 51 | # MAGIC select * from employee
 52 | 
 53 | # COMMAND ----------
 54 | 
 55 | # MAGIC %sql -- We can see, we have department_id in our employee table which tell us about department of the employee. But what if we want department
 56 | # MAGIC -- name instead? We have to put an inner join employee table with department table based on dept id
 57 | # MAGIC SELECT
 58 | # MAGIC   E.firstname, E.lastname, D.dept_name AS department
 59 | # MAGIC FROM
 60 | # MAGIC   employee E
 61 | # MAGIC   INNER JOIN department D ON E.dept_id = D.dept_id
 62 | # MAGIC   
 63 | 
 64 | # COMMAND ----------
 65 | 
 66 | # MAGIC %sql
 67 | # MAGIC SELECT
 68 | # MAGIC   E.firstname, E.lastname, D.dept_name AS department
 69 | # MAGIC FROM
 70 | # MAGIC   employee E
 71 | # MAGIC   LEFT JOIN department D ON E.dept_id = D.dept_id
 72 | 
 73 | # COMMAND ----------
 74 | 
 75 | # MAGIC %sql
 76 | # MAGIC SELECT
 77 | # MAGIC   E.firstname, E.lastname, D.dept_name AS department
 78 | # MAGIC FROM
 79 | # MAGIC   employee E
 80 | # MAGIC   RIGHT JOIN department D ON E.dept_id = D.dept_id
 81 | 
 82 | # COMMAND ----------
 83 | 
 84 | # MAGIC %sql
 85 | # MAGIC SELECT
 86 | # MAGIC   E.firstname, E.lastname, D.dept_name AS department
 87 | # MAGIC FROM
 88 | # MAGIC   employee E
 89 | # MAGIC   FULL JOIN department D ON E.dept_id = D.dept_id
 90 | 
 91 | # COMMAND ----------
 92 | 
 93 | # MAGIC %sql
 94 | # MAGIC SELECT M.meal_name, D.drink_name
 95 | # MAGIC FROM meal M CROSS JOIN drink D
 96 | 
 97 | # COMMAND ----------
 98 | 
 99 | # MAGIC %run ../SETUP/_clean_up
100 | 
101 | # COMMAND ----------
102 | 
103 | 
104 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR004-Order & Grouping.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### What is Grouping?
  4 | # MAGIC The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of customers in each country". The GROUP BY statement is often used with aggregate functions ( COUNT() , MAX() , MIN() , SUM() , AVG() ) to group the result-set by one or more columns.
  5 | # MAGIC 
  6 | # MAGIC Keyword: GROUP BY
  7 | # MAGIC 
  8 | # MAGIC ### What is ordering?
  9 | # MAGIC The SQL ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. Some databases sort the query results in an ascending order by default.
 10 | # MAGIC 
 11 | # MAGIC Keyword: ORDER BY
 12 | # MAGIC 
 13 | # MAGIC ![Grouping](https://raw.githubusercontent.com/martandsingh/images/master/grouping.png)
 14 | 
 15 | # COMMAND ----------
 16 | 
 17 | # MAGIC %run ../SETUP/_initial_setup
 18 | 
 19 | # COMMAND ----------
 20 | 
 21 | # MAGIC %sql
 22 | # MAGIC SELECT * FROM club
 23 | 
 24 | # COMMAND ----------
 25 | 
 26 | # MAGIC %sql -- Let's caculate number of employees for each club
 27 | # MAGIC SELECT
 28 | # MAGIC   club_id,
 29 | # MAGIC   COUNT(1) AS total_members
 30 | # MAGIC FROM
 31 | # MAGIC   employee
 32 | # MAGIC GROUP BY
 33 | # MAGIC   club_id
 34 | 
 35 | # COMMAND ----------
 36 | 
 37 | # MAGIC %sql -- Let's caculate number of employees for each club and sort the result in DECREASING order of total_members
 38 | # MAGIC SELECT
 39 | # MAGIC   club_id,
 40 | # MAGIC   COUNT(1) AS total_members
 41 | # MAGIC FROM
 42 | # MAGIC   employee
 43 | # MAGIC GROUP BY
 44 | # MAGIC   club_id
 45 | # MAGIC ORDER BY
 46 | # MAGIC   total_members DESC
 47 | 
 48 | # COMMAND ----------
 49 | 
 50 | # MAGIC %sql -- Let's caculate number of employees for each club and sort the result in INCREASING order of total_members
 51 | # MAGIC SELECT
 52 | # MAGIC   club_id,
 53 | # MAGIC   COUNT(1) AS total_members
 54 | # MAGIC FROM
 55 | # MAGIC   employee
 56 | # MAGIC GROUP BY
 57 | # MAGIC   club_id
 58 | # MAGIC ORDER BY
 59 | # MAGIC   total_members
 60 | 
 61 | # COMMAND ----------
 62 | 
 63 | # MAGIC %md
 64 | # MAGIC Above query does not seems perfect, As our user does not know what is C1, C2... so on. We need to specify the name of the club. For that we have to perform an inner join with club table.
 65 | # MAGIC Lets find out name of the club and total members. You have to select only club which has more than one member. Arrange them in decreasing order of total members.
 66 | 
 67 | # COMMAND ----------
 68 | 
 69 | # MAGIC %sql
 70 | # MAGIC SELECT
 71 | # MAGIC   C.club_name,
 72 | # MAGIC   COUNT(1) AS total_members
 73 | # MAGIC FROM
 74 | # MAGIC   employee E
 75 | # MAGIC   INNER JOIN club C ON E.club_id = C.club_id
 76 | # MAGIC GROUP BY
 77 | # MAGIC   C.club_name
 78 | # MAGIC HAVING
 79 | # MAGIC   total_members > 1
 80 | # MAGIC ORDER BY
 81 | # MAGIC   total_members DESC
 82 | 
 83 | # COMMAND ----------
 84 | 
 85 | # MAGIC %md 
 86 | # MAGIC We can group & order our resultset based on more than one column. lets group our data based on club & department.
 87 | 
 88 | # COMMAND ----------
 89 | 
 90 | # MAGIC %sql
 91 | # MAGIC SELECT
 92 | # MAGIC   D.dept_name AS Department,
 93 | # MAGIC   C.club_name AS Club,
 94 | # MAGIC   COUNT(1) AS total_members
 95 | # MAGIC FROM
 96 | # MAGIC   employee E
 97 | # MAGIC   INNER JOIN club C ON E.club_id = C.club_id
 98 | # MAGIC   INNER JOIN department D ON E.dept_id = D.dept_id
 99 | # MAGIC GROUP BY
100 | # MAGIC   D.dept_name,
101 | # MAGIC   C.club_name
102 | # MAGIC ORDER BY
103 | # MAGIC   total_members -- so now you can see, there are multiple rows for marketing & IT. As these two department has members from different club.
104 | 
105 | # COMMAND ----------
106 | 
107 | # MAGIC %md
108 | # MAGIC Most of the time we use aggregation functions with grouped data. Lets calculate average basic salary of each department. 
109 | 
110 | # COMMAND ----------
111 | 
112 | # MAGIC %sql
113 | # MAGIC SELECT
114 | # MAGIC   *
115 | # MAGIC FROM
116 | # MAGIC   employee
117 | # MAGIC order by
118 | # MAGIC   empcode
119 | 
120 | # COMMAND ----------
121 | 
122 | # MAGIC %sql
123 | # MAGIC SELECT
124 | # MAGIC   D.dept_name,
125 | # MAGIC   ROUND(AVG(ES.basic_salary), 2) AS AVG_BASIC_SALARY
126 | # MAGIC FROM
127 | # MAGIC   employee E
128 | # MAGIC   INNER JOIN emp_salary ES ON E.empcode = ES.empcode
129 | # MAGIC   INNER JOIN department D ON E.dept_id = D.dept_id
130 | # MAGIC GROUP BY
131 | # MAGIC   D.dept_name
132 | # MAGIC ORDER BY
133 | # MAGIC   AVG_BASIC_SALARY
134 | 
135 | # COMMAND ----------
136 | 
137 | # MAGIC %md 
138 | # MAGIC Our finance team is now wants to downsize the company(not a good news for employees... :( ). They want you to calculate total salary distributed by each department.
139 | 
140 | # COMMAND ----------
141 | 
142 | # MAGIC %sql
143 | # MAGIC SELECT
144 | # MAGIC   D.dept_name,
145 | # MAGIC   ROUND(SUM(ES.basic_salary), 2) AS TOTAL_BASIC_SALARY
146 | # MAGIC FROM
147 | # MAGIC   employee E
148 | # MAGIC   INNER JOIN emp_salary ES ON E.empcode = ES.empcode
149 | # MAGIC   INNER JOIN department D ON E.dept_id = D.dept_id
150 | # MAGIC GROUP BY
151 | # MAGIC   D.dept_name
152 | # MAGIC ORDER BY
153 | # MAGIC   TOTAL_BASIC_SALARY
154 | 
155 | # COMMAND ----------
156 | 
157 | # MAGIC %run ../SETUP/_clean_up
158 | 
159 | # COMMAND ----------
160 | 
161 | 
162 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR005-Sub Queries.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### Whats is subquery?
  4 | # MAGIC A subquery is a SQL query nested inside a larger query. The subquery can be nested inside a SELECT, INSERT, UPDATE, or DELETE statement or inside another subquery. A subquery is usually added within the WHERE Clause of another SQL SELECT statement.
  5 | # MAGIC 
  6 | # MAGIC A subquery can be used anywhere an expression is allowed. A subquery is also called an inner query or inner select, while the statement containing a subquery is also called an outer query or outer select.
  7 | # MAGIC 
  8 | # MAGIC  In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. For architectural information on how SQL Server processes queries, see SQL statement processing.However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
  9 | # MAGIC  
 10 | # MAGIC 
 11 | # MAGIC *Note: there are multiple ways to write same query. You have to select the best way. This notebook is specifically for discussing sub queries, so some queries may not make sense but that is just for example. I want to show you multiple ways of generating same result. Later in this series we will have a specific notebook to talk about query execution order & optimization. There we will talk in detail about performance.*
 12 | # MAGIC 
 13 | # MAGIC ![Subquery](https://raw.githubusercontent.com/martandsingh/images/master/subquery.jpg)
 14 | 
 15 | # COMMAND ----------
 16 | 
 17 | # MAGIC %run ../SETUP/_initial_setup
 18 | 
 19 | # COMMAND ----------
 20 | 
 21 | # MAGIC %md
 22 | # MAGIC -- lets say we want to get all the employees who are the member of existing club(the club exists in club table). There are two ways to achieve this:
 23 | # MAGIC 1. Inner Join between employee & club based on club_id
 24 | # MAGIC 1. Using sub query
 25 | 
 26 | # COMMAND ----------
 27 | 
 28 | # MAGIC %sql -- Lets use join first. This will return full name of all the employees who are member of a valid club(existing club).
 29 | # MAGIC SELECT
 30 | # MAGIC   concat(E.firstname, ' ', E.lastname) AS FullName
 31 | # MAGIC FROM
 32 | # MAGIC   employee E
 33 | # MAGIC   INNER JOIN club C ON E.club_id = C.club_id
 34 | # MAGIC ORDER BY
 35 | # MAGIC   FullName
 36 | 
 37 | # COMMAND ----------
 38 | 
 39 | # MAGIC %sql -- other way to get same result using sub query or inner query. Below query will return exactly same result as above. The query inside the paranthesis (SELECT club_id FROM club) is your sub query. First this query is executing and providing a resultset which later will be used in WHERE condition for the parent query.
 40 | # MAGIC SELECT
 41 | # MAGIC   concat(E.firstname, ' ', E.lastname) AS FullName
 42 | # MAGIC FROM
 43 | # MAGIC   employee E
 44 | # MAGIC WHERE
 45 | # MAGIC   club_id IN (
 46 | # MAGIC     SELECT
 47 | # MAGIC       club_id
 48 | # MAGIC     FROM
 49 | # MAGIC       club
 50 | # MAGIC   )
 51 | # MAGIC ORDER BY
 52 | # MAGIC   FullName
 53 | 
 54 | # COMMAND ----------
 55 | 
 56 | # MAGIC %md
 57 | # MAGIC Do not use sub queries blindly, sometimes it is not efficient to use sub queries. As we mentioned earlier, In case of existence check we should prefer JOINS over sub queries. You can compare the execution time of both the queries. We have a very small set of data, which may not show you a significat difference between queries. In real life case where you deal with GB, TB of data, the difference can be huge.
 58 | # MAGIC 
 59 | # MAGIC Let's take one more example. We have to find out average basic salary of IT department. The output resultset must return only one column which is avg salary for IT department. Let's do this task with inner join & sub query. 
 60 | 
 61 | # COMMAND ----------
 62 | 
 63 | # MAGIC %sql
 64 | # MAGIC SELECT
 65 | # MAGIC   ES.basic_salary
 66 | # MAGIC FROM
 67 | # MAGIC   employee E
 68 | # MAGIC   INNER JOIN emp_salary ES ON E.empcode = ES.empcode
 69 | # MAGIC WHERE
 70 | # MAGIC   dept_id = 'DEP001'
 71 | 
 72 | # COMMAND ----------
 73 | 
 74 | # MAGIC %sql -- INNER JOIN
 75 | # MAGIC SELECT
 76 | # MAGIC   ROUND(AVG(ES.basic_salary), 2) AS AVG_BASIC_SALARY
 77 | # MAGIC from
 78 | # MAGIC   employee E
 79 | # MAGIC   INNER JOIN department D ON E.dept_id = D.dept_id
 80 | # MAGIC   INNER JOIN emp_salary ES ON E.empcode = ES.empcode
 81 | # MAGIC GROUP BY
 82 | # MAGIC   D.dept_name
 83 | # MAGIC HAVING
 84 | # MAGIC   D.dept_name = 'IT'
 85 | 
 86 | # COMMAND ----------
 87 | 
 88 | # MAGIC %sql -- Above task using inner query or subquery
 89 | # MAGIC SELECT
 90 | # MAGIC   ROUND(AVG(ES.basic_salary), 2) AS AVG_BASIC_SALARY
 91 | # MAGIC from
 92 | # MAGIC   employee E
 93 | # MAGIC   INNER JOIN emp_salary ES ON E.empcode = ES.empcode
 94 | # MAGIC WHERE
 95 | # MAGIC   E.dept_id = (
 96 | # MAGIC     SELECT
 97 | # MAGIC       dept_id
 98 | # MAGIC     FROM
 99 | # MAGIC       department
100 | # MAGIC     WHERE
101 | # MAGIC       dept_name = 'IT'
102 | # MAGIC   )
103 | 
104 | # COMMAND ----------
105 | 
106 | 
107 | 
108 | # COMMAND ----------
109 | 
110 | # MAGIC %run ../SETUP/_clean_up
111 | 
112 | # COMMAND ----------
113 | 
114 | 
115 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR006-Views & Temp Views.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### What is View?
  4 | # MAGIC View is a virtual table based on resultset of a query. A view always gives you latest resultset. In real scenario, if you have a complex query with many joins & sub queries which you want to reuse, you can create a view for that query and use it like a physical table. 
  5 | # MAGIC 
  6 | # MAGIC SYNTAX:
  7 | # MAGIC 
  8 | # MAGIC CREATE VIEW IF NOT EXISTS {ViewName} AS
  9 | # MAGIC 
 10 | # MAGIC 
 11 | # MAGIC SELECT * FROM TABLE
 12 | # MAGIC 
 13 | # MAGIC ### What is Temp View?
 14 | # MAGIC TEMPORARY views are session-scoped and is dropped when session ends because it skips persisting the definition in the underlying metastore, if any. GLOBAL TEMPORARY views are tied to a system preserved temporary schema global_temp.
 15 | # MAGIC 
 16 | # MAGIC SYNTAX:
 17 | # MAGIC 
 18 | # MAGIC CREATE TEMP VIEW IF NOT EXISTS {tempviewname} AS
 19 | # MAGIC 
 20 | # MAGIC SELECT * FROM TABLE
 21 | # MAGIC 
 22 | # MAGIC 
 23 | # MAGIC ![SQL_JOINS](https://raw.githubusercontent.com/martandsingh/images/master/view-demo.png)
 24 | 
 25 | # COMMAND ----------
 26 | 
 27 | # MAGIC %run ../SETUP/_initial_setup
 28 | 
 29 | # COMMAND ----------
 30 | 
 31 | # MAGIC %sql --Lets say we want to find all the the employee with their department & club name. For that we have to write a complex query with multiple join.
 32 | # MAGIC SELECT
 33 | # MAGIC   E.firstname,
 34 | # MAGIC   E.lastname,
 35 | # MAGIC   D.dept_name AS Department,
 36 | # MAGIC   C.club_name AS Club
 37 | # MAGIC FROM
 38 | # MAGIC   employee E
 39 | # MAGIC   INNER JOIN department D ON E.dept_id = D.dept_id
 40 | # MAGIC   INNER JOIN club C ON E.club_id = C.club_id
 41 | 
 42 | # COMMAND ----------
 43 | 
 44 | # MAGIC %md 
 45 | # MAGIC Now we have a complex query which we may want to reuse that query in multiple procedure. You are in your scrum meeting and find out, you are using the wrong logic. Now you have to make changes in multiple procedure. It will:
 46 | # MAGIC 1. Waste your time & effort
 47 | # MAGIC 1. You may miss some procedures
 48 | # MAGIC 
 49 | # MAGIC So best way to peform this task is to create a view. You can use that view in your procedures. In case of any logic change, now you only have to update your view. As view gives you the latest result, your changes will immediately reflect to all the procedures. tada!!!!
 50 | 
 51 | # COMMAND ----------
 52 | 
 53 | # MAGIC %sql CREATE VIEW IF NOT EXISTS VW_GET_EMPLOYEES AS
 54 | # MAGIC SELECT
 55 | # MAGIC   E.firstname,
 56 | # MAGIC   E.lastname,
 57 | # MAGIC   D.dept_name AS Department,
 58 | # MAGIC   C.club_name AS Club
 59 | # MAGIC FROM
 60 | # MAGIC   employee E
 61 | # MAGIC   INNER JOIN department D ON E.dept_id = D.dept_id
 62 | # MAGIC   INNER JOIN club C ON E.club_id = C.club_id
 63 | 
 64 | # COMMAND ----------
 65 | 
 66 | # MAGIC %md
 67 | # MAGIC Now we have our complex query available as a view VW_GET_EMPLOYEES, which we can use as a regular table.
 68 | 
 69 | # COMMAND ----------
 70 | 
 71 | # MAGIC %sql
 72 | # MAGIC -- it will give you exact same result as our complex query
 73 | # MAGIC SELECT * FROM VW_GET_EMPLOYEES
 74 | 
 75 | # COMMAND ----------
 76 | 
 77 | # MAGIC %sql
 78 | # MAGIC -- We can list views using SHOW VIEWS
 79 | # MAGIC SHOW VIEWS
 80 | 
 81 | # COMMAND ----------
 82 | 
 83 | 
 84 | 
 85 | # COMMAND ----------
 86 | 
 87 | # MAGIC %md
 88 | # MAGIC Now let's create a temp view. How is it different than a regular view?
 89 | # MAGIC 
 90 | # MAGIC Well a temp view will be available only for your current session. If you restart your session, you will loose your temp view. It will not affect any physical table or data.
 91 | 
 92 | # COMMAND ----------
 93 | 
 94 | # MAGIC %sql
 95 | # MAGIC CREATE OR REPLACE TEMP VIEW TEMP_VW_GET_EMPLOYEES AS
 96 | # MAGIC SELECT
 97 | # MAGIC   E.firstname,
 98 | # MAGIC   E.lastname,
 99 | # MAGIC   D.dept_name AS Department,
100 | # MAGIC   C.club_name AS Club
101 | # MAGIC FROM
102 | # MAGIC   employee E
103 | # MAGIC   INNER JOIN department D ON E.dept_id = D.dept_id
104 | # MAGIC   INNER JOIN club C ON E.club_id = C.club_id
105 | 
106 | # COMMAND ----------
107 | 
108 | # MAGIC %sql
109 | # MAGIC SELECT * FROM TEMP_VW_GET_EMPLOYEES
110 | 
111 | # COMMAND ----------
112 | 
113 | # MAGIC %sql
114 | # MAGIC -- You can drop views using DROP VIEW
115 | # MAGIC DROP VIEW VW_GET_EMPLOYEES
116 | 
117 | # COMMAND ----------
118 | 
119 | # MAGIC %run ../SETUP/_clean_up
120 | 
121 | # COMMAND ----------
122 | 
123 | 
124 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR007-Common Table Expressions.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### What is common table expression (CTE)?
  4 | # MAGIC A Common Table Expression (or CTE) is a feature in several SQL versions to improve the maintainability and readability of an SQL query.
  5 | # MAGIC 
  6 | # MAGIC It is also known as:
  7 | # MAGIC 1. Common Table Expression
  8 | # MAGIC 1. Subquery Factoring
  9 | # MAGIC 1. SQL WITH Clause
 10 | # MAGIC 
 11 | # MAGIC A Common Table Expression (or CTE) is a query you can define within another SQL query. 
 12 | # MAGIC 
 13 | # MAGIC ### What is the difference between Subquery & CTE?
 14 | # MAGIC If you are new to SQL concept, then it may sound like a subquery (believe me you are totally normal, it happened to me too :p). A CTE also generates a result that contains rows and columns of data. The difference is that you can give this result a name, and you can refer to it multiple times within your main query. So in other words, CTE provides you a named resultset.
 15 | # MAGIC 
 16 | # MAGIC You can use a CTE in:
 17 | # MAGIC 1. SELECT
 18 | # MAGIC 1. INSERT
 19 | # MAGIC 1. UPDATE
 20 | # MAGIC 1. DELETE
 21 | # MAGIC 
 22 | # MAGIC 
 23 | # MAGIC ![cte](https://raw.githubusercontent.com/martandsingh/images/master/cte.jpg)
 24 | 
 25 | # COMMAND ----------
 26 | 
 27 | # MAGIC %run ../SETUP/_initial_setup
 28 | 
 29 | # COMMAND ----------
 30 | 
 31 | # MAGIC %sql WITH cte_department_employee_count AS(
 32 | # MAGIC   SELECT
 33 | # MAGIC     D.dept_name AS Department,
 34 | # MAGIC     COUNT(1) AS `Total Members`
 35 | # MAGIC   FROM
 36 | # MAGIC     employee E
 37 | # MAGIC     INNER JOIN department D ON E.dept_id = D.dept_id
 38 | # MAGIC   GROUP BY
 39 | # MAGIC     D.dept_name
 40 | # MAGIC   ORDER BY
 41 | # MAGIC     `Total Members` DESC
 42 | # MAGIC ) -- so here we can use cte_department_employee_count as our named result set immediately after CTE is defined.
 43 | # MAGIC SELECT
 44 | # MAGIC   *
 45 | # MAGIC FROM
 46 | # MAGIC   cte_department_employee_count;
 47 | # MAGIC -- Keep in mind once you reed CTE, it will dissappear & throw error if you try to run it again.
 48 | 
 49 | # COMMAND ----------
 50 | 
 51 | # MAGIC %md
 52 | # MAGIC ### Nested CTE
 53 | # MAGIC You can use nested CTE if you want to reuse your CTE in another.
 54 | # MAGIC 
 55 | # MAGIC Syntax:
 56 | # MAGIC 
 57 | # MAGIC WITH CTE1 AS (
 58 | # MAGIC   
 59 | # MAGIC   {Query1}
 60 | # MAGIC   
 61 | # MAGIC ),
 62 | # MAGIC 
 63 | # MAGIC CTE2 AS (
 64 | # MAGIC 
 65 | # MAGIC   {QUERY2}
 66 | # MAGIC   
 67 | # MAGIC )
 68 | # MAGIC 
 69 | # MAGIC SELECT * FROM CTE1;
 70 | # MAGIC 
 71 | # MAGIC SELECT * FROM CTE2;
 72 | # MAGIC 
 73 | # MAGIC let's try it out.
 74 | 
 75 | # COMMAND ----------
 76 | 
 77 | # MAGIC %sql -- let's create one CTE which will calculate department wise salary. We will use full join so that we can include department which does not exists, this will generate NULL values for AVG salary field. In the second expression we will remove those invalid rows (basic salary with null values). This can be done in a simpler way using a single query but for the sake of tutorial, I am using 2 different CTE to achevie this task, but in real life scenario CTE is not a good way to acheive this. You can simple do this by one group statement & filter (WHERE). You can see in the below query we applied outer join (just for the sake of tutorial, do not think about the business logic here). You can see many invalid rows with dept_name and AVG_BASIC_SALARY as null. Now we will change this query to CTE & using another CTE we will clean these rows.
 78 | # MAGIC --WITH cte_dept_salary AS(
 79 | # MAGIC SELECT
 80 | # MAGIC   D.dept_name,
 81 | # MAGIC   AVG(ES.basic_salary) AS AVG_BASIC_SALARY
 82 | # MAGIC FROM
 83 | # MAGIC   employee E FULL
 84 | # MAGIC   JOIN emp_salary ES ON E.empcode = ES.empcode FULL
 85 | # MAGIC   JOIN department D ON E.dept_id = D.dept_id
 86 | # MAGIC GROUP BY
 87 | # MAGIC   D.dept_name --)
 88 | 
 89 | # COMMAND ----------
 90 | 
 91 | # MAGIC %sql WITH cte_avg_salary AS (
 92 | # MAGIC   SELECT
 93 | # MAGIC     D.dept_name,
 94 | # MAGIC     AVG(ES.basic_salary) AS AVG_BASIC_SALARY
 95 | # MAGIC   FROM
 96 | # MAGIC     employee E FULL
 97 | # MAGIC     JOIN emp_salary ES ON E.empcode = ES.empcode FULL
 98 | # MAGIC     JOIN department D ON E.dept_id = D.dept_id
 99 | # MAGIC   GROUP BY
100 | # MAGIC     D.dept_name
101 | # MAGIC ),
102 | # MAGIC cte_avg_salary_clean AS (
103 | # MAGIC   SELECT
104 | # MAGIC     *
105 | # MAGIC   FROM
106 | # MAGIC     cte_avg_salary
107 | # MAGIC   WHERE
108 | # MAGIC     dept_name IS NOT NULL
109 | # MAGIC     AND AVG_BASIC_SALARY IS NOT NULL
110 | # MAGIC )
111 | # MAGIC SELECT
112 | # MAGIC   *
113 | # MAGIC FROM
114 | # MAGIC   cte_avg_salary_clean;
115 | 
116 | # COMMAND ----------
117 | 
118 | # MAGIC %md
119 | # MAGIC ### CTE with View
120 | # MAGIC You can also create view using your CTE.
121 | # MAGIC 
122 | # MAGIC Syntax:
123 | # MAGIC 
124 | # MAGIC CREATE OR REPLACE VIEW {ViewName} AS
125 | # MAGIC 
126 | # MAGIC WITH CTE AS (
127 | # MAGIC 
128 | # MAGIC   {COMPLEX QUERY}
129 | # MAGIC   
130 | # MAGIC )
131 | # MAGIC 
132 | # MAGIC SELECT * FROM CTE
133 | 
134 | # COMMAND ----------
135 | 
136 | # MAGIC %sql -- We can use CTE with views also. Let's use above nested cte in a view
137 | # MAGIC CREATE
138 | # MAGIC OR REPLACE VIEW VW_DEPT_SALARY AS WITH cte_avg_salary AS (
139 | # MAGIC   SELECT
140 | # MAGIC     D.dept_name,
141 | # MAGIC     AVG(ES.basic_salary) AS AVG_BASIC_SALARY
142 | # MAGIC   FROM
143 | # MAGIC     employee E FULL
144 | # MAGIC     JOIN emp_salary ES ON E.empcode = ES.empcode FULL
145 | # MAGIC     JOIN department D ON E.dept_id = D.dept_id
146 | # MAGIC   GROUP BY
147 | # MAGIC     D.dept_name
148 | # MAGIC ),
149 | # MAGIC cte_avg_salary_clean AS (
150 | # MAGIC   SELECT
151 | # MAGIC     dept_name AS Department,
152 | # MAGIC     ROUND(AVG_BASIC_SALARY, 2) AS AVG_BASIC_SALARY
153 | # MAGIC   FROM
154 | # MAGIC     cte_avg_salary
155 | # MAGIC   WHERE
156 | # MAGIC     dept_name IS NOT NULL
157 | # MAGIC     AND AVG_BASIC_SALARY IS NOT NULL
158 | # MAGIC )
159 | # MAGIC SELECT
160 | # MAGIC   *
161 | # MAGIC FROM
162 | # MAGIC   cte_avg_salary_clean;
163 | 
164 | # COMMAND ----------
165 | 
166 | # MAGIC %sql
167 | # MAGIC SELECT
168 | # MAGIC   *
169 | # MAGIC FROM
170 | # MAGIC   VW_DEPT_SALARY;
171 | 
172 | # COMMAND ----------
173 | 
174 | # MAGIC %run ../SETUP/_clean_up
175 | 
176 | # COMMAND ----------
177 | 
178 | 
179 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR008 - EXCEPT, UNION, UNION ALL, INTERSECTION.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### UNION
  4 | # MAGIC The UNION operator is used to combine the result-set of two or more SELECT statements. It will remove duplicate rows from the final resultset.
  5 | # MAGIC 
  6 | # MAGIC ### UNION ALL
  7 | # MAGIC The UNION operator is used to combine the result-set of two or more SELECT statements. It will include duplicate rows in the final resultset.
  8 | # MAGIC 
  9 | # MAGIC ### INTERSECT
 10 | # MAGIC The INTERSECT clause in SQL is used to combine two SELECT statements but the dataset returned by the INTERSECT statement will be the intersection of the data-sets of the two SELECT statements. In simple words, the INTERSECT statement will return only those rows which will be common to both of the SELECT statements.
 11 | # MAGIC 
 12 | # MAGIC ### EXCEPT
 13 | # MAGIC The SQL EXCEPT operator is used to return all rows in the first SELECT statement that are not returned by the second SELECT statement.
 14 | # MAGIC 
 15 | # MAGIC *to use all these operations, both the table should have same number of columns & same types of columns*
 16 | # MAGIC 
 17 | # MAGIC ![Union_Intersection](https://raw.githubusercontent.com/martandsingh/images/master/union.jpg)
 18 | 
 19 | # COMMAND ----------
 20 | 
 21 | # MAGIC %run ../SETUP/_initial_setup
 22 | 
 23 | # COMMAND ----------
 24 | 
 25 | # MAGIC %sql
 26 | # MAGIC SELECT * FROM supplier_india
 27 | 
 28 | # COMMAND ----------
 29 | 
 30 | # MAGIC %sql 
 31 | # MAGIC SELECT * FROM supplier_nepal
 32 | 
 33 | # COMMAND ----------
 34 | 
 35 | # MAGIC %md 
 36 | # MAGIC ### UNION DEMO
 37 | 
 38 | # COMMAND ----------
 39 | 
 40 | # MAGIC %sql
 41 | # MAGIC -- here union will combine our both the dataset excluding duplicates. But you must be wondering why Martand Singh  & Gaurav Chadwani are showing twice?
 42 | # MAGIC -- The reason behind this is, as you are selecing supp_id & city in your resultset, which makes your row unique than other row. UNION check for duplicate value for the combination of the columns in the resultset. If you remove supp_id & city from query then it will remove duplicate suppliers. Let's try it in the next cell.
 43 | # MAGIC 
 44 | # MAGIC SELECT supp_id, supp_name, city FROM supplier_nepal
 45 | # MAGIC UNION
 46 | # MAGIC SELECT supp_id, supp_name, city FROM supplier_india
 47 | 
 48 | # COMMAND ----------
 49 | 
 50 | # MAGIC %sql
 51 | # MAGIC -- now you can see thoe tw duplicate suppliers are gone now. This is because we are only selecting supplier_name
 52 | # MAGIC SELECT supp_name FROM supplier_nepal
 53 | # MAGIC UNION
 54 | # MAGIC SELECT  supp_name FROM supplier_india
 55 | 
 56 | # COMMAND ----------
 57 | 
 58 | # MAGIC %md
 59 | # MAGIC ### UNION ALL DEMO
 60 | # MAGIC UNION ALL perform same as UNION except union all will include duplicate rows also in resultset.
 61 | 
 62 | # COMMAND ----------
 63 | 
 64 | # MAGIC %sql
 65 | # MAGIC -- Now you will see duplicate records as we are using UNION ALL. You can see Martand Singh & Gaurav Chandawani are duplicate. If all the records are unique then union and union all behave same. UNION ALL is faster than UNION as it does not have to perform checks for duplicate rows. So until you dont need duplicate check, try to use UNION ALL.
 66 | # MAGIC SELECT supp_name FROM supplier_nepal
 67 | # MAGIC UNION ALL
 68 | # MAGIC SELECT  supp_name FROM supplier_india
 69 | 
 70 | # COMMAND ----------
 71 | 
 72 | # MAGIC %md
 73 | # MAGIC ###INTERSECT DEMO
 74 | # MAGIC Let's say we want to find common supplier names. Intersection will give you records which are available in both the tables. Again it will find out common rows based on all the columns in your select query. So if we include city or supplier id, it will not give you any result as all three column makes your record unique.
 75 | 
 76 | # COMMAND ----------
 77 | 
 78 | # MAGIC %sql
 79 | # MAGIC -- this will return zero rows
 80 | # MAGIC SELECT supp_id, supp_name, city FROM supplier_india
 81 | # MAGIC INTERSECT
 82 | # MAGIC SELECT supp_id, supp_name, city FROM supplier_nepal
 83 | 
 84 | # COMMAND ----------
 85 | 
 86 | # MAGIC %sql
 87 | # MAGIC -- this will return 2 rows as these two supplier names are common in both the tables.
 88 | # MAGIC SELECT supp_name FROM supplier_india
 89 | # MAGIC INTERSECT
 90 | # MAGIC SELECT supp_name FROM supplier_nepal
 91 | 
 92 | # COMMAND ----------
 93 | 
 94 | # MAGIC %md
 95 | # MAGIC ### EXCEPT DEMO
 96 | # MAGIC The SQL EXCEPT operator is used to return all rows in the first SELECT statement that are not returned by the second SELECT statement.
 97 | # MAGIC 
 98 | # MAGIC Let's find all the suppliers from India which are only managing Indian market.
 99 | # MAGIC Or in other words, Let's find all the Indian supplier which are not operating in Nepal.
100 | 
101 | # COMMAND ----------
102 | 
103 | # MAGIC %sql
104 | # MAGIC -- so to calculate above query, we will select all the Indian supplier which are not available in nepali supplier table
105 | # MAGIC SELECT supp_name FROM supplier_india
106 | # MAGIC EXCEPT
107 | # MAGIC SELECT supp_name FROM supplier_nepal
108 | # MAGIC 
109 | # MAGIC --Above query says, get all supplier name from India which are not available in Nepal.
110 | # MAGIC --So there are only two suppliers which operation only in India other two (Martand & Gaurav) they operate in Nepal also, as they are available in Nepal supplier list too.
111 | 
112 | # COMMAND ----------
113 | 
114 | # MAGIC %sql
115 | # MAGIC -- so if you reverse the query, then it says, get all the supplier from nepal which are only operating in nepal (or not operating in india).
116 | # MAGIC SELECT supp_name FROM supplier_nepal
117 | # MAGIC EXCEPT
118 | # MAGIC SELECT supp_name FROM supplier_india
119 | 
120 | # COMMAND ----------
121 | 
122 | # MAGIC %run ../SETUP/_clean_up
123 | 
124 | # COMMAND ----------
125 | 
126 | 
127 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR009-External Tables.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### What is external table?
  4 | # MAGIC In a typical table, the data is stored in the database; however, in an external table, the data is stored in files in an external stage. External tables store file-level metadata about the data files, such as the filename, a version identifier and related properties. In external table, the data is not stored in database system, it is saved somewhere else in external location. These tables are slow to load as data is loaded from external source every time your run query on it. 
  5 | # MAGIC 
  6 | # MAGIC SYNTAX:
  7 | # MAGIC We use "external" keyword to create an external table.
  8 | # MAGIC 
  9 | # MAGIC Some characterstics:
 10 | # MAGIC 1. Data in external storage
 11 | # MAGIC 1. Slower to load
 12 | # MAGIC 1. Dropping the table will delete only table meta data. Your external data will be safe.
 13 | # MAGIC 1. It give you latest result from the file. If you delete external file then your query will throw exception.
 14 | # MAGIC 
 15 | # MAGIC Use case:
 16 | # MAGIC 1. As a data engineer, I use external table when I want to perform some adhoc analysis on some external file which we use infrequently.
 17 | # MAGIC 1. Sometimes you have some data in csv or flat files which your business user provides you & it frequently gets updated (my personal experience), in that case you can use external table(considering file size is small) to load that data, so that everytime you get latest result.
 18 | # MAGIC 
 19 | # MAGIC ![External_Table](https://raw.githubusercontent.com/martandsingh/images/master/external.png)
 20 | 
 21 | # COMMAND ----------
 22 | 
 23 | # MAGIC %run ../SETUP/_initial_setup
 24 | 
 25 | # COMMAND ----------
 26 | 
 27 | # MAGIC %sql -- here we are creating an external table from CSV file which is stored in my databricks storage. below command may not run in your system as you may not have same file at the same location. So you can upload a CSV file (in my case it was tab delimited) in your databricks storage and update the path and columns accordingly. th csv file which I used is available at : https://raw.githubusercontent.com/martandsingh/datasets/master/bank-full.csv
 28 | # MAGIC DROP TABLE IF EXISTS bank_report;
 29 | # MAGIC CREATE EXTERNAL TABLE bank_report (
 30 | # MAGIC   age STRING,
 31 | # MAGIC   job STRING,
 32 | # MAGIC   marital STRING,
 33 | # MAGIC   education STRING,
 34 | # MAGIC   default STRING,
 35 | # MAGIC   balance STRING,
 36 | # MAGIC   housing STRING,
 37 | # MAGIC   loan STRING,
 38 | # MAGIC   contact STRING,
 39 | # MAGIC   day STRING,
 40 | # MAGIC   month STRING,
 41 | # MAGIC   duration STRING,
 42 | # MAGIC   campaign STRING,
 43 | # MAGIC   pdays STRING,
 44 | # MAGIC   previous STRING,
 45 | # MAGIC   poutcome STRING,
 46 | # MAGIC   y STRING
 47 | # MAGIC ) USING CSV OPTIONS (
 48 | # MAGIC   path "/FileStore/tables/dataset/*.csv",
 49 | # MAGIC   delimiter ";",
 50 | # MAGIC   header "true"
 51 | # MAGIC );
 52 | 
 53 | # COMMAND ----------
 54 | 
 55 | # MAGIC %sql -- now let's select top 100 records from our external table
 56 | # MAGIC SELECT
 57 | # MAGIC   *
 58 | # MAGIC FROM
 59 | # MAGIC   bank_report
 60 | # MAGIC LIMIT
 61 | # MAGIC   100;
 62 | 
 63 | # COMMAND ----------
 64 | 
 65 | # MAGIC %sql -- We can also create external table without defining schema.
 66 | # MAGIC DROP TABLE IF EXISTS bank_report_nc;
 67 | # MAGIC CREATE EXTERNAL TABLE bank_report_nc USING CSV OPTIONS (
 68 | # MAGIC   path "/FileStore/tables/dataset/*.csv",
 69 | # MAGIC   delimiter ";",
 70 | # MAGIC   header "true"
 71 | # MAGIC );
 72 | 
 73 | # COMMAND ----------
 74 | 
 75 | # MAGIC %sql
 76 | # MAGIC SELECT * FROM bank_report_nc LIMIT 10;
 77 | 
 78 | # COMMAND ----------
 79 | 
 80 | # MAGIC %sql -- creating an external table in delta location. We are getting a subset of our external table bank_report and saving the output to a new delta table bank_report_del.
 81 | # MAGIC DROP TABLE IF EXISTS bank_report_del;
 82 | # MAGIC CREATE TABLE bank_report_del USING DELTA AS
 83 | # MAGIC SELECT
 84 | # MAGIC   *
 85 | # MAGIC FROM
 86 | # MAGIC   bank_report
 87 | # MAGIC WHERE
 88 | # MAGIC   balance > 1500
 89 | 
 90 | # COMMAND ----------
 91 | 
 92 | # MAGIC %sql
 93 | # MAGIC SELECT * FROM bank_report_del
 94 | 
 95 | # COMMAND ----------
 96 | 
 97 | # DO NOT WORRY ABOUT THIS CODE. I am using this code to save parquet file in your dbfs, for the external table demo using parquet file in next cell. We will undertand these code in future. For now, just run this cell.
 98 | df = spark.sql("SELECT * FROM bank_report WHERE balance > 1500")
 99 | #display(df)
100 | 
101 | df.write.format("delta").mode("overwrite").save("/delta/bank_users_1500")
102 | 
103 | # COMMAND ----------
104 | 
105 | # You can check the files using dbutils
106 | display(dbutils.fs.ls('/delta/bank_users_1500'))
107 | 
108 | # COMMAND ----------
109 | 
110 | # MAGIC %sql --We are creating an external table using delta location. We have saved parquet file in the given delta location.
111 | # MAGIC DROP TABLE IF EXISTS bank_report_parq;
112 | # MAGIC CREATE EXTERNAL TABLE bank_report_parq USING DELTA LOCATION "/delta/bank_users_1500/"
113 | 
114 | # COMMAND ----------
115 | 
116 | # MAGIC %sql
117 | # MAGIC SELECT * FROM bank_report_parq
118 | 
119 | # COMMAND ----------
120 | 
121 | # MAGIC %run ../SETUP/_clean_up
122 | 
123 | # COMMAND ----------
124 | 
125 | 
126 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR010-Drop database & tables.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### Drop database
 4 | # MAGIC DROP DATABASE {DB_NAME}
 5 | # MAGIC 
 6 | # MAGIC ### DROP table
 7 | # MAGIC DROP TABLE {TABLE_NAME}
 8 | # MAGIC 
 9 | # MAGIC Our initial setup script creates DB_DEMO & employee table. Let's check a demo.
10 | 
11 | # COMMAND ----------
12 | 
13 | # MAGIC %run ../SETUP/_initial_setup
14 | 
15 | # COMMAND ----------
16 | 
17 | # MAGIC %sql
18 | # MAGIC --list databases
19 | # MAGIC SHOW DATABASES;
20 | 
21 | # COMMAND ----------
22 | 
23 | # MAGIC %sql
24 | # MAGIC SHOW TABLES;
25 | 
26 | # COMMAND ----------
27 | 
28 | # MAGIC %sql 
29 | # MAGIC -- we can see there is a database name db_demo & few tables. let's drop employee table & then whole database.
30 | # MAGIC DROP TABLE employee;
31 | 
32 | # COMMAND ----------
33 | 
34 | # MAGIC %sql
35 | # MAGIC SHOW TABLES;
36 | # MAGIC -- now you will not see employee table in the list
37 | 
38 | # COMMAND ----------
39 | 
40 | # MAGIC %sql
41 | # MAGIC -- lets delete whole database. if your database has table then we have to use one keyword CASCADE, it will delete all the table and other objects in side the database and then drop the database.
42 | # MAGIC DROP DATABASE DB_DEMO CASCADE
43 | 
44 | # COMMAND ----------
45 | 
46 | # MAGIC %sql
47 | # MAGIC SHOW DATABASES
48 | # MAGIC -- db_demo will not be in the list.
49 | 
50 | # COMMAND ----------
51 | 
52 | # MAGIC %run ../SETUP/_clean_up
53 | 
54 | # COMMAND ----------
55 | 
56 | 
57 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR011-Check Table & Database Details.py:
--------------------------------------------------------------------------------
 1 | # Databricks notebook source
 2 | # MAGIC %md
 3 | # MAGIC ### What is metadata?
 4 | # MAGIC Metadata is descriptive information. So whenever you create a database or table, metadata is generated in backend. During your data engineering process, you may need to see details or metadata about your table & database. We can do it using "DESCRIBE" keyword.
 5 | # MAGIC 
 6 | # MAGIC Let's have a demo.
 7 | # MAGIC 
 8 | # MAGIC ![METADATA](https://raw.githubusercontent.com/martandsingh/images/master/metadat.png)
 9 | 
10 | # COMMAND ----------
11 | 
12 | # MAGIC %run ../SETUP/_initial_setup
13 | 
14 | # COMMAND ----------
15 | 
16 | # MAGIC %sql
17 | # MAGIC -- check database details
18 | # MAGIC DESCRIBE DATABASE DB_DEMO;
19 | # MAGIC 
20 | # MAGIC -- you can see it will give you basic information about database. You can see the location of data stored. You can change this location when creating the database,.
21 | 
22 | # COMMAND ----------
23 | 
24 | # MAGIC %sql
25 | # MAGIC DESCRIBE DATABASE EXTENDED DB_DEMO;
26 | # MAGIC -- When you write extended, it will give you more details about database. We have not defined any Properties so it is showing empty column for that. 
27 | 
28 | # COMMAND ----------
29 | 
30 | # MAGIC %sql
31 | # MAGIC DESCRIBE TABLE employee
32 | # MAGIC 
33 | # MAGIC -- or you can simply write DESCRIBE employee. You can see all the columns and data types. We do not have any partitioning for now.
34 | 
35 | # COMMAND ----------
36 | 
37 | # MAGIC %sql
38 | # MAGIC DESCRIBE EXTENDED employee
39 | # MAGIC -- this will show detailed information about table. 
40 | 
41 | # COMMAND ----------
42 | 
43 | # In above column you can see the location of the table. location shows where are all the data files saved for that particular table. let's explore that folder
44 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/employee/"))
45 | 
46 | # COMMAND ----------
47 | 
48 | # All the transaction logs is saved in _delta_log folder
49 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/employee/_delta_log/"))
50 | 
51 | # COMMAND ----------
52 | 
53 | # transaction logs will be saved in json files. let's explore one.
54 | df_trans = spark.sql(f"SELECT * FROM json.`dbfs:/user/hive/warehouse/db_demo.db/employee/_delta_log/00000000000000000001.json`")
55 | display(df_trans)
56 | 
57 | # add column shows files added for that particular transaction & remove shows which file is removed for that transaction
58 | 
59 | # COMMAND ----------
60 | 
61 | # MAGIC %sql
62 | # MAGIC DESCRIBE DETAIL employee
63 | # MAGIC -- Using this you can more details about table. We are interested in numFiles. numfiles shows the data files used for this table. But we earlier saw there are more than 8 files in file location. what is that? 
64 | # MAGIC -- Delta lake keeps the older version of data to use time travel functionality. When you run query it checks the meta data and only include the files which are valid and ignore all others. We do not have to go so deeply in this just better to know how it works.
65 | 
66 | # COMMAND ----------
67 | 
68 | # MAGIC %sql
69 | # MAGIC -- You can check history for a table. It will show you all the changes saved in transaction logs
70 | # MAGIC DESCRIBE HISTORY db_demo.employee
71 | 
72 | # COMMAND ----------
73 | 
74 | # MAGIC %sql
75 | # MAGIC --Check all the databases in system
76 | # MAGIC SHOW DATABASES;
77 | 
78 | # COMMAND ----------
79 | 
80 | # MAGIC %sql
81 | # MAGIC -- Check all the tables in all the databases. before running this command you have to select a database. We have selected the DB_DEMO database in our initial_setup script
82 | # MAGIC SHOW TABLES 
83 | 
84 | # COMMAND ----------
85 | 
86 | # MAGIC %sql
87 | # MAGIC -- SHOW PARTITIONS DB_DEMO.employee;
88 | # MAGIC -- we can use above table to check partitions. As our table does not have partition, we cannot run the command.
89 | 
90 | # COMMAND ----------
91 | 
92 | 
93 | 


--------------------------------------------------------------------------------
/SQL Refresher/SR012-Versioning, Time Travel & Optimization.py:
--------------------------------------------------------------------------------
  1 | # Databricks notebook source
  2 | # MAGIC %md
  3 | # MAGIC ### Versioning, Time Travel & Optimization
  4 | # MAGIC 
  5 | # MAGIC ### OPTIMIZE
  6 | # MAGIC Delta Lake on Databricks can improve the speed of read queries from a table. One way to improve this speed is to coalesce small files into larger ones. You trigger compaction by running the OPTIMIZE command
  7 | # MAGIC 
  8 | # MAGIC ### Z-ORDER
  9 | # MAGIC Data Skipping is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses).
 10 | # MAGIC 
 11 | # MAGIC As new data is inserted into a Databricks Delta table, file-level min/max statistics are collected for all columns (including nested ones) of supported types. Then, when there’s a lookup query against the table, Databricks Delta first consults these statistics to determine which files can safely be skipped.  This is done automatically and no specific commands are required to be run for this.
 12 | # MAGIC 
 13 | # MAGIC * Z-Ordering is a technique to co-locate related information in the same set of files.
 14 | # MAGIC * Z-Ordering maps multidimensional data to one dimension while preserving the locality of the data points.
 15 | # MAGIC 
 16 | # MAGIC 
 17 | # MAGIC ### Z-Order Vs Partition
 18 | # MAGIC Partitioning physically splits the data into different files/directories having only one specific value, while ZOrder provides clustering of related data inside the files that may contain multiple possible values for given column.
 19 | # MAGIC 
 20 | # MAGIC Partitioning is useful when you have a low cardinality column - when there are not so many different possible values - for example, you can easily partition by year & month (maybe by day), but if you partition in addition by hour, then you'll have too many partitions with too many files, and it will lead to big performance problems.
 21 | # MAGIC 
 22 | # MAGIC ZOrder allows to create bigger files that are more efficient to read compared to many small files.
 23 | 
 24 | # COMMAND ----------
 25 | 
 26 | # MAGIC %run ../SETUP/_pyspark_init_setup
 27 | 
 28 | # COMMAND ----------
 29 | 
 30 | # MAGIC %run ../SETUP/_initial_setup
 31 | 
 32 | # COMMAND ----------
 33 | 
 34 | from pyspark.sql.types import StructField, StructType, StringType, DecimalType, IntegerType
 35 | 
 36 | # COMMAND ----------
 37 | 
 38 | # MAGIC %md
 39 | # MAGIC ### Load Data
 40 | 
 41 | # COMMAND ----------
 42 | 
 43 | # We are using a game steam dataset.
 44 | custom_schema = StructType(
 45 | [
 46 |     StructField("gamer_id", IntegerType(), True),
 47 |     StructField("game", StringType(), True),
 48 |     StructField("behaviour", StringType(), True),
 49 |     StructField("play_hours", DecimalType(), True),
 50 |     StructField("rating", IntegerType(), True)
 51 | ])
 52 | df = spark.read.option("header", "true").schema(custom_schema).csv('/FileStore/datasets/steam-200k.csv')
 53 | display(df)
 54 | 
 55 | # COMMAND ----------
 56 | 
 57 | df.write.format("delta").saveAsTable("DB_DEMO.game_stats")
 58 | 
 59 | # COMMAND ----------
 60 | 
 61 | # MAGIC %sql
 62 | # MAGIC DESCRIBE EXTENDED DB_DEMO.game_stats
 63 | 
 64 | # COMMAND ----------
 65 | 
 66 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/game_stats"))
 67 | 
 68 | # COMMAND ----------
 69 | 
 70 | # MAGIC %sql
 71 | # MAGIC -- Check num of files before optimize. The moment I run this command it shows 3 as number of files.
 72 | # MAGIC DESCRIBE DETAIL db_demo.game_stats
 73 | 
 74 | # COMMAND ----------
 75 | 
 76 | # MAGIC %md
 77 | # MAGIC ### OPTIMIZE combines smaller files and create bigger file.
 78 | 
 79 | # COMMAND ----------
 80 | 
 81 | # MAGIC %sql
 82 | # MAGIC -- let;s optimize the table and then see the number of files
 83 | # MAGIC OPTIMIZE db_demo.game_stats
 84 | 
 85 | # COMMAND ----------
 86 | 
 87 | # MAGIC %sql
 88 | # MAGIC -- Check num of files after optimize. After optimizing it is showing 1 file.
 89 | # MAGIC DESCRIBE DETAIL db_demo.game_stats
 90 | 
 91 | # COMMAND ----------
 92 | 
 93 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/game_stats"))
 94 | 
 95 | # COMMAND ----------
 96 | 
 97 | # _delta_log file contains all the transactions files
 98 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/game_stats/_delta_log/"))
 99 | 
100 | # COMMAND ----------
101 | 
102 | display(spark.sql(f"SELECT * FROM json.`dbfs:/user/hive/warehouse/db_demo.db/game_stats/_delta_log/00000000000000000001.json`"))
103 | 
104 | # this was the last transaction which is our optimize command. You will see add column which includes new files added, in our case there is only one new file but in delete columns you can see 3 files deleted. We saw same result using DESCRIBE DETAIL employee command earlier.
105 | 
106 | # COMMAND ----------
107 | 
108 | display(spark.read.format("delta").load('dbfs:/user/hive/warehouse/db_demo.db/game_stats/'))
109 | 
110 | # COMMAND ----------
111 | 
112 | df.write.format("delta").saveAsTable("DB_DEMO.game_stats_new")
113 | 
114 | # COMMAND ----------
115 | 
116 | # MAGIC %md
117 | # MAGIC ### Z-ORDER
118 | # MAGIC It will keep simillary data closer to optimize query.
119 | 
120 | # COMMAND ----------
121 | 
122 | # MAGIC %sql
123 | # MAGIC OPTIMIZE db_demo.game_stats_new
124 | # MAGIC ZORDER BY (game)
125 | 
126 | # COMMAND ----------
127 | 
128 | display(spark.read.format("delta").load('dbfs:/user/hive/warehouse/db_demo.db/game_stats_new/'))
129 | # if you compare this result with above game_stats result, you will see here data of same types are grouped because we applied ZORDER based on game colmn. It is keeping data of same game together.
130 | 
131 | # COMMAND ----------
132 | 
133 | # MAGIC %md 
134 | # MAGIC ### Versioning & Timetravel
135 | 
136 | # COMMAND ----------
137 | 
138 | # MAGIC %md
139 | # MAGIC Lets update table, each transation will create a new version of table. Does that mean we can still access the old version of data? 
140 | # MAGIC 
141 | # MAGIC Let's try it out.
142 | 
143 | # COMMAND ----------
144 | 
145 | # MAGIC %sql
146 | # MAGIC UPDATE DB_DEMO.game_stats
147 | # MAGIC SET rating = CASE WHEN play_hours <10 THEN 3 ELSE 4.5 END
148 | 
149 | # COMMAND ----------
150 | 
151 | # MAGIC %sql
152 | # MAGIC DESCRIBE HISTORY DB_DEMO.game_stats
153 | # MAGIC -- IN our case we have three versions. Let's access the version befor update. that  mean version 1 as version2 is the latest version which was created by our update command.
154 | 
155 | # COMMAND ----------
156 | 
157 | # MAGIC %sql 
158 | # MAGIC -- IN this version we should see our rating as 0
159 | # MAGIC SELECT * FROM DB_DEMO.game_stats VERSION AS OF 1
160 | 
161 | # COMMAND ----------
162 | 
163 | # MAGIC %sql 
164 | # MAGIC -- You can use timestamp also to access older version of data
165 | # MAGIC SELECT * FROM DB_DEMO.game_stats TIMESTAMP AS OF "2022-06-17 07:40:14.000+0000"
166 | 
167 | # COMMAND ----------
168 | 
169 | # MAGIC %sql 
170 | # MAGIC -- This will give you the latest version with non-zero ratings 
171 | # MAGIC SELECT * FROM DB_DEMO.game_stats
172 | 
173 | # COMMAND ----------
174 | 
175 | # MAGIC %md
176 | # MAGIC ### ROLLBACK
177 | # MAGIC Let's say you update or delete your table by mistake & now you want to rollback to the previous version. Is it possible? 
178 | # MAGIC Lets try it out.
179 | 
180 | # COMMAND ----------
181 | 
182 | # MAGIC %sql
183 | # MAGIC -- let's delete our data from game_stats table
184 | # MAGIC DELETE FROM DB_DEMO.game_stats
185 | 
186 | # COMMAND ----------
187 | 
188 | # MAGIC %sql
189 | # MAGIC -- now we can see there are no records
190 | # MAGIC SELECT * FROM DB_DEMO.game_stats
191 | 
192 | # COMMAND ----------
193 | 
194 | # MAGIC %sql
195 | # MAGIC DESCRIBE HISTORY DB_DEMO.game_stats
196 | 
197 | # COMMAND ----------
198 | 
199 | # MAGIC %sql
200 | # MAGIC -- remember version 2 was our version created after we updated the rating . 3rd version is created by delete command. So we want to restore before our delete.
201 | # MAGIC RESTORE TABLE DB_DEMO.game_stats TO VERSION AS OF 2 
202 | 
203 | # COMMAND ----------
204 | 
205 | # MAGIC %sql
206 | # MAGIC --- TADA!! We have restored our data.
207 | # MAGIC  SELECT * FROM DB_DEMO.game_stats
208 | 
209 | # COMMAND ----------
210 | 
211 | # MAGIC %md
212 | # MAGIC ### Purging data using VACUUM
213 | # MAGIC Delta lake versioning is very helpful to restore or access old data but it is not possible to keep all the versions of big data. It will be very expensive if you have GBs or TBs of data. So we should purge or remove outdated file. VACUUME command helps us to acheive it.
214 | # MAGIC 
215 | # MAGIC __This will delete data files permanently.__
216 | 
217 | # COMMAND ----------
218 | 
219 | # MAGIC %sql
220 | # MAGIC -- This will give you error, because it is not a best practice to delete all your old versions. below command woll delete all the old versions because we are using 0 hours. DEFAULT is 168 hours (7 days)
221 | # MAGIC 
222 | # MAGIC -- UNCOMMENT BELOW COMMAND AND RUN
223 | # MAGIC --VACUUM  DB_DEMO.game_stats retain 0 hours
224 | 
225 | # COMMAND ----------
226 | 
227 | # MAGIC %sql
228 | # MAGIC --# If you want to use 0 hourse as retention period, in that case you have to change spark configuration which checks retention duration.
229 | # MAGIC 
230 | # MAGIC SET spark.databricks.delta.retentionDurationCheck.enabled=false;
231 | # MAGIC SET spark.databricks.delta.vacuum.logging.enabled=true;
232 | 
233 | # COMMAND ----------
234 | 
235 | # MAGIC %md
236 | # MAGIC You can use DRY RUN version of VACUUM to print out all the records to be deleted. it will not delete the files.
237 | 
238 | # COMMAND ----------
239 | 
240 | # MAGIC %sql
241 | # MAGIC VACUUM  DB_DEMO.game_stats retain 1 HOURS DRY RUN
242 | # MAGIC -- below versions will be deleted if you run this query.
243 | 
244 | # COMMAND ----------
245 | 
246 | # Before deleting the data files let;s check the folder
247 | display(dbutils.fs.ls('dbfs:/user/hive/warehouse/db_demo.db/game_stats/'))
248 | 
249 | # COMMAND ----------
250 | 
251 | # MAGIC %sql
252 | # MAGIC VACUUM  DB_DEMO.game_stats retain 1 HOURS 
253 | 
254 | # COMMAND ----------
255 | 
256 | # check the folder after purging. It deleted the versions older than 1 hour
257 | display(dbutils.fs.ls('dbfs:/user/hive/warehouse/db_demo.db/game_stats/'))
258 | 
259 | # COMMAND ----------
260 | 
261 | # MAGIC %md
262 | # MAGIC __Sometimes you can still query the older versions because of caching, so it is always better to restart your cluster.__
263 | 
264 | # COMMAND ----------
265 | 
266 | # MAGIC %run ../SETUP/_clean_up
267 | 
268 | # COMMAND ----------
269 | 
270 | # MAGIC %run ../SETUP/_pyspark_clean_up
271 | 
272 | # COMMAND ----------
273 | 
274 | 
275 | 


--------------------------------------------------------------------------------