├── PYSPARK_TESTING └── PYT01 - Unit Testing.py ├── PySpark_ETL ├── PS00-Introduction.py ├── PS01-Read Files.py ├── PS02-Schema Handling.py ├── PS03-Creating Dataframes.py ├── PS04-Basic Transformation.py ├── PS05-Handling JSON.py ├── PS06-JOINS.py ├── PS07-Grouping & Aggregation.py ├── PS08-Ordering Data.py ├── PS09-String Functions.py ├── PS10-Date & Time Functions.py ├── PS11-Partitioning & Repartitioning.py ├── PS12-Missing Value Handling.py ├── PS13-Deduplication.py ├── PS14-Data Profiling using PySpark.py ├── PS15-Data Caching.py ├── PS16-User Defined Functions.py ├── PS17-Write Data.py └── Z01- Case Study Sales Order Analysis.py ├── README.md ├── SETUP ├── _clean_up.py ├── _initial_setup.py ├── _pyspark_clean_up.py ├── _pyspark_init_setup.py ├── _pyspark_setup_files.py ├── _setup_database.py └── _setup_demo_table.py └── SQL Refresher ├── PS000-INTRODUCTION.py ├── SR000-Introduction.py ├── SR001-Basic CRUD.py ├── SR002-Select & Filtering.py ├── SR003-JOINS.py ├── SR004-Order & Grouping.py ├── SR005-Sub Queries.py ├── SR006-Views & Temp Views.py ├── SR007-Common Table Expressions.py ├── SR008 - EXCEPT, UNION, UNION ALL, INTERSECTION.py ├── SR009-External Tables.py ├── SR010-Drop database & tables.py ├── SR011-Check Table & Database Details.py └── SR012-Versioning, Time Travel & Optimization.py /PYSPARK_TESTING/PYT01 - Unit Testing.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ### What is unit testing? 5 | # MAGIC UNIT TESTING is a type of software testing where individual units or components of a software are tested. The purpose is to validate that each unit of the software code performs as expected. Unit Testing is done during the development (coding phase) of an application by the developers. 6 | # MAGIC 7 | # MAGIC unit testing 8 | # MAGIC By 9 | # MAGIC TechTarget Contributor 10 | # MAGIC Unit testing is a software development process in which the smallest testable parts of an application, called units, are individually and independently scrutinized for proper operation. This testing methodology is done during the development process by the software developers and sometimes QA staff. The main objective of unit testing is to isolate written code to test and determine if it works as intended. 11 | # MAGIC 12 | # MAGIC Unit testing is an important step in the development process, because if done correctly, it can help detect early flaws in code which may be more difficult to find in later testing stages. 13 | # MAGIC 14 | # MAGIC Unit testing is a component of test-driven development (TDD), a pragmatic methodology that takes a meticulous approach to building a product by means of continual testing and revision. This testing method is also the first level of software testing, which is performed before other testing methods such as integration testing. Unit tests are typically isolated to ensure a unit does not rely on any external code or functions. Testing can be done manually but is often automated. 15 | # MAGIC 16 | # MAGIC How unit tests work 17 | # MAGIC A unit test typically comprises of three stages: plan, cases and scripting and the unit test itself. In the first step, the unit test is prepared and reviewed. The next step is for the test cases and scripts to be made, then the code is tested. 18 | # MAGIC 19 | # MAGIC Test-driven development requires that developers first write failing unit tests. 
Then they write code and refactor the application until the test passes. TDD typically results in an explicit and predictable code base. 20 | # MAGIC 21 | # MAGIC 22 | # MAGIC Each test case is tested independently in an isolated environment, as to ensure a lack of dependencies in the code. The software developer should code criteria to verify each test case, and a testing framework can be used to report any failed tests. Developers should not make a test for every line of code, as this may take up too much time. Developers should then create tests focusing on code which could affect the behavior of the software being developed. 23 | # MAGIC 24 | # MAGIC Unit testing involves only those characteristics that are vital to the performance of the unit under test. This encourages developers to modify the source code without immediate concerns about how such changes might affect the functioning of other units or the program as a whole. Once all of the units in a program have been found to be working in the most efficient and error-free manner possible, larger components of the program can be evaluated by means of integration testing. Unit tests should be performed frequently, and can be done manually or can be automated. 25 | # MAGIC 26 | # MAGIC ### Types of unit testing 27 | # MAGIC * Unit tests can be performed manually or automated. Those employing a manual method may have an instinctual document made detailing each step in the process; however, automated testing is the more common method to unit tests. Automated approaches commonly use a testing framework to develop test cases. These frameworks are also set to flag and report any failed test cases while also providing a summary of test cases. 28 | # MAGIC 29 | # MAGIC ### Advantages and disadvantages of unit testing 30 | # MAGIC Advantages to unit testing include: 31 | # MAGIC * The earlier a problem is identified, the fewer compound errors occur. 32 | # MAGIC * Costs of fixing a problem early can quickly outweigh the cost of fixing it later. 33 | # MAGIC * Debugging processes are made easier. 34 | # MAGIC * Developers can quickly make changes to the code base. 35 | # MAGIC * Developers can also re-use code, migrating it to new projects. 36 | # MAGIC 37 | # MAGIC ### Concepts in an object-oriented way for Python Unittest 38 | # MAGIC * Test fixture- the preparation necessary to carry out test(s) and related cleanup actions. 39 | # MAGIC * Test case- the individual unit of testing. 40 | # MAGIC * A Test suite- collection of test cases, test suites, or both. 41 | # MAGIC * Test runner- component for organizing the execution of tests and for delivering the outcome to the user. 42 | # MAGIC 43 | # MAGIC __*always start unittest function name with "test_".*__ 44 | 45 | # COMMAND ---------- 46 | 47 | #Creating Add function for addition 48 | def add(a, b): 49 | return a + b 50 | #Creating multi function for multiplication 51 | def multi(a,b): 52 | return a*b 53 | 54 | def subt(a,b): 55 | return a-b 56 | 57 | # COMMAND ---------- 58 | 59 | import unittest 60 | 61 | #Creating Class for Unit Testing 62 | class test_class(unittest.TestCase): 63 | def test_add(self): 64 | self.assertEqual(10, add(7, 3)) 65 | 66 | def test_multi(self): 67 | self.assertEqual(25,multi(5,5)) 68 | 69 | @unittest.skip("OBSELETE METHOD") 70 | def test_multi(self): 71 | self.assertEqual(25,multi(5,5)) 72 | 73 | # create a test suite using loadTestsFromTestCase() 74 | suite = unittest.TestLoader().loadTestsFromTestCase(test_class) 75 | #Running test cases using Test Cases Suit. 
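# Note: both multiplication tests above are named test_multi, so the second (skipped) definition overrides the first in the class body - unittest will only discover test_add plus one skipped test_multi.
# TextTestRunner(verbosity=2) prints one line per test; run() returns a unittest.TestResult, so p.wasSuccessful() can be checked afterwards if you need to fail a job when any test fails.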
76 | p = (unittest.TextTestRunner(verbosity=2).run(suite)) 77 | 78 | # COMMAND ---------- 79 | 80 | # MAGIC %md 81 | # MAGIC ### Demo 82 | # MAGIC Let's perform unit testing with real dataset. 83 | 84 | # COMMAND ---------- 85 | 86 | # MAGIC %md 87 | # MAGIC ### Read Data 88 | 89 | # COMMAND ---------- 90 | 91 | # MAGIC %run ../SETUP/_pyspark_init_setup 92 | 93 | # COMMAND ---------- 94 | 95 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 96 | df_od = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderdetails.csv") 97 | df_st = spark.read.option("header", "true").csv("/FileStore/datasets/sales/salestarget.csv") 98 | 99 | # COMMAND ---------- 100 | 101 | from pyspark.sql.functions import col, round 102 | 103 | # COMMAND ---------- 104 | 105 | # MAGIC %md 106 | # MAGIC ### joining order list and order details. 107 | 108 | # COMMAND ---------- 109 | 110 | df_join = df_ol\ 111 | .join(df_od, df_ol["Order ID"]==df_od["Order ID"], "inner")\ 112 | .withColumn("Profit", col("Profit").cast("decimal(10,2)"))\ 113 | .withColumn("Amount", col("Amount").cast("decimal(10,2)"))\ 114 | .select(df_ol["Order ID"], "Amount", "Profit", "Quantity", "Category", "State", "City")\ 115 | .limit(50) 116 | display(df_join) 117 | 118 | # COMMAND ---------- 119 | 120 | # MAGIC %md 121 | # MAGIC ### Preparing smaller dataset for unit testing 122 | 123 | # COMMAND ---------- 124 | 125 | df_ut = df_join.limit(50) # defining dataframe to perform our unit test. We will tes our functions for a smaller dataset that is why we are only taking 50 records. 126 | 127 | display(df_ut) 128 | 129 | # COMMAND ---------- 130 | 131 | # MAGIC %md 132 | # MAGIC ### Functions to calculate total & average sales 133 | 134 | # COMMAND ---------- 135 | 136 | # function calculates average profit for each state, round upto 2 decimal digits 137 | def state_avg_profit(df): 138 | return df.groupBy("State").mean("Profit").withColumn("avg(Profit)", round(col("avg(Profit)"), 2)).orderBy("State") 139 | 140 | # calculates total sales for each state 141 | def state_total_sales(df): 142 | return df.groupBy("State").sum("Amount").withColumn("sum(Amount)", round(col("sum(Amount)"), 2)).orderBy("State") 143 | 144 | # COMMAND ---------- 145 | 146 | # df_avg_profit consists average sale for each state. Calculating average & total sales for our testing dataframe 147 | df_avg_profit = state_avg_profit(df_ut) 148 | display(df_avg_profit) 149 | 150 | df_total_sales = state_total_sales(df_ut) 151 | display(df_total_sales) 152 | 153 | # COMMAND ---------- 154 | 155 | # MAGIC %md 156 | # MAGIC ### Now our aim is to test our state_avg_profit() function. 157 | # MAGIC We will compare Gujarat's average sale & total sale. for testing purpose I have calculated Gujarat's average profit in an excel sheet(for out unit testing dataframe (50 records) ). It should be -225.6. 158 | 159 | # COMMAND ---------- 160 | 161 | # Defining variable which include average & total sales of gujarat. 
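# Note: assertEqual on floating point values can be brittle; unittest.TestCase also provides assertAlmostEqual(expected, actual, places=2), which is usually a safer way to compare rounded averages like the ones defined below.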
162 | 163 | # COMMAND ---------- 164 | 165 | avg_sale_guj = float(df_avg_profit.where(col("State")=="Gujarat" ).collect()[0]["avg(Profit)"]) 166 | total_sale_guj = float(df_total_sales.where(col("State")=="Gujarat" ).collect()[0]["sum(Amount)"]) 167 | 168 | # COMMAND ---------- 169 | 170 | # MAGIC %md 171 | # MAGIC ### Defining unit test & performing 172 | 173 | # COMMAND ---------- 174 | 175 | import unittest 176 | 177 | #Creating Class for Unit Testing 178 | class sales_unit_class(unittest.TestCase): 179 | 180 | def test_avg_profit(self): 181 | expected_avg_sales = -225.60 # This is the value which is expected we calculated using excel 182 | self.assertEqual(expected_avg_sales, avg_sale_guj) 183 | 184 | def test_record_count(self): 185 | # this is the expected sale, I have put wrong value to fail this test. Correct value should be 1782.0 186 | #If you will change it to correct value, it will pass the unit test. 187 | expected_total_sales = 1789.0 188 | self.assertEqual(expected_total_sales, total_sale_guj) 189 | 190 | # create a test suite for test_class using loadTestsFromTestCase() 191 | suite = unittest.TestLoader().loadTestsFromTestCase(sales_unit_class) 192 | #Running test cases using Test Cases Suit.. 193 | unittest.TextTestRunner(verbosity=2).run(suite) 194 | 195 | # COMMAND ---------- 196 | 197 | # MAGIC %md 198 | # MAGIC ### Attention: 199 | # MAGIC __*This is for demo purpose that is why we are writing unit testing in the same notebook but it is always prefered to write unit testin in a separate notebook.*__ 200 | 201 | # COMMAND ---------- 202 | 203 | # MAGIC %run ../SETUP/_pyspark_clean_up 204 | 205 | # COMMAND ---------- 206 | 207 | 208 | -------------------------------------------------------------------------------- /PySpark_ETL/PS00-Introduction.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### PySpark ETL 4 | # MAGIC In this tutorial we will learn basic ETL pipelines using pyspark. ETL stands for Extract, Transform & Load. 5 | # MAGIC 6 | # MAGIC * Extract - Extracting data from a source to your storage. e.g csv to dataframe 7 | # MAGIC * Tranform - applying transformation e.g. adding new column, rename column, change type, drop columns, derived columns, filter operations, joins etc. 8 | # MAGIC * Load - Load back processed or transformed data to the sink/destination storage. 9 | # MAGIC 10 | # MAGIC All the code is available in github, feel free to pull or fork the repository: 11 | # MAGIC 12 | # MAGIC __https://github.com/martandsingh/ApacheSpark__ 13 | # MAGIC 14 | # MAGIC follow me on linkedin for more updates: 15 | # MAGIC 16 | # MAGIC __https://www.linkedin.com/in/martandsays/__ 17 | # MAGIC 18 | # MAGIC ![PYSPARK_ETL](https://raw.githubusercontent.com/martandsingh/images/master/etl_banner.gif) 19 | 20 | # COMMAND ---------- 21 | 22 | 23 | -------------------------------------------------------------------------------- /PySpark_ETL/PS01-Read Files.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Read CSV file using PySpark 4 | # MAGIC 5 | # MAGIC ### What is SparkContext? 6 | # MAGIC A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. It is Main entry point for Spark functionality. 7 | # MAGIC 8 | # MAGIC *Note: Only one SparkContext should be active per JVM. 
You must stop() the active SparkContext before creating a new one.* 9 | # MAGIC 10 | # MAGIC ### What is SparkSession? 11 | # MAGIC SparkSession is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application. 12 | # MAGIC 13 | # MAGIC As a Spark developer, you create a SparkSession using the SparkSession.builder method (that gives you access to Builder API that you use to configure the session). 14 | # MAGIC 15 | # MAGIC 16 | # MAGIC ### Spark Context Vs Spark Session 17 | # MAGIC SparkSession vs SparkContext – Since earlier versions of Spark or Pyspark, SparkContext (JavaSparkContext for Java) is an entry point to Spark programming with RDD and to connect to Spark Cluster, Since Spark 2.0 SparkSession has been introduced and became an entry point to start programming with DataFrame and Dataset. 18 | # MAGIC 19 | # MAGIC 20 | # MAGIC By default, Databricks notebook provides a spark context object named "spark". This is the prebuild context object, we can use it directly. 21 | # MAGIC ![PYSPARK_CSV](https://raw.githubusercontent.com/martandsingh/images/master/pyspark-read-csv.png) 22 | 23 | # COMMAND ---------- 24 | 25 | spark 26 | 27 | # COMMAND ---------- 28 | 29 | # MAGIC %run ../SETUP/_pyspark_init_setup 30 | 31 | # COMMAND ---------- 32 | 33 | # MAGIC %md 34 | # MAGIC ### Read CSV file 35 | # MAGIC We will read a CSV file from our __DBFS (Databricks File Storage)__. I have upload my CSV file to FileStore/tables/cancer.csv 36 | # MAGIC You can find all the dataset used in these tutorials at https://github.com/martandsingh/datasets. 37 | 38 | # COMMAND ---------- 39 | 40 | # Import CSV file. For csv file, by default delimiter is comma so no need to mention in case of comma separated values. But for other delmiter you have to mention it. 41 | df_csv = spark \ 42 | .read \ 43 | .option("encoding", "UTF-8") \ 44 | .option("delmiter", ",") \ 45 | .option("header", "True") \ 46 | .csv("/FileStore/datasets/cancer.csv") 47 | 48 | # COMMAND ---------- 49 | 50 | # display function will visualize your dataset in a beautiful way. This is databricks notebook function which will not work in your spark scripts. 51 | display(df_csv) 52 | 53 | # COMMAND ---------- 54 | 55 | df_csv.show() # prints top 20 records. It does not return anyting. 56 | 57 | # COMMAND ---------- 58 | 59 | df_csv.show(2) # showing only top 2 rows with header. 60 | 61 | # COMMAND ---------- 62 | 63 | li = df_csv.take(2) # this will return a list containing 2 rows 64 | print(li) 65 | 66 | # COMMAND ---------- 67 | 68 | # you can access the list 69 | print(len(li)) 70 | print(li[0]["State"]) 71 | 72 | # COMMAND ---------- 73 | 74 | all_record = df_csv.collect() # return whole dataset. Do not use it until you need it. If you have very big dataset, this will take a huge time & computation to finish. 75 | #print(all_record) # I am commenting the command, you may uncomment and run, but make sure you do not have a very big dataset 76 | 77 | # COMMAND ---------- 78 | 79 | # MAGIC %md 80 | # MAGIC ### READ JSON 81 | 82 | # COMMAND ---------- 83 | 84 | df_json_sales = spark.read.option("multiline", "true").json("/FileStore/datasets/unece.json") 85 | 86 | # COMMAND ---------- 87 | 88 | display(df_json_sales) 89 | 90 | # COMMAND ---------- 91 | 92 | # MAGIC %md 93 | # MAGIC Now we just loaded a JSON file to our spark dataframe. Let's try one more... 
94 | 95 | # COMMAND ---------- 96 | 97 | df_json = spark.read.option("multiline", "true").json("/FileStore/datasets/used_cars_nested.json") 98 | 99 | # COMMAND ---------- 100 | 101 | display(df_json) 102 | 103 | # COMMAND ---------- 104 | 105 | # MAGIC %md 106 | # MAGIC ### Oopss.... 107 | # MAGIC What kind of json is this? 108 | # MAGIC 109 | # MAGIC You must be thinking the same thing. But actually, this is the correct behaviour. If you will compare the structure of our Sales.json & used_cars_nested.json, you will see that later one is nested JSON with complicate structure & complicate means you have to do some more work to clean this json to convert it into a tabular form. In real life, you will not get a simple json like Sales.json, most of the time real life datasets are much complicate. 110 | # MAGIC 111 | # MAGIC *__So in our next notebook we will see how to deal with nested or complex json files.__* 112 | 113 | # COMMAND ---------- 114 | 115 | # MAGIC %md 116 | # MAGIC ### Read Parquet File 117 | # MAGIC ### What is Parquet? 118 | # MAGIC Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC. 119 | # MAGIC #### Characteristics of Parquet 120 | # MAGIC 1. Free and open source file format. 121 | # MAGIC 1. Language agnostic. 122 | # MAGIC 1. Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries. 123 | # MAGIC 1. Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases. 124 | # MAGIC 1. Highly efficient data compression and decompression. 125 | # MAGIC 1. Supports complex data types and advanced nested data structures. 126 | # MAGIC 127 | # MAGIC #### Benefits of Parquet 128 | # MAGIC 1. Good for storing big data of any kind (structured data tables, images, videos, documents). 129 | # MAGIC 1. Saves on cloud storage space by using highly efficient column-wise compression, and flexible encoding schemes for columns with different data types. 130 | # MAGIC 1. Increased data throughput and performance using techniques like data skipping, whereby queries that fetch specific column values need not read the entire row of data. 131 | 132 | # COMMAND ---------- 133 | 134 | df_par = spark.read.parquet("/FileStore/datasets/USED_CAR_PARQUET/") 135 | display(df_par) 136 | 137 | # COMMAND ---------- 138 | 139 | # MAGIC %run ../SETUP/_pyspark_clean_up 140 | 141 | # COMMAND ---------- 142 | 143 | # MAGIC %md 144 | # MAGIC In this notebook, we simply learnt how to read different format files. This was very basic notebook, so you may wondering is it worth creating a separate notebook for reading files? I would say "yes!!", as databricks is a vast technology growing frequently. So in future there me othere use cases which we can add in our "Read Files" notebook. So it is worth creating a notebook for it. 145 | # MAGIC 146 | # MAGIC ### Assignment: 147 | # MAGIC 1. Try loading JSON (simple & nested). We will learn more about nested json in our next notebook. 148 | # MAGIC 2. 
Try loading TSV(tab separated file) or any other flat file has a delimiter other than comma(',') 149 | 150 | # COMMAND ---------- 151 | 152 | 153 | -------------------------------------------------------------------------------- /PySpark_ETL/PS02-Schema Handling.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Schema Handling 4 | # MAGIC In this notebook, we will learn: 5 | # MAGIC 1. What is inferSchema? 6 | # MAGIC 1. How to check dataframe schema? 7 | # MAGIC 1. How to define custom schema? 8 | # MAGIC 9 | # MAGIC ### Whats is inferSchema? 10 | # MAGIC Infer schema will automatically guess the data types for each field. If we set this option to TRUE, the API will read some sample records from the file to infer the schema. 11 | # MAGIC 12 | # MAGIC InferSchema option is __false by default__ that is why all the columns are string by default. By setting __inferSchema=true__, Spark will automatically go through the csv file and infer the schema of each column. This requires an extra pass over the file which will result in reading a file with __inferSchema set to true being slower__. But in return the dataframe will most likely have a correct schema given its input. 13 | # MAGIC 14 | # MAGIC ### How to check dataframe schema? 15 | # MAGIC To check dataframe schema you can simply use printSchema(). 16 | # MAGIC 17 | # MAGIC example: df.printSchema() 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %run ../SETUP/_pyspark_init_setup 22 | 23 | # COMMAND ---------- 24 | 25 | from pyspark.sql.types import StringType, IntegerType, StructField, StructType 26 | 27 | # COMMAND ---------- 28 | 29 | # MAGIC %md 30 | # MAGIC ### Method 1: Inferschema 31 | 32 | # COMMAND ---------- 33 | 34 | df_infer = spark \ 35 | .read \ 36 | .option("header", "true") \ 37 | .csv("/FileStore/datasets/cancer.csv") 38 | 39 | df_infer.printSchema() 40 | 41 | # You can see all the columns has string type as inferSchema is set false by default. 42 | 43 | # COMMAND ---------- 44 | 45 | df_infer = spark \ 46 | .read \ 47 | .option("header", "true") \ 48 | .option("inferSchema", "true")\ 49 | .csv("/FileStore/datasets/cancer.csv") 50 | 51 | df_infer.printSchema() 52 | # So now you will see datatypes are different, schema will check all the data and will choose corrct type for each columns. Keep in mind this option is slower as spark has to traverse whole data to identity data type. So avoid this for big data sets 53 | 54 | # COMMAND ---------- 55 | 56 | # MAGIC %md 57 | # MAGIC ### Method 2: Custom Schema 58 | 59 | # COMMAND ---------- 60 | 61 | # Let's create a test dataframe & a custom schema. We can you the csv file to create dataframe also but just for the demo I want to use smaller number of columns which will help you to understand better & save time also. 62 | 63 | # Create python list of data 64 | data = [("James","","Smith","36636","M",50,3000), 65 | ("Michael","Rose","","40288","M",43,4000), 66 | ("Robert","","Williams","42114","M",23,4000), 67 | ("Maria","Anne","Jones","39192","F",56,4000), 68 | ("Jen","Mary","Brown","33341","F",34,1500) 69 | ] 70 | 71 | # defining custom schema. Your schema will be StructType of StructField array. then you can define any custom name for your column & type. 72 | # StructField("middlename",StringType(),True) - this means our column name is middlename which is String type. The third True parameter determines whether the column can contain null values or not? 
True mean column allows null values. 73 | 74 | schema = StructType([ 75 | StructField("firstname",StringType(),True), 76 | StructField("middlename",StringType(),True), 77 | StructField("lastname",StringType(),True), 78 | StructField("id", StringType(), True), 79 | StructField("gender", StringType(), True), 80 | StructField("age", IntegerType(), True), 81 | StructField("salary", IntegerType(), True) 82 | ]) 83 | 84 | 85 | 86 | # COMMAND ---------- 87 | 88 | df = spark.createDataFrame(data=data,schema=schema) 89 | df.printSchema() 90 | # so you can see our age & salary columns are integer type while all other have string type. This way of assigning schema is faster than inferSchema. 91 | 92 | # COMMAND ---------- 93 | 94 | display(df) 95 | 96 | # COMMAND ---------- 97 | 98 | # MAGIC %md 99 | # MAGIC Let's read one CSV with custom schema & header. 100 | 101 | # COMMAND ---------- 102 | 103 | # Lets define a custom schema first 104 | custom_schema = StructType( 105 | [StructField("customer_id", IntegerType(), False), 106 | StructField("gender", StringType(), True), 107 | StructField("age", IntegerType(), True), 108 | StructField("annual_income_k_usd", IntegerType(), True), 109 | StructField("score", IntegerType(), True), 110 | ] 111 | ) 112 | 113 | # COMMAND ---------- 114 | 115 | # assigning custome schema 116 | df_mall = spark \ 117 | .read \ 118 | .schema(custom_schema) \ 119 | .option("header", "true") \ 120 | .csv("/FileStore/datasets/Mall_Customers.csv") 121 | display(df_mall) 122 | 123 | # COMMAND ---------- 124 | 125 | # we can see the column names & types are correct. It is always recommended to check dataframe schema before performing other operations. It will be helpful to decide whether your dataframe needs any type casting. We will study about type casting in further notebooks. So do not worry if you do not understand any piece of code. This is just introductory notebook. There is a separate notebook to explain all the basic transformation using pyspark. 126 | 127 | df_mall.printSchema() 128 | 129 | # COMMAND ---------- 130 | 131 | # MAGIC %run ../SETUP/_pyspark_clean_up 132 | 133 | # COMMAND ---------- 134 | 135 | 136 | -------------------------------------------------------------------------------- /PySpark_ETL/PS03-Creating Dataframes.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Dataframe 4 | # MAGIC 5 | # MAGIC A distributed collection of data grouped into named columns. 6 | # MAGIC A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession. 7 | # MAGIC 8 | # MAGIC __createDataFrame()__ and __toDF()__ methods are two different way’s to create DataFrame in spark. By using toDF() method, we don’t have the control over schema customization whereas in createDataFrame() method we have complete control over the schema customization. Use toDF() method only for local testing. But we can use createDataFrame() method for both local testings as well as for running the code in production. 9 | # MAGIC 10 | # MAGIC __toDF()__ is used to convert rdd to dataframe. 
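# MAGIC For example, for a hypothetical two-column RDD you can still pass column names directly, e.g. `rdd.toDF(["lang", "user"])` - only the names are customized this way, not the types or nullability.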
11 | # MAGIC 12 | # MAGIC 13 | # MAGIC ![DATAFRAME](https://raw.githubusercontent.com/martandsingh/images/master/dataframe.jpeg) 14 | 15 | # COMMAND ---------- 16 | 17 | # MAGIC %run ../SETUP/_pyspark_init_setup 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %md 22 | # MAGIC ### createDataFrame() 23 | 24 | # COMMAND ---------- 25 | 26 | from pyspark.sql.types import StructType, StructField, StringType, IntegerType 27 | 28 | 29 | # COMMAND ---------- 30 | 31 | # Dataframe using list of tuples. Here we can apply custom header 32 | schema = StructType( [\ 33 | StructField("lang", StringType(),True),\ 34 | StructField("user", IntegerType(),True)\ 35 | ]) 36 | data = [("English", 12413), ("Hindi", 455543)] 37 | df = spark.createDataFrame(data, schema) 38 | display(df) 39 | 40 | # COMMAND ---------- 41 | 42 | # CreateDataFrame using dictionary list 43 | 44 | data_tuple = [{"lang":"English", "user": 10000}, {"lang":"Spanish", "user": 12452}] 45 | df = spark.createDataFrame(data_tuple) 46 | display(df) 47 | 48 | # COMMAND ---------- 49 | 50 | # MAGIC %md 51 | # MAGIC ### toDF() 52 | # MAGIC This is used to convert rdd to dataframes. 53 | 54 | # COMMAND ---------- 55 | 56 | #Lets create an rdd with a list 57 | data = [("Football", 34566), ("Cricket", 2536), ("Baseball", 1234)] 58 | rdd = sc.parallelize(data) 59 | df_rdd =rdd.toDF() 60 | display(df_rdd) 61 | 62 | # COMMAND ---------- 63 | 64 | # MAGIC %md 65 | # MAGIC ### using files 66 | # MAGIC This type of dataframes we are going to see in our whole course. We will create dataframe using CSv, JSON & parquet files. but for this particular notebook demo I will use CSV source. 67 | 68 | # COMMAND ---------- 69 | 70 | df_csv = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 71 | display(df_csv) 72 | 73 | # COMMAND ---------- 74 | 75 | # MAGIC %run ../SETUP/_pyspark_clean_up 76 | -------------------------------------------------------------------------------- /PySpark_ETL/PS04-Basic Transformation.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ### Basic Data Transformation 5 | # MAGIC In this notebook we will learn: 6 | # MAGIC 1. Read parquet file 7 | # MAGIC 1. Check row & columns 8 | # MAGIC 1. Describe function 9 | # MAGIC 1. Select columns 10 | # MAGIC 1. Filter columns 11 | # MAGIC 1. Case statement 12 | # MAGIC 1. Add new column 13 | # MAGIC 1. Rename column 14 | # MAGIC 1. Drop column 15 | # MAGIC 1. String comparison 16 | # MAGIC 1. AND & OR condition 17 | # MAGIC 1. EXPR 18 | # MAGIC 19 | # MAGIC ![DATA_TRANSFORMATION](https://raw.githubusercontent.com/martandsingh/images/master/transformation.png) 20 | 21 | # COMMAND ---------- 22 | 23 | # MAGIC %run ../SETUP/_pyspark_init_setup 24 | 25 | # COMMAND ---------- 26 | 27 | # MAGIC %md 28 | # MAGIC ### Read Parquet File 29 | # MAGIC pyspark_init_setup will setup data files in your dbfs location including parquet file. We will use it to perform basic operations. 30 | 31 | # COMMAND ---------- 32 | 33 | df_raw = spark.read.parquet("/FileStore/datasets/USED_CAR_PARQUET/") 34 | display(df_raw) 35 | 36 | # COMMAND ---------- 37 | 38 | # MAGIC %md 39 | # MAGIC ### Print Schema 40 | 41 | # COMMAND ---------- 42 | 43 | # Check dataframe schema 44 | df_raw.printSchema() 45 | 46 | # COMMAND ---------- 47 | 48 | # MAGIC %md 49 | # MAGIC ### Describe function 50 | # MAGIC As name suggests it will give you a quick summary of your dataframe e.g. 
count, mean, median and other information. 51 | 52 | # COMMAND ---------- 53 | 54 | df_describe = df_raw.describe() # returns a dataframe 55 | display(df_describe) 56 | 57 | # COMMAND ---------- 58 | 59 | # MAGIC %md 60 | # MAGIC ### Check Rows & Columns 61 | 62 | # COMMAND ---------- 63 | 64 | # df.columns returns python list including all the columns 65 | all_columns = df_raw.columns 66 | print(all_columns) 67 | 68 | # COMMAND ---------- 69 | 70 | total_rows = df_raw.count() 71 | print(total_rows) 72 | 73 | # COMMAND ---------- 74 | 75 | # MAGIC %md 76 | # MAGIC ### Select Columns 77 | 78 | # COMMAND ---------- 79 | 80 | # Choose only required columns 81 | display(df_raw.select("vehicle_type", "brand_name", "model", "price")) 82 | 83 | # COMMAND ---------- 84 | 85 | # MAGIC %md 86 | # MAGIC ### col function 87 | # MAGIC Returns a Column based on the given column name. We can use col("columnname") to select column 88 | 89 | # COMMAND ---------- 90 | 91 | # include col function 92 | from pyspark.sql.functions import col 93 | 94 | # COMMAND ---------- 95 | 96 | # below query will return sae resultset as above 97 | display(df_raw.select( col("vehicle_type"), col("brand_name"), col("model"), col("price") )) 98 | 99 | # COMMAND ---------- 100 | 101 | # MAGIC %md 102 | # MAGIC ### LIMIT 103 | # MAGIC limit function is used to limit number of rows. If you want to select top n records, you can use limit functions. 104 | 105 | # COMMAND ---------- 106 | 107 | # select top 5 records 108 | df_limit = df_raw.select( col("vehicle_type"), col("brand_name"), col("model"), col("price") ).limit(5) 109 | display(df_limit) 110 | 111 | # COMMAND ---------- 112 | 113 | # MAGIC %md 114 | # MAGIC ### Add New Column 115 | # MAGIC We can add new column using withColumn("{column-name}", {value}) 116 | 117 | # COMMAND ---------- 118 | 119 | # Lets create a new column full_name with brand_name + model name in capital letters & select only top 5 full_name & price column. 120 | # withColumn is used to add new column. 121 | # import concat_ws functions. this function concat two string with the provided separator. 122 | from pyspark.sql.functions import concat_ws, upper 123 | 124 | df_processed = df_raw \ 125 | .withColumn("full_name", concat_ws(' ', col("brand_name"), col("model")) ) 126 | 127 | 128 | display(df_processed) 129 | # You can see the full_name column added to the dataframe (scroll right). 130 | 131 | # COMMAND ---------- 132 | 133 | # MAGIC %md 134 | # MAGIC ### Rename Column 135 | # MAGIC You can rename your column with 136 | 137 | # COMMAND ---------- 138 | 139 | df_processed = df_processed \ 140 | .withColumnRenamed("full_name", "vehicle_name") 141 | 142 | display(df_processed.limit(5)) 143 | # so we can see our output has new column name. 144 | 145 | # COMMAND ---------- 146 | 147 | # MAGIC %md 148 | # MAGIC ### Drop Column(s) 149 | # MAGIC You can drop one or more column using drop() 150 | 151 | # COMMAND ---------- 152 | 153 | display(df_raw.drop("description", "ad_title", "seller_location")) 154 | # so here we used drop to delete few columns. But there is one very interesting thing with spark dataframe. Spark dataframes are immutable. This means whenever you perform any transformation in your dataset, it creates a new dataset. This mean dropping those columns will generate a new dataframe, it will not delete those columns from the original dataframe(df_raw). let try displa(df_raw) in next cell. 
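# Note: drop() is a no-op for column names that do not exist in the dataframe, so a misspelled column name will not raise an error.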
155 | 156 | # COMMAND ---------- 157 | 158 | display(df_raw.limit(4)) 159 | # we can see those 4 columns are still there. This is because of immutabile characterstics of spark dataframe. This is one of the major differences between spark & pandas dataframe. 160 | # So everytime you perform any transformation, you have to create a new dataframe to persist those changes. 161 | 162 | # COMMAND ---------- 163 | 164 | # thats why we will create a new dataframe 165 | df_processed = df_raw.drop("description", "ad_title", "seller_location") 166 | df_processed.printSchema() # now this dataframe will not include dropped columns. 167 | 168 | # COMMAND ---------- 169 | 170 | # MAGIC %md 171 | # MAGIC ### Filter Data 172 | # MAGIC You can use filter() to filter you data based on specific condition. It is simillar to SQL WHERE keyword. 173 | 174 | # COMMAND ---------- 175 | 176 | # Lets select all the cars with body_type as SUV. Only select brand_name, body_type, price, displacement 177 | df_SUV = df_processed \ 178 | .filter(col("body_type")== "SUV")\ 179 | .select("brand_name", "body_type", "price", "displacement") 180 | 181 | display(df_SUV) 182 | 183 | # COMMAND ---------- 184 | 185 | # MAGIC %md 186 | # MAGIC ### CASE Statement 187 | # MAGIC The SQL CASE Statement. The CASE statement goes through conditions and returns a value when the first condition is met (like an if-then-else statement). 188 | # MAGIC 189 | # MAGIC Let's categorise our data. Create a new column called "power" based on displacement value. 190 | # MAGIC 191 | # MAGIC - > Displacement < 1500 (less than 1500) -> Low 192 | # MAGIC 193 | # MAGIC - > 1500 >= Displacement < 2500 -> Medium 194 | # MAGIC 195 | # MAGIC - > Displacement >= 2500 -> Strong 196 | 197 | # COMMAND ---------- 198 | 199 | from pyspark.sql.functions import when, regexp_replace 200 | 201 | # we are adding a new column "power" which is being calculated using displacement. displacement contains cc. So first we will use regexp_replace to replace text cc. i.e. 100cc -> 100, 800cc -> 800. Then we will use "when" to perform conditional check. 202 | df_power = df_processed \ 203 | .withColumn("power", 204 | when(regexp_replace(df_raw.displacement, "cc", "") < 1500, "LOW") 205 | .when(( regexp_replace(df_raw.displacement, "cc", "") >= 1500) & (df_raw.displacement < 2500), "MEDIUM") 206 | .otherwise("HIGH") 207 | ).select("vehicle_type", "brand_name", "displacement", "power") 208 | 209 | display(df_power) 210 | 211 | # COMMAND ---------- 212 | 213 | # MAGIC %md 214 | # MAGIC ### AND & OR 215 | # MAGIC These keyword comes handy when we are defining multiple filter condition. 216 | # MAGIC 217 | # MAGIC 1. AND- represented by &. it means all the condition in the expression must be true. 218 | # MAGIC 219 | # MAGIC example: I need a White Suzuki car. It mean my car must be white & brand must be Suzuki. It has to satisfy both the criteria. I will not accept Black Suzuki or White Honda. 220 | # MAGIC 221 | # MAGIC TRUE & TRUE = TRUE ( this is and condition in binary form. So both of the expression must be true) 222 | # MAGIC 223 | # MAGIC 2. OR - represent by |. It means any one of multiple condition is true. 224 | # MAGIC 225 | # MAGIC TRUE + FALSE = TRUE 226 | 227 | # COMMAND ---------- 228 | 229 | # Let's see one more example. But this time we will use 2 filter conditions. Select all white suzuki cars. 
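# Note: each condition below is wrapped in its own parentheses. This is required because & and | bind more tightly than == in Python, so unparenthesised conditions would not be evaluated the way you expect.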
230 | df_suz_wh = df_processed.filter( (col("brand_name")=="Suzuki") & (col("color")=="White") ) 231 | display(df_suz_wh) 232 | # we can see our resultset only include White Suzuki. 233 | 234 | # COMMAND ---------- 235 | 236 | # I want to buy a car, but I have one condition. It should be either Honda or Suzuki. let's write the query. Here we will use OR condition as I have two options Honda & Suzuki. Any one of them is acceptable. Keep in mind comparison is always case sensitive. Honda is different then honda. So col("brand_name") == "Honda" & col("brand_name") == "honda" will produce different resultset. Try this in your assignment. 237 | 238 | df_car = df_processed.filter( (col("brand_name") == "Honda") | (col("brand_name") == "Suzuki" ) ) 239 | display(df_car) 240 | 241 | # COMMAND ---------- 242 | 243 | # MAGIC %md 244 | # MAGIC ## String comparison is case sensitive !!! 245 | # MAGIC As I mentioned earlier text based comparison is always case sensitive. Honda is different than honda. So what is the best possible way to compare string? 246 | # MAGIC 247 | # MAGIC There are multiple ways but which I prefer is using __UPPER OR LOWER__ case function while comparing. Let's see the using example. We will slightly change above query. 248 | 249 | # COMMAND ---------- 250 | 251 | from pyspark.sql.functions import upper 252 | 253 | # I added upper function & if you noticed, I changed the case of "Honda" to "honda". It is still giving me same result. The query will convert both side of comparison string to upper case and then compare. In this way you guarantee that both sides will always have same case. you can use lower function also for this. 254 | df_car = df_processed.filter( (upper(col("brand_name")) == "honda".upper() ) | ( upper(col("brand_name")) == "Suzuki".upper() ) ) 255 | display(df_car) 256 | 257 | # COMMAND ---------- 258 | 259 | # MAGIC %md 260 | # MAGIC ### EXPR - expression 261 | # MAGIC Using EXPR we can run SQL statements in pyspark statements. 262 | 263 | # COMMAND ---------- 264 | 265 | # lets categorize our data using SQL expression. 266 | from pyspark.sql.functions import expr 267 | df_categorized_expr = df_processed \ 268 | .withColumn("power", expr(''' 269 | CASE WHEN displacement < 1500 THEN "LOW" 270 | WHEN displacement >= 1500 AND displacement < 2500 THEN "MEDIUM" 271 | ELSE "HIGH" END''' )) \ 272 | .select("vehicle_type", "body_type", "brand_name", "displacement", "power") 273 | 274 | display(df_categorized_expr) 275 | 276 | # COMMAND ---------- 277 | 278 | # MAGIC %md 279 | # MAGIC ### Chaining 280 | # MAGIC In our examples we apply different kind of operations in different cell, but apache spark allows you to chain multiple statements. 281 | # MAGIC e.g. df.operation1.operation2.operation3..... 282 | # MAGIC 283 | # MAGIC __Lets do following operations on our dataset:__ 284 | # MAGIC 1. delete column description, manufacturer & adtitle 285 | # MAGIC 1. update column displacement, remove cc from values. e.g. 800cc -> 800 286 | # MAGIC 1. create a new column price_in_k = price/1000. it will proce price in 1000s. 287 | # MAGIC 1. 
select only body_type, brand_name, color, displacement, price_in_k 288 | 289 | # COMMAND ---------- 290 | 291 | from pyspark.sql.functions import regexp_replace, col 292 | 293 | df_processed = df_raw\ 294 | .withColumn("displacement", regexp_replace(col("displacement"), "cc", ""))\ 295 | .withColumn("price_in_k", col("price")/1000)\ 296 | .drop("description", "manufacturer", "adtitle")\ 297 | .select("body_type", "brand_name", "color", "displacement", "price_in_k") 298 | 299 | display(df_processed) 300 | 301 | # COMMAND ---------- 302 | 303 | # MAGIC %run ../SETUP/_pyspark_clean_up 304 | 305 | # COMMAND ---------- 306 | 307 | # MAGIC %md 308 | # MAGIC ### Assignment 1 309 | # MAGIC 1. Add a new column price_usd to your dataframe df_raw. This column will save price of car in USD. price_usd = price * (0.0051); 310 | # MAGIC 2. Add one more column brand_name_upper, which will have brand_name in upper cases. 311 | # MAGIC 3. Select only brand_name_upper, color, price_usd columns to your final resultset. 312 | # MAGIC 313 | # MAGIC ### Assignment 2 314 | # MAGIC 1. Select all the Grey Honda Hatchback cars. 315 | 316 | # COMMAND ---------- 317 | 318 | 319 | -------------------------------------------------------------------------------- /PySpark_ETL/PS05-Handling JSON.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### JSON Handling 4 | # MAGIC 1. How to read simple & nested JSON. 5 | # MAGIC 2. How to create new columns using nested json 6 | # MAGIC 7 | # MAGIC ![PYSPARK_JSON](https://raw.githubusercontent.com/martandsingh/images/master/json_pyspark.png) 8 | 9 | # COMMAND ---------- 10 | 11 | # MAGIC %run ../SETUP/_pyspark_init_setup 12 | 13 | # COMMAND ---------- 14 | 15 | df = spark \ 16 | .read \ 17 | .option("multiline", "true")\ 18 | .json("/FileStore/datasets/used_cars_nested.json") 19 | 20 | display(df) 21 | 22 | # COMMAND ---------- 23 | 24 | # Imports 25 | from pyspark.sql.functions import explode, col 26 | 27 | # COMMAND ---------- 28 | 29 | df_exploded = df \ 30 | .withColumn("usedCars", explode(df["usedCars"])) 31 | 32 | display(df_exploded) 33 | 34 | # COMMAND ---------- 35 | 36 | # Now we will read JSON values and add new columns, later we will delete usedCars(Raw json) column as we do not need it. 37 | df_clean = df_exploded \ 38 | .withColumn("vehicle_type", col("usedCars")["@type"])\ 39 | .withColumn("body_type", col("usedCars")["bodyType"])\ 40 | .withColumn("brand_name", col("usedCars")["brand"]["name"])\ 41 | .withColumn("color", col("usedCars")["color"])\ 42 | .withColumn("description", col("usedCars")["description"])\ 43 | .withColumn("model", col("usedCars")["model"])\ 44 | .withColumn("manufacturer", col("usedCars")["manufacturer"])\ 45 | .withColumn("ad_title", col("usedCars")["name"])\ 46 | .withColumn("currency", col("usedCars")["priceCurrency"])\ 47 | .withColumn("seller_location", col("usedCars")["sellerLocation"])\ 48 | .withColumn("displacement", col("usedCars")["vehicleEngine"]["engineDisplacement"])\ 49 | .withColumn("transmission", col("usedCars")["vehicleTransmission"])\ 50 | .withColumn("price", col("usedCars")["price"]) \ 51 | .drop("usedCars") 52 | display(df_clean) 53 | 54 | # COMMAND ---------- 55 | 56 | # MAGIC %md 57 | # MAGIC So now we have our clean dataframe df_clean. So we saw how explode function create a row for each element of an array. In our case, our array had struct items. 
So each row created had a struct type item (df_exploded), which we later used to create new columns (df_clean). 58 | # MAGIC 59 | # MAGIC ### Explode 60 | # MAGIC The explode() method converts each element of the specified column(s) into a row. 61 | # MAGIC 62 | # MAGIC __Syntax:__ 63 | # MAGIC 64 | # MAGIC dataframe.explode(column, ignore_index) 65 | # MAGIC 66 | # MAGIC But there may some cases where we have a string type column which consist of JSON string. how to deal with it? Obviously, you can convert it to struct type and then follow the same process we did earlier. Apart from this, there is one more cleaner way to achieve this. 67 | # MAGIC 68 | # MAGIC We can use json_tuple, get_json_object functions. 69 | # MAGIC 70 | # MAGIC __get_json_object()__: This is used to query json object inline. 71 | # MAGIC 72 | # MAGIC __json_tuple()__: *We can use this if json has only one level of nesting* 73 | # MAGIC 74 | # MAGIC Confused????? Let's try with an example. 75 | 76 | # COMMAND ---------- 77 | 78 | from pyspark.sql.functions import get_json_object, json_tuple 79 | 80 | # COMMAND ---------- 81 | 82 | # lets create a test dataframe, which contains JSON string. Range function creates a simple dataframe with given number of rows. 83 | df_json_string = spark.range(1)\ 84 | .selectExpr(""" 85 | '{"Car" : {"Model" : ["i10", "i20", "Verna"], "Brand":"Hyundai" }}' as Cars 86 | """) 87 | 88 | display(df_json_string) 89 | 90 | # COMMAND ---------- 91 | 92 | # MAGIC %md 93 | # MAGIC Our task is to create a new dataframe with 2 columns: 94 | # MAGIC * Model - take only first item in array. Just for the sake of tutorial. It will tell you how to pick a specific item using json_tuple 95 | # MAGIC * Brand 96 | 97 | # COMMAND ---------- 98 | 99 | df_cars = df_json_string \ 100 | .withColumn("Brand", json_tuple(col("Cars"), "Car") )\ 101 | .withColumn("Model", get_json_object(col("Cars"), "$.Car.Model[1]")) 102 | 103 | display(df_cars) 104 | 105 | # COMMAND ---------- 106 | 107 | # MAGIC %run ../SETUP/_pyspark_clean_up 108 | 109 | # COMMAND ---------- 110 | 111 | # MAGIC %md 112 | # MAGIC ### Assignment 113 | # MAGIC 1. Download https://github.com/martandsingh/datasets/blob/master/person_details.json & try to read it using spark 114 | # MAGIC 2. Create a dataframe with column: name, age, cars(Array type), city, state, country 115 | 116 | # COMMAND ---------- 117 | 118 | 119 | -------------------------------------------------------------------------------- /PySpark_ETL/PS06-JOINS.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ### What are the joins? 5 | # MAGIC PySpark Join is used to combine two DataFrames and by chaining these you can join multiple DataFrames. In real life projects, you will be having more than one dataset or table (normalized data). To calculate or get final data sometime you may need to join different datasets. 6 | # MAGIC 7 | # MAGIC There are many types of joins are available, but we will discuss only most common table join types: 8 | # MAGIC 1. Inner join - returns record which are common in both the tables. 9 | # MAGIC 1. Left outer join - common records + unmatched records from left table (left side of join clause) 10 | # MAGIC 1. Right outer join - common records + unmatched records from right table 11 | # MAGIC 1. Cross join - Cartesian product of two tables. It will create MxN records (M - no of records in left table, N - no of records in right side table). 
12 | # MAGIC 1. Full outer join - all the records from both the tables except the common records. 13 | # MAGIC 14 | # MAGIC ![JOINS](https://raw.githubusercontent.com/martandsingh/images/master/joins.jpg) 15 | 16 | # COMMAND ---------- 17 | 18 | # MAGIC %run ../SETUP/_pyspark_init_setup 19 | 20 | # COMMAND ---------- 21 | 22 | # MAGIC %sql 23 | # MAGIC -- Here we are creating SQL tables. Do not worry about the code. We have a separate tutorial for this SQL_Refresher. Check out our githb url: https://github.com/martandsingh/ApacheSpark 24 | # MAGIC 25 | # MAGIC CREATE TABLE IF NOT EXISTS T1 26 | # MAGIC ( 27 | # MAGIC id VARCHAR(10) 28 | # MAGIC ); 29 | # MAGIC CREATE TABLE IF NOT EXISTS T2 30 | # MAGIC ( 31 | # MAGIC id VARCHAR(10) 32 | # MAGIC ); 33 | # MAGIC CREATE TABLE IF NOT EXISTS T3 34 | # MAGIC ( 35 | # MAGIC id VARCHAR(10) 36 | # MAGIC ); 37 | # MAGIC 38 | # MAGIC INSERT INTO T1 VALUES ('1'), ('2'), ('3'), ('4'), (NULL); 39 | # MAGIC INSERT INTO T2 VALUES ('1'), ('2'); 40 | # MAGIC INSERT INTO T3 VALUES ( '3'), ('4'), ('5'), (NULL); 41 | 42 | # COMMAND ---------- 43 | 44 | # convert tables to dataframe 45 | df_1 = spark.sql("SELECT * FROM T1"); 46 | df_2 = spark.sql("SELECT * FROM T2"); 47 | df_3 = spark.sql("SELECT * FROM T3"); 48 | 49 | # COMMAND ---------- 50 | 51 | display(df_1) 52 | display(df_2) 53 | display(df_3) 54 | 55 | # COMMAND ---------- 56 | 57 | # MAGIC %md 58 | # MAGIC ### INNER JOIN 59 | 60 | # COMMAND ---------- 61 | 62 | # INNER JOIN 63 | # This join will give only matching records. It will return only the records which are present on both the table. 64 | df_inner = df_1.join(df_2, df_1["id"]==df_2["id"], "inner") 65 | display(df_inner) 66 | 67 | # COMMAND ---------- 68 | 69 | # MAGIC %md 70 | # MAGIC ### LEFT OUTER JOIN 71 | 72 | # COMMAND ---------- 73 | 74 | # LEFT JOIN 75 | # It will return only the records which are present on both the table and all non-matching records from left table. 76 | df_left = df_1.join(df_2, df_1["id"]==df_2["id"], "left") 77 | display(df_left) 78 | 79 | # COMMAND ---------- 80 | 81 | # MAGIC %md 82 | # MAGIC ### RIGHT OUTER JOIN 83 | 84 | # COMMAND ---------- 85 | 86 | # RIGHT JOIN 87 | # It will return only the records which are present on both the table and all non-matching records from right table. 88 | df_right = df_1.join(df_3, df_1["id"]==df_3["id"], "right") 89 | display(df_right) 90 | 91 | # COMMAND ---------- 92 | 93 | # MAGIC %md 94 | # MAGIC ### FULL OUTER JOIN 95 | 96 | # COMMAND ---------- 97 | 98 | # FULL OUTER JOIN 99 | df_full = df_1.join(df_3, df_1["id"]==df_3["id"], "full") 100 | display(df_full) 101 | 102 | # COMMAND ---------- 103 | 104 | # MAGIC %md 105 | # MAGIC ### CROSS JOIN 106 | 107 | # COMMAND ---------- 108 | 109 | # CROSS JOIN, it will return cartesian product of both the tables 110 | df_cross = df_1.crossJoin(df_3) 111 | display(df_cross) 112 | 113 | # COMMAND ---------- 114 | 115 | # MAGIC %md 116 | # MAGIC ### Example 117 | # MAGIC 118 | # MAGIC Let's see an example using our sales order dataset. 119 | 120 | # COMMAND ---------- 121 | 122 | # We will use sales dataset. We have three tables: order list, order details & sales target. Let's load data first. Order list and details are linked using order id. Order details & Sales target are linked with category. 
123 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 124 | df_od = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderdetails.csv") 125 | df_st = spark.read.option("header", "true").csv("/FileStore/datasets/sales/salestarget.csv") 126 | 127 | # COMMAND ---------- 128 | 129 | display(df_ol.limit(3)) 130 | display(df_od.limit(3)) 131 | display(df_st.limit(3)) 132 | 133 | # COMMAND ---------- 134 | 135 | df_inner = df_ol \ 136 | .join(df_od, df_ol["Order Id"] == df_od["Order Id"], "inner")\ 137 | .select(df_ol["Order Id"], df_ol["Order Date"], df_od["Amount"], df_od["Profit"], df_od["Category"]) 138 | display(df_inner) 139 | 140 | # COMMAND ---------- 141 | 142 | # "left_outer" and "left" are aliases for the same join type 143 | df_left_outer = df_ol \ 144 | .join(df_od, df_ol["Order Id"] == df_od["Order Id"], "left_outer")\ 145 | .select(df_ol["Order Id"], df_ol["Order Date"], df_od["Amount"], df_od["Profit"], df_od["Category"]) 146 | display(df_left_outer) 147 | 148 | # COMMAND ---------- 149 | 150 | # MAGIC %run ../SETUP/_pyspark_clean_up 151 | 152 | # COMMAND ---------- 153 | 154 | 155 | -------------------------------------------------------------------------------- /PySpark_ETL/PS07-Grouping & Aggregation.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Group & Aggregation 4 | # MAGIC Similar to the SQL GROUP BY clause, the PySpark groupBy() function is used to collect identical data into groups on a DataFrame and perform aggregate functions on the grouped data. 5 | # MAGIC 6 | # MAGIC agg() is the PySpark method for applying aggregate functions to a DataFrame. It operates on a group of rows and returns one aggregated value per group (or a single result for the whole DataFrame when no grouping is applied). 7 | # MAGIC 8 | # MAGIC ![GROUPING](https://raw.githubusercontent.com/martandsingh/images/master/grouping.png) 9 | 10 | # COMMAND ---------- 11 | 12 | # MAGIC %run ../SETUP/_pyspark_init_setup 13 | 14 | # COMMAND ---------- 15 | 16 | df = spark.read.parquet('/FileStore/datasets/USED_CAR_PARQUET/') 17 | display(df) 18 | 19 | # COMMAND ---------- 20 | 21 | from pyspark.sql.functions import col 22 | 23 | # COMMAND ---------- 24 | 25 | # Group by body type. The query will return each body_type and the total count for that particular body type 26 | df_type = df.groupBy("body_type").count().orderBy(col("count").desc()) 27 | display(df_type) 28 | 29 | # COMMAND ---------- 30 | 31 | # Grouping using multiple columns. The query below will group your data by brand_name & body_type. e.g. how many Suzuki hatchbacks are there? 32 | display(df.groupBy("brand_name", "body_type" ).count().orderBy(col("count").desc())) 33 | 34 | # COMMAND ---------- 35 | 36 | # Average price of each brand name. Get the average price for each brand name 37 | df_avgPrice = df.groupBy("brand_name").mean("price").orderBy("avg(price)") 38 | display(df_avgPrice) 39 | 40 | # with this analysis we can see Daewoo & Suzuki are relatively cheaper cars, while MG, Haval & Kia are more expensive 41 | 42 | # COMMAND ---------- 43 | 44 | # Now let's check whether body type affects the price of a car, and which body types are cheaper or more expensive.
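# Note: mean("price") creates an aggregate column literally named avg(price), which is why the orderBy below refers to col("avg(price)").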
45 | df_body_price = df.groupBy("brand_name", "body_type").mean("price").orderBy(col("avg(price)").desc()) 46 | display(df_body_price) 47 | 48 | # so toyota SUV are generally expensive, daewoo & suzuki sedan are cheaper 49 | 50 | # COMMAND ---------- 51 | 52 | # MAGIC %md 53 | # MAGIC ### Agg function 54 | 55 | # COMMAND ---------- 56 | 57 | df_agg = df.agg({"brand_name":"count", "body_type":"count", "price": "avg"}) 58 | display(df_agg) 59 | 60 | # COMMAND ---------- 61 | 62 | df_agg = df.agg({"brand_name":"count", "body_type":"count", "price": "avg"}) \ 63 | .withColumnRenamed("avg(price)", "avg_price") \ 64 | .withColumnRenamed("count(body_type)", "total_types")\ 65 | .withColumnRenamed("count(brand_name)", "total_brands") 66 | display(df_agg) 67 | 68 | # COMMAND ---------- 69 | 70 | # MAGIC %run ../SETUP/_pyspark_clean_up 71 | -------------------------------------------------------------------------------- /PySpark_ETL/PS08-Ordering Data.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Ordering Data 4 | # MAGIC The __ORDER BY__ keyword is used to sort the records in ascending order by default. To sort the records in descending order, use the DESC keyword 5 | 6 | # COMMAND ---------- 7 | 8 | # MAGIC %run ../SETUP/_pyspark_init_setup 9 | 10 | # COMMAND ---------- 11 | 12 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 13 | display(df_ol) 14 | 15 | # COMMAND ---------- 16 | 17 | from pyspark.sql.functions import col 18 | 19 | # COMMAND ---------- 20 | 21 | # Sort in ascending order 22 | df_cust_order = df_ol.filter(col("CustomerName").isNotNull()).orderBy(col("CustomerName")) 23 | display(df_cust_order) 24 | 25 | # COMMAND ---------- 26 | 27 | # Sort in descending order 28 | df_cust_order = df_ol.filter(col("CustomerName").isNotNull()).orderBy(col("CustomerName").desc()) 29 | display(df_cust_order) 30 | 31 | # COMMAND ---------- 32 | 33 | # Sort with more than one column - ascending order 34 | df_sort = df_ol\ 35 | .filter(col("CustomerName").isNotNull())\ 36 | .orderBy([col("CustomerName"), col("State")], ascending=True) 37 | display(df_sort) 38 | 39 | # COMMAND ---------- 40 | 41 | # Sort with more than one column - descending order 42 | df_sort = df_ol\ 43 | .filter(col("CustomerName").isNotNull())\ 44 | .orderBy([col("CustomerName"), col("State")], ascending=False) 45 | display(df_sort) 46 | 47 | # COMMAND ---------- 48 | 49 | 50 | # Sort in ascending order. If column has null value you can control where to show null values. by default null value will always be on top. 51 | df_cust_order = df_ol.orderBy(col("CustomerName")) 52 | display(df_cust_order) 53 | 54 | # COMMAND ---------- 55 | 56 | # You can choose where to place null values using asc_nulls_last(place null at last), asc_nulls_first (place null at first) 57 | df_cust_order = df_ol.orderBy(col("CustomerName").asc_nulls_last()) 58 | display(df_cust_order) 59 | 60 | # COMMAND ---------- 61 | 62 | # MAGIC %run ../SETUP/_pyspark_clean_up 63 | -------------------------------------------------------------------------------- /PySpark_ETL/PS09-String Functions.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### String Functions 4 | # MAGIC 5 | # MAGIC In this demo we will learn basic string functions which we use in our daily life projects. 
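# COMMAND ----------

# MAGIC %md
# MAGIC Before loading the sales dataset, here is a small self-contained sketch (the tiny dataframe below is invented purely for illustration) of two handy string functions: __concat_ws__ (join several columns with a separator) & __regexp_extract__ (pull a pattern out of a string).

# COMMAND ----------

from pyspark.sql.functions import concat_ws, regexp_extract, col

# hypothetical sample rows, only to demonstrate the two functions
df_str_demo = spark.createDataFrame(
    [("Amit", "Sharma", "ORD-1023"), ("Neha", "Gupta", "ORD-2047")],
    ["first_name", "last_name", "order_code"])

df_str_demo = df_str_demo\
    .withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name")))\
    .withColumn("order_number", regexp_extract(col("order_code"), "ORD-(\\d+)", 1))

display(df_str_demo)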
6 | 7 | # COMMAND ---------- 8 | 9 | # MAGIC %run ../SETUP/_pyspark_init_setup 10 | 11 | # COMMAND ---------- 12 | 13 | df = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 14 | display(df) 15 | 16 | # COMMAND ---------- 17 | 18 | # MAGIC %md 19 | # MAGIC * Length 20 | # MAGIC * Case change & Initcap 21 | # MAGIC * Replace 22 | # MAGIC * Substring 23 | # MAGIC * Concatenation (concat, concat_ws) 24 | # MAGIC * Right/Left padding 25 | # MAGIC * String split 26 | # MAGIC * Trim 27 | # MAGIC * Repeat 28 | 29 | # COMMAND ---------- 30 | 31 | from pyspark.sql.functions import col, length, upper, lower, regexp_extract, regexp_replace, trim, repeat, substring, substring_index, concat_ws, concat, initcap, split, lpad, rpad 32 | 33 | # COMMAND ---------- 34 | 35 | #Add new column cust_length, the length of customer name 36 | df_trans = df.withColumn("cust_length", length(col("CustomerName")) ) 37 | display(df_trans) 38 | 39 | # COMMAND ---------- 40 | 41 | #Add new column cust_upper & cust_lower which will contain customer name in upper & lower case respectively. 42 | df_trans = df_trans \ 43 | .withColumn("cust_upper", upper("CustomerName") )\ 44 | .withColumn("cust_lower", lower("CustomerName") )\ 45 | .withColumn("cust_initcap", initcap("cust_lower") ) 46 | display(df_trans) 47 | 48 | # COMMAND ---------- 49 | 50 | #Add new column hidden_name which include customer name with "a" replaced as *. 51 | df_trans = df_trans \ 52 | .withColumn("hidden_name", regexp_replace(col("CustomerName"), "a", "*" )) 53 | display(df_trans) 54 | 55 | # COMMAND ---------- 56 | 57 | # get starting & last three characters of customer name 58 | df_trans = (df_trans 59 | .withColumn("customer_name_start", substring(col("CustomerName"), 1, 3)) 60 | .withColumn("customer_name_last", substring(col("CustomerName"), -3, 2)) 61 | ) 62 | display(df_trans) 63 | 64 | # COMMAND ---------- 65 | 66 | # Extract year from order date using split functions 67 | df_trans = df_trans.withColumn("Year", split(col("Order Date"), "-")[2] ) 68 | display(df_trans) 69 | 70 | # COMMAND ---------- 71 | 72 | # Extract year from order date using split functions 73 | df_trans = df_trans.withColumn("repeat_date", repeat("Order Date", 2 ) ) 74 | display(df_trans) 75 | 76 | # COMMAND ---------- 77 | 78 | df_trans = df_trans.withColumn("trim_name", trim("CustomerName") ) 79 | display(df_trans) 80 | 81 | # COMMAND ---------- 82 | 83 | df_trans = df_trans\ 84 | .withColumn("lpad_name", lpad("CustomerName", 20, "$") )\ 85 | .withColumn("rpad_name", rpad("CustomerName", 20, "#") ) 86 | display(df_trans) 87 | 88 | # COMMAND ---------- 89 | 90 | # MAGIC %run ../SETUP/_pyspark_clean_up 91 | -------------------------------------------------------------------------------- /PySpark_ETL/PS10-Date & Time Functions.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Date Time Functions 4 | # MAGIC Date is an important data type when it comes to reporting. In most of the time, reports are arranged based on date, month, year or a particular amount of time. To answer all those questions we will see how can we use date & time in pyspark. 
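# MAGIC
# MAGIC For example, once a column has been converted to a real date type, date_format() can render it back into whatever string pattern a report needs (a small sketch, assuming a date column named order_date):
# MAGIC
# MAGIC df.withColumn("order_month_label", date_format("order_date", "MMM-yyyy"))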
5 | # MAGIC 6 | # MAGIC ![DATETIME](https://raw.githubusercontent.com/martandsingh/images/master/datetime.jpg) 7 | 8 | # COMMAND ---------- 9 | 10 | # MAGIC %run ../SETUP/_pyspark_init_setup 11 | 12 | # COMMAND ---------- 13 | 14 | df = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 15 | display(df) 16 | 17 | # COMMAND ---------- 18 | 19 | # MAGIC %md 20 | # MAGIC * Current data time 21 | # MAGIC * Convert string to date 22 | # MAGIC * Extract date part e.g. day, month, year, week 23 | # MAGIC * convert to/from unixtimestamp 24 | # MAGIC * Change date format 25 | # MAGIC * Date difference 26 | # MAGIC * Add/Subtract date 27 | 28 | # COMMAND ---------- 29 | 30 | from pyspark.sql.functions import current_date, current_timestamp, to_date, col, dayofmonth, month, year, quarter, date_add, date_sub, date_trunc, add_months, months_between, datediff 31 | 32 | # COMMAND ---------- 33 | 34 | df_trans = df \ 35 | .withColumn("current_date", current_date())\ 36 | .withColumn("current_timestamp", current_timestamp()) 37 | 38 | display(df_trans) 39 | 40 | # COMMAND ---------- 41 | 42 | # We can see order date is string type. 43 | df.printSchema() 44 | 45 | # COMMAND ---------- 46 | 47 | # Let's convert Order Date column to date 48 | df_trans = df_trans.withColumn("order_date", to_date(col("Order Date"), "dd-MM-yyyy")) 49 | display(df_trans) 50 | 51 | # COMMAND ---------- 52 | 53 | df_trans.printSchema() 54 | 55 | # COMMAND ---------- 56 | 57 | # Add new column Day, Month & Year from order_date column 58 | df_trans = df_trans\ 59 | .withColumn("Day", dayofmonth("order_date") )\ 60 | .withColumn("Month", month("order_date") )\ 61 | .withColumn("Year", year("order_date") )\ 62 | .withColumn("Quarter", quarter("order_date") ) 63 | 64 | display(df_trans) 65 | 66 | # COMMAND ---------- 67 | 68 | # Add and Subtract days in order day 69 | df_trans = df_trans\ 70 | .withColumn("order_next_10_days", date_add(col("order_date"), 10))\ 71 | .withColumn("order_prev_10_days", date_sub(col("order_date"), 10))\ 72 | .withColumn("order_add_months", add_months(col("order_date"), 2))\ 73 | .withColumn("date_dff", datediff(col("order_next_10_days"), col("order_date")) ) 74 | 75 | 76 | display(df_trans) 77 | 78 | # COMMAND ---------- 79 | 80 | # MAGIC %run ../SETUP/_pyspark_clean_up 81 | -------------------------------------------------------------------------------- /PySpark_ETL/PS11-Partitioning & Repartitioning.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC 4 | # MAGIC ### Partitioning 5 | # MAGIC In the world of big data, partitioning is an extremely important concept. As name suggest, partitioning means dividing your data into smaller parts based on a partition key. You can also use multiple keys to partition your data. 6 | # MAGIC 7 | # MAGIC we use partitionBy() to parition our data. Partition means when you choose a partition key, you data is divided into smaller parts based on that key & it will store your data into subfolders. example: 8 | # MAGIC 9 | # MAGIC If you have 1 billion rows for 1000 users. Everytime when you use filter based on userid (WHERE userid = 'abc'), the executor will scan whole data. Let say you choose userid as partition key, it will divide or partition your data into 1000 sub folders(as we have 1000 unique users). Now the query (WHERE useri='abc') will scan only once folder which contains abc records. 
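# MAGIC
# MAGIC In code, that partitioned write would look something like this (a sketch, assuming a hypothetical df_users dataframe with a userid column):
# MAGIC
# MAGIC df_users.write.partitionBy("userid").mode("overwrite").parquet("/FileStore/output/users_part/")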
10 | # MAGIC 11 | # MAGIC ### How to choose a partition key? 12 | # MAGIC 13 | # MAGIC Let's see a demo. 14 | # MAGIC 15 | # MAGIC ![Partition](https://raw.githubusercontent.com/martandsingh/images/master/partitioning.png) 16 | # MAGIC 17 | # MAGIC ### How to decide number of partitions in Spark? 18 | # MAGIC In Spark, one should carefully choose the number of partitions depending on the cluster design and application requirements. The best technique to determine the number of spark partitions in an RDD is to multiply the number of cores in the cluster with the number of partitions. 19 | # MAGIC 20 | # MAGIC ### How do I create a partition in Spark? 21 | # MAGIC In Spark, you can create partitions in two ways - 22 | # MAGIC 1. Repartition - used to increase and decrease the partitions. Results in more or less equal sized partitions. Since a full shuffle takes place, repartition is less performant than coalesce. Repartition always involves a shuffle. 23 | # MAGIC 1. Coalesce - used to decrease the partition. It creates unequal partitions. Faster than repartition but query performance an be slower. Coalesce doesn’t involve a full shuffle. 24 | # MAGIC 25 | # MAGIC By invoking partitionBy method on an RDD, you can provide an explicit partitioner, 26 | 27 | # COMMAND ---------- 28 | 29 | # MAGIC %run ../SETUP/_pyspark_init_setup 30 | 31 | # COMMAND ---------- 32 | 33 | from pyspark.sql.types import StructField, StructType, StringType, DecimalType, IntegerType 34 | 35 | # COMMAND ---------- 36 | 37 | # We are using a game steam dataset. 38 | custom_schema = StructType( 39 | [ 40 | StructField("gamer_id", IntegerType(), True), 41 | StructField("game", StringType(), True), 42 | StructField("behaviour", StringType(), True), 43 | StructField("play_hours", DecimalType(), True), 44 | StructField("rating", IntegerType(), True) 45 | ]) 46 | df = spark.read.option("header", "true").schema(custom_schema).csv('/FileStore/datasets/steam-200k.csv') 47 | display(df) 48 | 49 | # COMMAND ---------- 50 | 51 | df.count() 52 | 53 | # COMMAND ---------- 54 | 55 | df.select("game").distinct().count() 56 | 57 | # COMMAND ---------- 58 | 59 | df.rdd.getNumPartitions() 60 | 61 | # COMMAND ---------- 62 | 63 | # We are using a game steam dataset. It will partition your data into default values of partitions which is 3 in my case you can check using df.rdd.getNumPartitions(). This process will be quicker as we do not have any partition key so spark does not have to sort and partition data based on key. 64 | 65 | df.write.mode("overwrite").parquet("/FileStore/output/gamelogs_unpart") 66 | 67 | # COMMAND ---------- 68 | 69 | display(dbutils.fs.ls('/FileStore/output/gamelogs_unpart')) 70 | 71 | # COMMAND ---------- 72 | 73 | # We are using a game steam dataset. This will create multiple folders based on game names. we have 5155 unique game, it will create 5155 folders. This process will take longer time to execute. 
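# Each distinct value of the partition column becomes its own sub-folder (named like game=<game name>/),
# so later filters on "game" can prune those folders and read only the matching data instead of scanning every file.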
74 | 75 | df.write.partitionBy("game").mode("overwrite").parquet("/FileStore/output/gamelogs_part") 76 | 77 | # COMMAND ---------- 78 | 79 | df_files = dbutils.fs.ls('/FileStore/output/gamelogs_part') 80 | type(df_files) 81 | 82 | # COMMAND ---------- 83 | 84 | len(df_files) # So we can see we have 5156 (5155 for games, 1 for log) 85 | 86 | # COMMAND ---------- 87 | 88 | # Lets read from our partition data 89 | df_game = spark.read.parquet("/FileStore/output/gamelogs_part/") 90 | display(df_game) 91 | 92 | # COMMAND ---------- 93 | 94 | from pyspark.sql.functions import col 95 | 96 | # COMMAND ---------- 97 | 98 | display(df_game.filter( (col("game") == "Dota 2") & (col("behaviour") == "purchase") & (col("play_hours") == 1 ) )) 99 | 100 | # COMMAND ---------- 101 | 102 | display(df.filter( (col("game") == "Dota 2") & (col("behaviour") == "purchase") & (col("play_hours") == 1 ) )) 103 | 104 | # COMMAND ---------- 105 | 106 | # MAGIC %md 107 | # MAGIC ### Repartition 108 | 109 | # COMMAND ---------- 110 | 111 | df_game.rdd.getNumPartitions() 112 | 113 | # COMMAND ---------- 114 | 115 | from pyspark.sql.functions import spark_partition_id 116 | 117 | # COMMAND ---------- 118 | 119 | # Add a new column which include partition id. Count total number of record in each partition. It is recommended to have almost equal number of records in ech partition. 120 | 121 | display( df_game.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().orderBy("count")) 122 | 123 | 124 | # COMMAND ---------- 125 | 126 | # repartition() is used to increase or decrease the RDD, DataFrame, Dataset partitions 127 | # If we want to reduce or increade the paritition size we can use it. 128 | df_repart = df_game.repartition(40) 129 | display(df_repart.limit(10)) 130 | 131 | # COMMAND ---------- 132 | 133 | df_repart.rdd.getNumPartitions() # 40 partitions as we did 134 | 135 | # COMMAND ---------- 136 | 137 | # we can also increase partition 138 | df_repart2 = df_game.repartition(80) 139 | display(df_repart2.limit(10)) 140 | 141 | # COMMAND ---------- 142 | 143 | # MAGIC %md 144 | # MAGIC ### coalesce() 145 | # MAGIC It is only used to reduce the partition. This is optimized or improved version of repartition() where the movement of the data across the partitions is lower using coalesce. 146 | 147 | # COMMAND ---------- 148 | 149 | df_col2 = df_game.coalesce(10) 150 | display(df_col2) 151 | 152 | # COMMAND ---------- 153 | 154 | df_col.rdd.getNumPartitions()# 10 partitions as we did 155 | 156 | # COMMAND ---------- 157 | 158 | 159 | 160 | # COMMAND ---------- 161 | 162 | # if you will try to increase the partition using coalesce, it will throw error. 163 | df_col3 = df_col.coalesce(1000) 164 | display(df_col3) 165 | 166 | # COMMAND ---------- 167 | 168 | df_col3.rdd.getNumPartitions() # 332 partitions which is the original value. we cannot increase it 169 | 170 | # COMMAND ---------- 171 | 172 | # MAGIC %run ../SETUP/_pyspark_clean_up 173 | -------------------------------------------------------------------------------- /PySpark_ETL/PS12-Missing Value Handling.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Handling Missing Values 4 | # MAGIC In real life, data is not fine & curated. It is ambigous, dirty. It may contains null values or invalid values. for example. You have a hospital dataset, the age of patient is 1000. 
I am not sure about other planets but on earth it is impossible for a person to live 1000 years. So clearly it is a mistake. There can be case where firstname of patient is null. 5 | # MAGIC 6 | # MAGIC So you will see these kind of issues in raw data. These values can hamper your analysis. So we have to find and fix those values. 7 | # MAGIC 8 | # MAGIC ![MISSING_VALUES](https://raw.githubusercontent.com/martandsingh/images/master/missing-values.png) 9 | # MAGIC 10 | # MAGIC 11 | # MAGIC There are many ways to handle these values: 12 | # MAGIC 13 | # MAGIC 1. Drop - We can drop the entire row if we find any null value. But as per my experience I do not prefer this method as sometime you delete important information or anamolies if you use this method. 14 | # MAGIC 15 | # MAGIC 1. Fill - Here we try to fill our null values with some valid values. These values can be mean, mode, median or some other logic you define. The concept it to replace null values with most common or likely value. In this way you do not lose the entire row. 16 | # MAGIC 17 | # MAGIC 1. Replace - Replace is more flexible option than fill. It can do the operation with fill() does but apart from that it is also helpful when you want to replace a string with another. 18 | # MAGIC 19 | # MAGIC ### What is Imputation? 20 | # MAGIC The process of preserving all cases by replacing missing data with an estimated value based on other available information is called imputation. Fill() & Replace() are used for imputation. 21 | # MAGIC 22 | # MAGIC *Let's not go deep in theory and see the action. 23 | 24 | # COMMAND ---------- 25 | 26 | # MAGIC %run ../SETUP/_pyspark_init_setup 27 | 28 | # COMMAND ---------- 29 | 30 | from pyspark.sql.functions import col 31 | 32 | # COMMAND ---------- 33 | 34 | df = spark.read.option("header", "true").csv('/FileStore/datasets/missing_val_dataset.csv') 35 | 36 | # COMMAND ---------- 37 | 38 | display(df) 39 | 40 | # COMMAND ---------- 41 | 42 | df.printSchema() 43 | 44 | # COMMAND ---------- 45 | 46 | # MAGIC %md 47 | # MAGIC ## Check Null values count 48 | 49 | # COMMAND ---------- 50 | 51 | 52 | #Lets check null values for one column. 
It will give you 53 | null_count = df.filter(col("cloud").isNull()).count() # count of all the null values 54 | not_null_count = df.filter(col("vintage").isNotNull()).count() # count of not null values 55 | print("Total null values: ", null_count) 56 | print("Total not null values: ", not_null_count) 57 | 58 | # COMMAND ---------- 59 | 60 | # You can get same result using sql expression 61 | null_count_sql = df.filter("cloud IS NULL").count() # count of all the null values 62 | not_null_count_sql = df.filter("vintage IS NOT NULL").count() # count of not null values 63 | print("Total null values: ", null_count_sql) 64 | print("Total not null values: ", not_null_count_sql) 65 | 66 | # COMMAND ---------- 67 | 68 | # Check null count based on multiple columns 69 | 70 | null_count_ml = df.filter(col("cloud").isNull() & col("vintage").isNull() ).count() # count of all the null values 71 | not_null_count_ml = df.filter(col("cloud").isNotNull() & col("cloud").isNotNull() ).count() # count of not null values 72 | print("Total null values: ", null_count_ml) 73 | print("Total not null values: ", not_null_count_ml) 74 | 75 | # COMMAND ---------- 76 | 77 | display(df.filter(col("cloud").isNull())) 78 | 79 | # COMMAND ---------- 80 | 81 | df.count() 82 | 83 | # COMMAND ---------- 84 | 85 | from pyspark.sql.functions import count, when, isnan, lower 86 | 87 | # COMMAND ---------- 88 | 89 | # Here we can see we will not catch 'NA' values from vintage as NA is not a valid null character. To catch that we can 90 | df_null = df.select([count(when( isnan(c) | col(c).isNull() , c)).alias(c) for c in df.columns]) 91 | display(df_null) 92 | 93 | # COMMAND ---------- 94 | 95 | # So to catch NA we added one more condition. Now in our below condition we are considering "NA" as invalid values. 96 | df_null = df.select([count(when( isnan(c) | col(c).isNull() | (col(c) == "NA") , c)).alias(c) for c in df.columns]) 97 | display(df_null) 98 | 99 | # COMMAND ---------- 100 | 101 | # MAGIC %md 102 | # MAGIC Fill(), DROP() & REPLACE() function will handle only system null values. So we should replace our "NA" values with system NULL. Below code is doing the same thing. It is replacn "NA" with None. 103 | 104 | # COMMAND ---------- 105 | 106 | # Here we are replacing all the NA values to null as NA values are not valid missing value. So to process them we are changing them to system null values. 107 | df_trans = df.withColumn("vintage", when(col("vintage") == "NA", None) 108 | .otherwise(col("vintage")) ) 109 | 110 | display(df_trans) 111 | 112 | # COMMAND ---------- 113 | 114 | # Now let's check null value counts after reaplcing "NA" with None. 115 | df_null = df_trans.select([count(when( isnan(c) | col(c).isNull() , c)).alias(c) for c in df_trans.columns]) 116 | display(df_null) 117 | 118 | # COMMAND ---------- 119 | 120 | # MAGIC %md 121 | # MAGIC ### Handle Missing Values 122 | # MAGIC There are multiple way to deal with missing values. 123 | 124 | # COMMAND ---------- 125 | 126 | # MAGIC %md 127 | # MAGIC ### Way1 : Drop 128 | # MAGIC 129 | # MAGIC Drop rows with null values. 130 | # MAGIC 131 | # MAGIC It has 2 options __Any__ and __All__. 132 | # MAGIC 133 | # MAGIC In "Any" row is deleted, if any one of all the columns has a null or invalid values. In contrary, "All" will delete row only when all the columns has null values. You can also use subset if the columns. 134 | # MAGIC 135 | # MAGIC *Let's check with examples... 
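# COMMAND ----------

# (Illustrative sketch only.) Besides how="any"/"all", na.drop() also accepts a "thresh" argument:
# keep a row only if it has at least `thresh` non-null values. This is a middle ground between
# "any" and "all", tolerating a few missing columns per row while still dropping mostly-empty rows.
print(df_trans.na.drop(thresh=4).count())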
136 | 137 | # COMMAND ---------- 138 | 139 | # before dropping lets count the total number of rows. 140 | total_count=df_trans.count() # total count 141 | print(total_count) 142 | 143 | # COMMAND ---------- 144 | 145 | # it will drop complete row if any one column has null value. It will not NA value rows from vintage column 146 | # Using any deleted 59 columns out of 61. 147 | dropna_any = df_trans.na.drop(how="any") 148 | display(dropna_any) 149 | 150 | # COMMAND ---------- 151 | 152 | # it will drop complete row if all the column has null value. We can see this statement did not delete any records as there are no records with all column null. It did not delete any as there are no rows with all null values. 153 | df.na.drop(how="all").count() 154 | 155 | # COMMAND ---------- 156 | 157 | 158 | 159 | # COMMAND ---------- 160 | 161 | # Take null value count in variable so that we can verify later with our drop() function. 162 | cld_all_null = (df_trans.filter("cloud IS NULL AND vintage IS NULL")).count() 163 | print(cld_all_null) 164 | 165 | cld_any_null = (df_trans.filter("cloud IS NULL OR vintage IS NULL")).count() 166 | print(cld_any_null) 167 | 168 | # COMMAND ---------- 169 | 170 | # apply drop to few columns. This will delete rows where either cloud or vintage is null. This will delte 59 rows, you can confirm with variable cld_any_null 171 | df_trans.na.drop(how="any", subset=["cloud", "vintage"]).count() 172 | # this count is after deleting rows. total_rows - cld_any_null 173 | 174 | # COMMAND ---------- 175 | 176 | # apply drop to few columns. This will delete rows where cloud and vintage both columns are null. It will delete 8 rows as you can confirm with variable cld_all_null value. 177 | df_trans.na.drop(how="all", subset=["cloud", "vintage"]).count() 178 | # this count is after deleting rows. total_rows - cld_all_null 179 | 180 | # COMMAND ---------- 181 | 182 | # MAGIC %md 183 | # MAGIC ### Way 2: Fill 184 | # MAGIC Now let's see how to fill null value. This is an imputation technique. 185 | 186 | # COMMAND ---------- 187 | 188 | # Fill function is used to fill null values. Below code will replace all the null values in all the columns with NULL_VAL_REPLACE value. You can use any custom value. Just take care of column length and type. 189 | 190 | display(df_trans.na.fill("NULL_VAL_REPLACE").select("cloud", "vintage")) 191 | 192 | # COMMAND ---------- 193 | 194 | # You can use subset if you want to fill missing values only in few columns. 195 | display(df_trans.na.fill("MISSING_VALUE", subset=["cloud", "vintage"])) 196 | 197 | # COMMAND ---------- 198 | 199 | # You can define new value for each column. Below code will replace all the null values in cloud with NEW_VALUE & vintage with 2022. 200 | 201 | missing_values = { 202 | "cloud": "NEW_VALUE", 203 | "vintage": "2022" 204 | } 205 | display(df_trans.na.fill(missing_values)) 206 | 207 | # COMMAND ---------- 208 | 209 | # MAGIC %md 210 | # MAGIC ### Way 3: REPLACE 211 | 212 | # COMMAND ---------- 213 | 214 | display(df_trans.na.replace("GloBI", "NEW_GLOBI_NAME")) 215 | 216 | # COMMAND ---------- 217 | 218 | # MAGIC %run ../SETUP/_pyspark_clean_up 219 | -------------------------------------------------------------------------------- /PySpark_ETL/PS13-Deduplication.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Deduplication 4 | # MAGIC Duplicate rows are a big issue in the world of big data. 
It can not only affect you analysis but also takes extra storage which in result cost you more. 5 | # MAGIC 6 | # MAGIC __Deduplication__ is the process of removing duplicate data from your dataset. 7 | # MAGIC 8 | # MAGIC So it is important to find out & cure duplicate rows. There may be case where you want to drop duplicate data (but not always). 9 | # MAGIC 10 | # MAGIC We will use: 11 | # MAGIC 1. Distinct - It gives you distinct resultset which mean all the rows are unique. There are no duplicate rows. 12 | # MAGIC 1. DropDuplicates() - This function will help you to drop duplicate rows from your dataset. 13 | # MAGIC 14 | # MAGIC ![DUPLICATE_DATA](https://raw.githubusercontent.com/martandsingh/images/master/duplicate.jpg) 15 | 16 | # COMMAND ---------- 17 | 18 | # MAGIC %run ../SETUP/_pyspark_init_setup 19 | 20 | # COMMAND ---------- 21 | 22 | from pyspark.sql.types import StructField, StructType, StringType, DecimalType, IntegerType 23 | 24 | # COMMAND ---------- 25 | 26 | # We are using a game steam dataset. 27 | custom_schema = StructType( 28 | [ 29 | StructField("gamer_id", IntegerType(), True), 30 | StructField("game", StringType(), True), 31 | StructField("behaviour", StringType(), True), 32 | StructField("play_hours", DecimalType(), True), 33 | StructField("rating", IntegerType(), True) 34 | ]) 35 | df = spark.read.option("header", "true").schema(custom_schema).csv('/FileStore/datasets/steam-200k.csv') 36 | display(df) 37 | 38 | # COMMAND ---------- 39 | 40 | # MAGIC %md 41 | # MAGIC ### DISTINCT() 42 | 43 | # COMMAND ---------- 44 | 45 | # Lets check if our recordset has any duplicate rows 46 | total_count = df.count() 47 | distinct_count = df.distinct().count() 48 | print('total: ', total_count) 49 | print('distinct: ', distinct_count) 50 | print('duplicate: ', total_count-distinct_count) 51 | # so you can see we have few duplicate records. Keep in mind these duplicate records are comparing whole row which mean there exist two or more rows which have exact same value for all the columns in the table. 52 | 53 | # COMMAND ---------- 54 | 55 | # Lets find out duplicate values for specific column or multiple columns 56 | # lets find all the distinct game names. there are 5155 distinct games in our dataset. 57 | print(df.select("game").distinct().count()) 58 | 59 | 60 | # COMMAND ---------- 61 | 62 | # Find distinct record based on game & behaviour. 63 | distinct_selected_column= df.select("game", "behaviour").distinct().count() 64 | print(distinct_selected_column) 65 | 66 | # COMMAND ---------- 67 | 68 | # So we saw how we can find distinct records based on few columns and all the columns. Now let's see how to drop duplicates values from our dataframe. 69 | 70 | # COMMAND ---------- 71 | 72 | # MAGIC %md 73 | # MAGIC ### DropDuplicates() 74 | 75 | # COMMAND ---------- 76 | 77 | # you can match this count with our distinct_count variable. 78 | df_distinct = df.drop_duplicates() 79 | df_distinct.count() 80 | 81 | # COMMAND ---------- 82 | 83 | # Drop duplicates on selected column. It will check the duplicate rows based on combination of the given columns. it should match our distinct_selected_column value. 
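# Note: drop_duplicates() is an alias of dropDuplicates(). When a subset of columns is given, only those
# columns are compared; the values kept for the remaining columns come from whichever duplicate row Spark
# retains, which is not guaranteed to be any particular one unless the data is sorted first.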
84 | df_drop_selected = df.drop_duplicates(["game", "behaviour"]) 85 | df_drop_selected.count() 86 | 87 | # COMMAND ---------- 88 | 89 | # MAGIC %run ../SETUP/_pyspark_clean_up 90 | 91 | # COMMAND ---------- 92 | 93 | 94 | -------------------------------------------------------------------------------- /PySpark_ETL/PS14-Data Profiling using PySpark.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### What is Data Profiling? 4 | # MAGIC Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. Data profiling produces critical insights into data that companies can then leverage to their advantage. 5 | # MAGIC 6 | # MAGIC In this notebook, we will learn few methods to generate our data profile. There are many application available in the market which can help you with data profiling. My main motto of this notebook is to explain how can anyone perform data profiling without purchasing third-party softwares. 7 | # MAGIC 8 | # MAGIC Also if you understand the correct concept, then you may design your own custom data quality checks which may not be available in any other softwares, As different organization has different business rules. No single software can cover all the requirement, so it is better to know some core concepts. 9 | # MAGIC 10 | # MAGIC Data profling is a part of Data Quality Checks. If you want to know more about data quality, you can refer: 11 | # MAGIC 12 | # MAGIC https://www.marketingevolution.com/marketing-essentials/data-quality 13 | # MAGIC 14 | # MAGIC ![DQC](https://raw.githubusercontent.com/martandsingh/images/master/dqc.jpg) 15 | 16 | # COMMAND ---------- 17 | 18 | # MAGIC %run ../SETUP/_pyspark_init_setup 19 | 20 | # COMMAND ---------- 21 | 22 | # MAGIC %md 23 | # MAGIC ### Create Delta Table 24 | # MAGIC Let's create a delta table to store our data profiling. We can create table using SQL command. You can write SQL code in the notebook using %sql keyword. 25 | 26 | # COMMAND ---------- 27 | 28 | # MAGIC %sql 29 | # MAGIC -- Create a new database for DQC. 
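# MAGIC -- IF NOT EXISTS makes this idempotent, so re-running the notebook will not fail if the database already exists.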
30 | # MAGIC CREATE DATABASE IF NOT EXISTS DB_DQC; 31 | # MAGIC USE DB_DQC; 32 | 33 | # COMMAND ---------- 34 | 35 | # MAGIC %sql 36 | # MAGIC --Lets create a delta table to store our profiling data 37 | # MAGIC CREATE OR REPLACE TABLE data_profiling 38 | # MAGIC ( 39 | # MAGIC dqc_name VARCHAR(100), 40 | # MAGIC stage VARCHAR(20), 41 | # MAGIC db_name VARCHAR(50), 42 | # MAGIC table_name VARCHAR(50), 43 | # MAGIC column_name VARCHAR(50), 44 | # MAGIC dqc_value VARCHAR(50), 45 | # MAGIC query VARCHAR(2000), 46 | # MAGIC description VARCHAR(100), 47 | # MAGIC created_on TIMESTAMP 48 | # MAGIC ); 49 | 50 | # COMMAND ---------- 51 | 52 | df = spark.read.parquet('/FileStore/datasets/USED_CAR_PARQUET/') 53 | display(df.limit(4)) 54 | 55 | # COMMAND ---------- 56 | 57 | from pyspark.sql.functions import col, length 58 | from datetime import datetime 59 | 60 | # COMMAND ---------- 61 | 62 | # get list of column & its type 63 | def get_column_types(df_source): 64 | dict_types= {} 65 | for column in df.schema.fields: 66 | dict_types[column.name] = str(column.dataType) 67 | return dict_types 68 | 69 | # Return dictionary with missing columns & count 70 | def get_null_check(df_source): 71 | list_cols = df_source.columns 72 | dict_null = {} 73 | for column in list_cols: 74 | null_count=(df.filter(col(column).isNull()).count()) 75 | dict_null[column] = null_count 76 | return dict_null 77 | 78 | # get min value only for numeric columns 79 | def get_min_val(df_source): 80 | data = {} 81 | for column in df_source.columns: 82 | dtype = str(df.schema[column].dataType) 83 | #print(dtype) 84 | if dtype.lower() in ["longtype", "inttype", "decimaltype", "floattype"]: 85 | min_val = df_source.agg({column: "min"}).collect()[0]["min("+column+")"] 86 | data[column] = str(min_val) 87 | else: 88 | data[column]="NA" 89 | return data 90 | 91 | # get average value only for numeric columns 92 | def get_avg_val(df_source): 93 | data = {} 94 | for column in df_source.columns: 95 | dtype = str(df.schema[column].dataType) 96 | #print(dtype) 97 | if dtype.lower() in ["longtype", "inttype", "decimaltype", "floattype"]: 98 | avg_val = df_source.agg({column: "avg"}).collect()[0]["avg("+column+")"] 99 | data[column] = str(avg_val) 100 | else: 101 | data[column]="NA" 102 | return data 103 | 104 | 105 | # get max value only for numeric columns 106 | def get_max_val(df_source): 107 | data = {} 108 | for column in df_source.columns: 109 | dtype = str(df.schema[column].dataType) 110 | #print(dtype) 111 | if dtype.lower() in ["longtype", "inttype", "decimaltype", "floattype"]: 112 | max_val = df_source.agg({column: "max"}).collect()[0]["max("+column+")"] 113 | data[column] = str(max_val) 114 | else: 115 | data[column]="NA" 116 | return data 117 | 118 | # get max length of the value in a string type column 119 | def get_max_length_val(df_source): 120 | data = {} 121 | for column in df_source.columns: 122 | dtype = str(df.schema[column].dataType) 123 | #print(dtype) 124 | if dtype.lower() in ["stringtype"]: 125 | df_len = df.withColumn("length", length(col(column))).select(column, "length") 126 | max_length = df_len.agg({"length":"max"}).collect()[0]["max(length)"] 127 | 128 | data[column] = str(max_length) 129 | else: 130 | data[column]="NA" 131 | return data 132 | 133 | 134 | # get min length of the value in a string type column 135 | def get_min_length_val(df_source): 136 | data = {} 137 | for column in df_source.columns: 138 | dtype = str(df.schema[column].dataType) 139 | #print(dtype) 140 | if dtype.lower() in ["stringtype"]: 141 | 
df_len = df.withColumn("length", length(col(column))).select(column, "length") 142 | min_length = df_len.agg({"length":"min"}).collect()[0]["min(length)"] 143 | 144 | data[column] = str(min_length) 145 | else: 146 | data[column]="NA" 147 | return data 148 | 149 | #get count of values matching for the given regex. In our case we are counting values with special characters. 150 | def special_character_check(df_source): 151 | data = {} 152 | for column in df_source.columns: 153 | dtype = str(df.schema[column].dataType) 154 | #print(dtype) 155 | if dtype.lower() in ["stringtype"]: 156 | special_character_count = df.filter(col(column).rlike("[()`~/\!@#$%^&*()']")).count() 157 | data[column] = str(special_character_count) 158 | else: 159 | data[column]="NA" 160 | return data 161 | 162 | #Pass the required parameter and it will log DQC stats to delta table. 163 | def log_dqc_checks(dict_dqc, dqc_name, db_name, table_name, stage, query="", description=""): 164 | data =[] 165 | for key in dict_dqc: 166 | dict_result = { 167 | "dqc_name" : dqc_name, 168 | "stage" : stage , 169 | "db_name" : db_name, 170 | "table_name" :table_name, 171 | "column_name": key, 172 | "dqc_value" : dict_dqc[key], 173 | "query" : query, 174 | "description": description, 175 | "created_on" : datetime.now() 176 | } 177 | data.append(dict_result) 178 | df = spark.createDataFrame(data) 179 | df_sort = df \ 180 | .select("dqc_name", "stage", "db_name", "table_name", "column_name", "dqc_value", "query", "description", "created_on") 181 | df_sort.write.insertInto("DB_DQC.data_profiling", overwrite=False) 182 | 183 | 184 | # COMMAND ---------- 185 | 186 | def dqc_pipeline(df_source): 187 | DB_NAME = "DB_DQC" 188 | TABLE_NAME = "data_profiling" 189 | STAGE = "PRE_ETL" 190 | 191 | print("Column type dqc in progress...") 192 | dict_types = get_column_types(df_source) 193 | log_dqc_checks(dict_types, "COLUMN_TYPE_DQC", DB_NAME, TABLE_NAME, STAGE) 194 | print("Column type dqc completed.") 195 | 196 | # Log NULL CHECK DQC 197 | print("Missing value dqc in progress...") 198 | dict_null = get_null_check(df_source) 199 | log_dqc_checks(dict_null, "NULL_COUNT_DQC", DB_NAME, TABLE_NAME, STAGE) 200 | print("Missing value dqc completed.") 201 | 202 | print("Minimum value dqc in progress...") 203 | dict_min = get_min_val(df_source) 204 | log_dqc_checks(dict_min, "MIN_VAL_DQC", DB_NAME, TABLE_NAME, STAGE) 205 | print("Minimum value dqc completed.") 206 | 207 | print("Maximum value dqc in progress...") 208 | dict_max = get_max_val(df_source) 209 | log_dqc_checks(dict_max, "MAX_VAL_DQC", DB_NAME, TABLE_NAME, STAGE) 210 | print("Maximum value dqc completed.") 211 | 212 | print("Average value dqc in progress...") 213 | dict_avg = get_avg_val(df_source) 214 | log_dqc_checks(dict_max, "AVG_VAL_DQC", DB_NAME, TABLE_NAME, STAGE) 215 | print("Average value dqc completed.") 216 | 217 | print("Maximum length dqc in progress...") 218 | dict_max_length = get_max_length_val(df_source) 219 | log_dqc_checks(dict_max_length, "MAX_LENGTH_DQC", DB_NAME, TABLE_NAME, STAGE) 220 | print("Maximum length dqc completed.") 221 | 222 | print("Minimum length dqc in progress...") 223 | dict_min_length = get_min_length_val(df_source) 224 | log_dqc_checks(dict_min_length, "MIN_LENGTH_DQC", DB_NAME, TABLE_NAME, STAGE) 225 | print("Minimum length dqc completed.") 226 | 227 | print("Special character dqc in progress...") 228 | dict_special_character = special_character_check(df_source) 229 | log_dqc_checks(dict_special_character, "SPECIAL_CHAR_DQC", DB_NAME, TABLE_NAME, STAGE) 
230 | print("Special character dqc completed.") 231 | 232 | # COMMAND ---------- 233 | 234 | dqc_pipeline(df) 235 | 236 | # COMMAND ---------- 237 | 238 | # MAGIC %sql 239 | # MAGIC SELECT * FROM data_profiling ORDER BY dqc_name 240 | 241 | # COMMAND ---------- 242 | 243 | # MAGIC %sql 244 | # MAGIC SELECT dqc_name, COUNT(dqc_value) FROM data_profiling WHERE (dqc_value) > 1 GROUP BY dqc_name 245 | 246 | # COMMAND ---------- 247 | 248 | # MAGIC %md 249 | # MAGIC ### What we will do? 250 | # MAGIC We are trying to profile our data which mean we are trying to record table statistic so that we can compare it later with post transformation data. 251 | # MAGIC 252 | # MAGIC e.g. You have a unprocessed table, it has 5 columns & 100000 records. So we will calculate few basic statistics like total row count, total number of null values for each column, total distinct count, duplicate rows, datatype etc. 253 | # MAGIC 254 | # MAGIC ### How it is useful? 255 | # MAGIC Once you record these stats, then later we can calculate same statistic after cleaning data. In this way we can compare pre & post transformation stats to see how our data has changed during data pipeline. 256 | 257 | # COMMAND ---------- 258 | 259 | # MAGIC %sql 260 | # MAGIC DROP DATABASE DB_DQC CASCADE; 261 | 262 | # COMMAND ---------- 263 | 264 | # MAGIC %run ../SETUP/_pyspark_clean_up 265 | -------------------------------------------------------------------------------- /PySpark_ETL/PS15-Data Caching.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Caching - Optimize spark performance 4 | # MAGIC Data caching is a very important technique when it comes to optimize your spark performance. Sometimes we have to reuse our big dataframe multiple times. It is not always prefered to load them frequently. SO what else we can do? 5 | # MAGIC 6 | # MAGIC We can create a local or cached copy to quickly retreive data from it. In this case our original dataframe will not be used. We will be using the cached copy. There may be the cases, new updates are there in original dataframe but you still have stale or older version of data cached. We have to take care of it also. 7 | # MAGIC 8 | # MAGIC ### Databricks provides two types of caching: 9 | # MAGIC 1. Spark Caching 10 | # MAGIC 1. Delta Caching 11 | # MAGIC 12 | # MAGIC ### Delta and Apache Spark caching 13 | # MAGIC 14 | # MAGIC The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. 15 | # MAGIC 16 | # MAGIC The Delta cache works for all Parquet files and is not limited to Delta Lake format files. The Delta cache supports reading Parquet files in Amazon S3, DBFS, HDFS, Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2. It does not support other storage formats such as CSV, JSON, and ORC. 17 | # MAGIC 18 | # MAGIC Here are the characteristics of each type: 19 | # MAGIC 20 | # MAGIC * __Type of stored data__: The Delta cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries. 
The Spark cache can store the result of any subquery data and data stored in formats other than Parquet (such as CSV, JSON, and ORC). 21 | # MAGIC 22 | # MAGIC * __Performance__: The data stored in the Delta cache can be read and operated on faster than the data in the Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation. 23 | # MAGIC 24 | # MAGIC * __Automatic vs manual control__: When the Delta cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE SELECT command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache. 25 | # MAGIC 26 | # MAGIC * __Disk vs memory-based__: The Delta cache is stored on the local disk, so that memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory. 27 | # MAGIC 28 | # MAGIC *Attention: In this chapter you may not able to see caching effect as our dataset is very small. To see a significant difference, your dataset must be big and must have complex processing. so this notebook is just to show you how we can use caching.* 29 | # MAGIC 30 | # MAGIC __for more details visits:__ 31 | # MAGIC 32 | # MAGIC https://docs.databricks.com/delta/optimizations/delta-cache.html 33 | 34 | # COMMAND ---------- 35 | 36 | # MAGIC %run ../SETUP/_pyspark_init_setup 37 | 38 | # COMMAND ---------- 39 | 40 | # MAGIC %md 41 | # MAGIC ### PySpark Caching 42 | # MAGIC We have two methods to perform pyspark caching: 43 | # MAGIC 1. cache() 44 | # MAGIC 1. persist() 45 | # MAGIC 46 | # MAGIC Both caching and persisting are used to save the Spark RDD, Dataframe, and Dataset’s. But, the difference is, RDD cache() method default saves it to memory (MEMORY_ONLY) whereas persist() method is used to store it to the user-defined storage level. 
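# MAGIC
# MAGIC To release a cached copy once it is no longer needed (for example, after the source dataframe has changed), call df.unpersist(); otherwise the cached data keeps occupying memory and/or disk.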
47 | # MAGIC 48 | # MAGIC __Storage class__ 49 | # MAGIC 50 | # MAGIC class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1) 51 | # MAGIC 52 | # MAGIC Storage level in persist(): 53 | # MAGIC 54 | # MAGIC Now, to decide the storage of RDD, there are different storage levels, which are given below − 55 | # MAGIC 56 | # MAGIC * DISK_ONLY = StorageLevel(True, False, False, False, 1) 57 | # MAGIC * DISK_ONLY_2 = StorageLevel(True, False, False, False, 2) 58 | # MAGIC * MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1) 59 | # MAGIC * MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2) 60 | # MAGIC * MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1) 61 | # MAGIC * MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2) 62 | # MAGIC * MEMORY_ONLY = StorageLevel(False, True, False, False, 1) 63 | # MAGIC * MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2) 64 | # MAGIC * MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1) 65 | # MAGIC * MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2) 66 | # MAGIC * OFF_HEAP = StorageLevel(True, True, True, False, 1) 67 | 68 | # COMMAND ---------- 69 | 70 | from pyspark.sql.types import StructField, StructType, StringType, DecimalType, IntegerType 71 | 72 | # COMMAND ---------- 73 | 74 | # We are using a game steam dataset. 75 | custom_schema = StructType( 76 | [ 77 | StructField("gamer_id", IntegerType(), True), 78 | StructField("game", StringType(), True), 79 | StructField("behaviour", StringType(), True), 80 | StructField("play_hours", DecimalType(), True), 81 | StructField("rating", IntegerType(), True) 82 | ]) 83 | df = spark.read.option("header", "true").schema(custom_schema).csv('/FileStore/datasets/steam-200k.csv') 84 | display(df) 85 | 86 | # COMMAND ---------- 87 | 88 | from pyspark.sql.functions import col 89 | 90 | # COMMAND ---------- 91 | 92 | df_pre_cache = df\ 93 | .groupBy("game", "behaviour")\ 94 | .mean("play_hours") 95 | 96 | 97 | 98 | # COMMAND ---------- 99 | 100 | display(df_pre_cache) 101 | 102 | # COMMAND ---------- 103 | 104 | df_post_cache = df_pre_cache.cache() 105 | 106 | # COMMAND ---------- 107 | 108 | display(df_post_cache) 109 | 110 | # COMMAND ---------- 111 | 112 | # What will happen if we make changes in original dataset? Will it automaticlly update the cached copy. Let's drop a column in original datset and then compare both. 113 | 114 | # step1: cache the original copy 115 | df_cach = df.cache() 116 | df = df.drop("gamer_id") 117 | 118 | 119 | # COMMAND ---------- 120 | 121 | display(df.limit(2)) 122 | display(df_cach.limit(2)) 123 | 124 | # so you can see cached data does not gets updated. So always keep this in mind. If you have any changes in original dataframe then you have to delete the cached copy and create a new cache. Do not forget to delete the previous one as it will take extra storage. Always use caching for the dataset which does not gets updated frequently. 125 | 126 | # COMMAND ---------- 127 | 128 | # MAGIC %md 129 | # MAGIC ### Persist() 130 | # MAGIC Persist is like cache but in this case you can define custom storage level. 131 | 132 | # COMMAND ---------- 133 | 134 | from pyspark.storagelevel import StorageLevel 135 | df_persist = df.persist( StorageLevel.MEMORY_AND_DISK_2) 136 | 137 | # COMMAND ---------- 138 | 139 | display(df_persist) 140 | 141 | # COMMAND ---------- 142 | 143 | # MAGIC %md 144 | # MAGIC ### Delta Caching 145 | # MAGIC Delta caching is stored as local file on worker node. 
The Delta cache automatically detects when data files are created or deleted and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data. 146 | # MAGIC 147 | # MAGIC The Delta cache automatically detects files that have been modified or overwritten after being cached. Any stale entries are automatically invalidated and evicted from the cache. 148 | # MAGIC 149 | # MAGIC You have to enable delta caching using: 150 | # MAGIC 151 | # MAGIC spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]") 152 | 153 | # COMMAND ---------- 154 | 155 | spark.conf.get("spark.databricks.io.cache.enabled") # it is default in my case. let's enable this 156 | 157 | # COMMAND ---------- 158 | 159 | spark.conf.set("spark.databricks.io.cache.enabled", "true") 160 | 161 | # COMMAND ---------- 162 | 163 | # Lets create a delta table and see how delta caching works 164 | df.write.format("delta").mode("overwrite").saveAsTable("default.gamestats") 165 | 166 | # COMMAND ---------- 167 | 168 | # MAGIC %sql -- we have our table 169 | # MAGIC SELECT * FROM default.gamestats 170 | # MAGIC WHERE play_hours > 10 171 | 172 | # COMMAND ---------- 173 | 174 | # MAGIC %sql 175 | # MAGIC -- lets cache above query using delta caching 176 | # MAGIC CACHE 177 | # MAGIC SELECT * FROM default.gamestats 178 | # MAGIC WHERE play_hours > 10 179 | 180 | # COMMAND ---------- 181 | 182 | # MAGIC %sql 183 | # MAGIC -- After caching this query will take lesser time. 184 | # MAGIC SELECT * FROM default.gamestats 185 | # MAGIC WHERE play_hours > 10 186 | 187 | # COMMAND ---------- 188 | 189 | # MAGIC %sql 190 | # MAGIC -- now lets make some changes in our table and then rerun the same query. Will cache be able to catch new changes? 191 | # MAGIC UPDATE default.gamestats 192 | # MAGIC SET rating = CASE WHEN play_hours <5 THEN 2.5 193 | # MAGIC WHEN play_hours >=5 AND play_hours<10 THEN 3.5 194 | # MAGIC ELSE 4.8 END 195 | 196 | # COMMAND ---------- 197 | 198 | # MAGIC %sql 199 | # MAGIC -- so now we have updated rating column. Let;s see whether these new changes will be available in our cached copy? 200 | # MAGIC SELECT * FROM default.gamestats LIMIT 10; 201 | # MAGIC -- Yes, we can see updated changes in our cached query result. 202 | 203 | # COMMAND ---------- 204 | 205 | # MAGIC %sql 206 | # MAGIC SELECT * FROM default.gamestats 207 | # MAGIC WHERE play_hours > 10 208 | 209 | # COMMAND ---------- 210 | 211 | # MAGIC %run ../SETUP/_pyspark_clean_up 212 | 213 | # COMMAND ---------- 214 | 215 | 216 | -------------------------------------------------------------------------------- /PySpark_ETL/PS16-User Defined Functions.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### User defined functions (UDF) 4 | # MAGIC UDF will allow us to apply the functions directly in the dataframes and SQL databases in python, without making them registering individually. It can also help us to create new columns to our dataframe, by applying a function via UDF to the dataframe column(s), hence it will extend our functionality of dataframe. It can be created using the udf() method. 
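# COMMAND ----------

# For reference, two other common ways to declare a UDF (a sketch - the notebook below builds its
# UDF with the plain udf() call instead). Passing returnType documents what the UDF returns
# (udf() defaults to StringType), and spark.udf.register() makes the same function callable from SQL.
# The helper below (order_prefix) is hypothetical and used only for this illustration.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def order_prefix(order_id):
    # return the part before the dash, e.g. "B" for "B-25601"
    return order_id.split("-")[0] if order_id and "-" in order_id else "NA"

# expose it to %sql cells under the name order_prefix_sql
spark.udf.register("order_prefix_sql", order_prefix)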
5 | 6 | # COMMAND ---------- 7 | 8 | # MAGIC %run ../SETUP/_pyspark_init_setup 9 | 10 | # COMMAND ---------- 11 | 12 | df = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 13 | display(df) 14 | 15 | # COMMAND ---------- 16 | 17 | # MAGIC %md 18 | # MAGIC Let's create a UDF to extract order number from order id column. for example: In B-25601, 25601 is the order id. 19 | 20 | # COMMAND ---------- 21 | 22 | # Step1: first create a python function 23 | def extract_order_no(order_id): 24 | if order_id != None and '-' in order_id: 25 | return order_id.split("-")[1] 26 | else: 27 | return 'NA' 28 | 29 | 30 | # COMMAND ---------- 31 | 32 | print(extract_order_no("B-25601")) 33 | print(extract_order_no("25601")) 34 | 35 | # COMMAND ---------- 36 | 37 | # Step 2: convert python function to udf 38 | from pyspark.sql.functions import udf, col 39 | from pyspark.sql.types import StringType 40 | extract_order_no = udf(extract_order_no) 41 | 42 | # COMMAND ---------- 43 | 44 | df_trans = df.withColumn("order_number", extract_order_no(col("Order ID")) ) 45 | display(df_trans) 46 | 47 | # COMMAND ---------- 48 | 49 | # MAGIC %run ../SETUP/_pyspark_clean_up 50 | -------------------------------------------------------------------------------- /PySpark_ETL/PS17-Write Data.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Write DataFrame 4 | # MAGIC This is one of the most important but easy topic in ETL pipelines. write data frame is "L - Load" in ETL. After you transform your data, you need t write it to db or some location (datalake, dbfs). We can use write function to do so. 5 | # MAGIC 6 | # MAGIC 7 | # MAGIC df.write.mode("overwrite/append/ignore").csv("/path/file/") 8 | # MAGIC 9 | # MAGIC __We have three write modes:__ 10 | # MAGIC * __append__: Append content of the dataframe to existing data or table. 11 | # MAGIC * __overwrite__: Overwrite existing data with the content of dataframe. 12 | # MAGIC * __ignore__: Ignore current write operation if data / table already exists without any error 13 | # MAGIC * __error or errorifexists__: (default case): Throw an exception if data already exists. 14 | 15 | # COMMAND ---------- 16 | 17 | # MAGIC %run ../SETUP/_pyspark_init_setup 18 | 19 | # COMMAND ---------- 20 | 21 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 22 | df_od = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderdetails.csv") 23 | 24 | display(df_ol.limit(3)) 25 | display(df_od.limit(3)) 26 | 27 | # COMMAND ---------- 28 | 29 | # MAGIC %md 30 | # MAGIC ### Problem 31 | # MAGIC Let's say our business user wants us to get all the orders which are above $500 & category is Clothing in Maharashtra state. 
32 | 33 | # COMMAND ---------- 34 | 35 | 36 | df_mah = df_ol\ 37 | .join(df_od, df_ol["Order ID"] == df_od["Order ID"], "inner")\ 38 | .filter((df_od["Category"] == "Clothing") & (df_od["Amount"] > 500) & (df_ol["State"]=="Maharashtra"))\ 39 | .withColumn("Amount", df_od["Amount"].cast("decimal(10, 2)"))\ 40 | .withColumn("order_id", df_ol["Order ID"])\ 41 | .select("order_id", "State", "City", "Category", "Amount") 42 | display(df_mah) 43 | 44 | # COMMAND ---------- 45 | 46 | df_mah.printSchema() # so you can see we have rename column & Amount type is changed to decimal(10, 2) 47 | 48 | # COMMAND ---------- 49 | 50 | # MAGIC %md 51 | # MAGIC ### Write as parquet files 52 | 53 | # COMMAND ---------- 54 | 55 | # So now after doing this analysis & transformation we need to write this result to somewhere. For the sake of demo I will use delta lake to write the output. 56 | 57 | # Write parquet files 58 | df_mah.write.mode("overwrite").parquet("/FileStore/output/ClothingSalesMah_par/") 59 | 60 | # COMMAND ---------- 61 | 62 | # MAGIC %md 63 | # MAGIC ### Write as CSV 64 | 65 | # COMMAND ---------- 66 | 67 | 68 | # Write CSV files 69 | df_mah.write.mode("overwrite").csv("/FileStore/output/ClothingSalesMah_csv/") 70 | 71 | # COMMAND ---------- 72 | 73 | # MAGIC %md 74 | # MAGIC ### Write as JSON 75 | 76 | # COMMAND ---------- 77 | 78 | # Write JSON files 79 | df_mah.write.mode("overwrite").json("/FileStore/output/ClothingSalesMah_json/") 80 | 81 | # COMMAND ---------- 82 | 83 | # MAGIC %md 84 | # MAGIC ### Write as delta table (Managed) 85 | 86 | # COMMAND ---------- 87 | 88 | # Write delta lake table. This will create a delta table in default db (you can choose any db). The table name is OrderSalesMah. 89 | #df.write.format("delta").saveAsTable("default.people10m") 90 | df_mah.write.format("delta").saveAsTable("default.OrderSalesMah") 91 | 92 | # COMMAND ---------- 93 | 94 | # MAGIC %sql 95 | # MAGIC -- you can query above table 96 | # MAGIC SELECT * FROM default.OrderSalesMah LIMIT 10; 97 | 98 | # COMMAND ---------- 99 | 100 | # MAGIC %sql 101 | # MAGIC -- You can observer type, location and owner. It tells us that this is a managed delta table. 102 | # MAGIC DESCRIBE EXTENDED OrderSalesMah 103 | 104 | # COMMAND ---------- 105 | 106 | # MAGIC %md 107 | # MAGIC ### Write to delta lake 108 | 109 | # COMMAND ---------- 110 | 111 | ## Write to delta lake. It will save by default in parquet 112 | df_mah.write.format("delta").mode("overwrite").save("/FileStore/output/ClothingSalesMah_delta/") 113 | 114 | # COMMAND ---------- 115 | 116 | # MAGIC %run ../SETUP/_pyspark_clean_up 117 | -------------------------------------------------------------------------------- /PySpark_ETL/Z01- Case Study Sales Order Analysis.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Case Study - Sales Order Analysis 4 | # MAGIC 5 | # MAGIC ![SALES_ORDER](https://raw.githubusercontent.com/martandsingh/images/master/case_study_1.jpg) 6 | # MAGIC 7 | # MAGIC We have Order-Sales dataset. 
It includes three dataset: 8 | # MAGIC 9 | # MAGIC __Order List - it contains the list of all the order with amount, city, state.__ 10 | # MAGIC 11 | # MAGIC Order List 12 | # MAGIC * Order ID: string (nullable = true) 13 | # MAGIC * Order Date: string (nullable = true) 14 | # MAGIC * CustomerName: string (nullable = true) 15 | # MAGIC * State: string (nullable = true) 16 | # MAGIC * City: string (nullable = true) 17 | # MAGIC 18 | # MAGIC 19 | # MAGIC __Order Details - detail of the order. Order list has 1-to-many relationship with this dataset.__ 20 | # MAGIC 21 | # MAGIC Order Details 22 | # MAGIC * Order ID: string (nullable = true) 23 | # MAGIC * Amount: string (nullable = true) 24 | # MAGIC * Profit: string (nullable = true) 25 | # MAGIC * Quantity: string (nullable = true) 26 | # MAGIC * Category: string (nullable = true) 27 | # MAGIC * Sub-Category: string (nullable = true) 28 | # MAGIC 29 | # MAGIC 30 | # MAGIC __Sales Target - This contains the monthly sales target of product category.__ 31 | # MAGIC 32 | # MAGIC Sales Target 33 | # MAGIC * Month of Order Date: string (nullable = true) 34 | # MAGIC * Category: string (nullable = true) 35 | # MAGIC * Target: string (nullable = true) 36 | # MAGIC 37 | # MAGIC 38 | # MAGIC __We will try to answer following question asked by our business user:__ 39 | # MAGIC 1. Top 10 most selling categories & sub-categories (based on number of orders). 40 | # MAGIC 1. Which order has the highest & lowest profit. 41 | # MAGIC 1. Top 10 states & cities with highest total bill amount 42 | # MAGIC 1. In which month & year we received most number of orders with total amount (show top 10). 43 | # MAGIC 1. Which category fullfiled the month target. Add one extra column "IsTargetCompleted" with values Yes or No. 44 | 45 | # COMMAND ---------- 46 | 47 | # MAGIC %md 48 | # MAGIC ### Load datasets 49 | 50 | # COMMAND ---------- 51 | 52 | # MAGIC %run ../SETUP/_pyspark_init_setup 53 | 54 | # COMMAND ---------- 55 | 56 | df_ol = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderlist.csv") 57 | df_od = spark.read.option("header", "true").csv("/FileStore/datasets/sales/orderdetails.csv") 58 | df_st = spark.read.option("header", "true").csv("/FileStore/datasets/sales/salestarget.csv") 59 | 60 | # COMMAND ---------- 61 | 62 | # MAGIC %md 63 | # MAGIC #### Check Schema 64 | 65 | # COMMAND ---------- 66 | 67 | df_ol.printSchema() 68 | df_od.printSchema() 69 | df_st.printSchema() 70 | 71 | # COMMAND ---------- 72 | 73 | # MAGIC %md 74 | # MAGIC #### Top 10 most selling categories & sub-categories. 75 | 76 | # COMMAND ---------- 77 | 78 | from pyspark.sql.functions import col, lit 79 | 80 | # COMMAND ---------- 81 | 82 | df_most_selling_cat = df_od \ 83 | .groupBy("category", "Sub-Category")\ 84 | .count()\ 85 | .withColumnRenamed("count", "total_records")\ 86 | .orderBy(col("total_records").desc())\ 87 | .limit(10) 88 | display(df_most_selling_cat) 89 | 90 | # COMMAND ---------- 91 | 92 | # MAGIC %md 93 | # MAGIC #### Which order has the highest & lowest profit. 
94 | 95 | # COMMAND ---------- 96 | 97 | df_order_profit= df_od\ 98 | .withColumn("Profit_Numeric", col('Profit').cast("decimal") )\ 99 | .groupBy("Order ID")\ 100 | .sum("Profit_Numeric").withColumnRenamed("sum(Profit_Numeric)", "total_profit")\ 101 | 102 | lowest = df_order_profit\ 103 | .withColumn("Type", lit("Lowest"))\ 104 | .orderBy("total_profit")\ 105 | .limit(1)\ 106 | 107 | 108 | highest= df_order_profit\ 109 | .withColumn("Type", lit("Highest"))\ 110 | .orderBy(col("total_profit").desc())\ 111 | .limit(1)\ 112 | 113 | 114 | df_profit_stats =lowest.union(highest) 115 | display(df_profit_stats) 116 | 117 | # COMMAND ---------- 118 | 119 | # MAGIC %md 120 | # MAGIC #### Top 10 states & cities with highest total bill amount 121 | 122 | # COMMAND ---------- 123 | 124 | df_high_city = df_ol\ 125 | .join(df_od, df_ol["Order ID"] == df_od["Order ID"], "inner")\ 126 | .selectExpr("State", "City", "CAST(Amount AS Decimal) AS amount_decimal")\ 127 | .groupBy("State", "City")\ 128 | .sum("amount_decimal")\ 129 | .withColumnRenamed("sum(amount_decimal)", "total_amount")\ 130 | .orderBy(col("total_amount").desc())\ 131 | .limit(10) 132 | 133 | 134 | display(df_high_city) 135 | 136 | # COMMAND ---------- 137 | 138 | # MAGIC %md 139 | # MAGIC #### In which month & year we received most number of orders with total amount (show top 10) 140 | 141 | # COMMAND ---------- 142 | 143 | # to do this first we have to add a new columne which contain order date in date format 144 | df_date = df_ol\ 145 | .join(df_od, df_ol["Order ID"] == df_od["Order ID"], "inner")\ 146 | .select(df_ol["Order ID"], "Order Date", "Amount") 147 | display(df_date) 148 | 149 | # COMMAND ---------- 150 | 151 | df_date.printSchema() 152 | 153 | # COMMAND ---------- 154 | 155 | from pyspark.sql.functions import to_date, month, year, date_format 156 | 157 | # COMMAND ---------- 158 | 159 | df_year_month = df_date\ 160 | .withColumn("order_date", to_date("Order Date", "dd-MM-yyyy"))\ 161 | .withColumn("order_month", date_format("order_date", "MMM"))\ 162 | .withColumn("order_year", year("order_date"))\ 163 | .groupBy("order_year", "order_month")\ 164 | .agg({"Amount":"sum", "Order ID":"count"})\ 165 | .withColumnRenamed("count(Order ID)", "order_count")\ 166 | .withColumnRenamed("sum(Amount)", "total_amount")\ 167 | .orderBy(col("order_count").desc(), col("total_amount").desc())\ 168 | .limit(10) 169 | 170 | 171 | display(df_year_month) 172 | 173 | # COMMAND ---------- 174 | 175 | # MAGIC %md 176 | # MAGIC #### Which category fullfiled the month target. Add one extra column "IsTargetCompleted" with values Yes or No. 
177 |
178 | # COMMAND ----------
179 |
180 | from pyspark.sql.functions import concat_ws, substring, when
181 |
182 | # COMMAND ----------
183 |
184 | df_order_details = df_ol\
185 |     .join(df_od, df_ol["Order ID"]==df_od["Order ID"], "inner")\
186 |     .select(df_ol["Order ID"], "Order Date", "Amount", "Category")\
187 |     .withColumn("order_date", to_date("Order Date", "dd-MM-yyyy"))\
188 |     .withColumn("target_month"\
189 |                 , concat_ws("-", date_format("order_date", "MMM"), substring(year("order_date"), 3, 2) ) )\
190 |     .withColumn("amount_decimal", col("Amount").cast("decimal"))\
191 |     .groupBy("target_month", "Category")\
192 |     .sum("amount_decimal")\
193 |     .withColumnRenamed("sum(amount_decimal)", "total_month_sales_amount")
194 |
195 | df_final_target = df_order_details\
196 |     .join(df_st\
197 |           , (df_order_details["target_month"]==df_st["Month of Order Date"]) &\
198 |           (df_order_details["Category"]==df_st["Category"]), "inner")\
199 |     .select(\
200 |         df_order_details["Category"]\
201 |         , "target_month"\
202 |         , "total_month_sales_amount"\
203 |         , "Target")\
204 |     .withColumn("TargetAchieved", when(col("total_month_sales_amount") >= col("Target"), "Yes")\
205 |                 .otherwise("No"))
206 |
207 |
208 | display(df_final_target)
209 |
210 | # COMMAND ----------
211 |
212 | # you can see how many categories achieved their targets
213 | display(df_final_target.groupBy("TargetAchieved").count())
214 |
215 | # COMMAND ----------
216 |
217 | # you can see which categories achieved their targets, with the Category name
218 | display(df_final_target.groupBy("Category", "TargetAchieved").count().orderBy("Category"))
219 |
220 | # COMMAND ----------
221 |
222 | # MAGIC %run ../SETUP/_pyspark_clean_up
223 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Data Engineering Using Azure Databricks
2 |
3 | ### Introduction
4 |
5 | This course includes multiple sections. We are mainly focusing on the Databricks Data Engineer certification exam. We have the following tutorials:
6 | 1. Spark SQL ETL
7 | 2. PySpark ETL
8 |
9 | ### DATASETS
10 | All the datasets used in the tutorials are available at: https://github.com/martandsingh/datasets
11 |
12 | ### HOW TO USE?
13 | Follow the article below to learn how to clone this repository to your Databricks workspace.
14 |
15 | https://www.linkedin.com/pulse/databricks-clone-github-repo-martand-singh/
16 |
17 | ### Spark SQL
18 | This course is the first installment of the Databricks data engineering course. In this course you will learn basic SQL concepts, which include:
19 | 1. Create, Select, Update, Delete tables
20 | 1. Create database
21 | 1. Filtering data
22 | 1. Group by & aggregation
23 | 1. Ordering
24 | 1. [SQL joins](https://www.scaler.com/topics/sql/joins-in-sql/)
25 | 1. Common table expression (CTE)
26 | 1. External tables
27 | 1. [Sub queries](https://www.geeksforgeeks.org/sql-subquery/)
28 | 1. Views & temp views
29 | 1. UNION, INTERSECT, EXCEPT keywords
30 | 1. Versioning, time travel & optimization
31 |
32 | ### PySpark ETL
33 | This course will teach you how to build ETL pipelines using PySpark. ETL stands for Extract, Transform & Load. We will see how to load data from various sources, process it, and finally load the processed data to our destination.
34 |
35 | This course includes:
36 | 1. Read files
37 | 2. Schema handling
38 | 3. Handling JSON files
39 | 4. Write files
40 | 5. Basic transformations
41 | 6. partitioning
42 | 7.
caching 43 | 8. joins 44 | 9. missing value handling 45 | 10. Data profiling 46 | 11. date time functions 47 | 12. string function 48 | 13. deduplication 49 | 14. grouping & aggregation 50 | 15. User defined functions 51 | 16. Ordering data 52 | 17. Case study - sales order analysis 53 | 54 | 55 | 56 | you can download all the notebook from our 57 | 58 | github repo: https://github.com/martandsingh/ApacheSpark 59 | 60 | facebook: https://www.facebook.com/codemakerz 61 | 62 | email: martandsays@gmail.com 63 | 64 | ### SETUP folder 65 | you will see initial_setup & clean_up notebooks called in every notebooks. It is mandatory to run both the scripts in defined order. initial script will create all the mandatory tables & database for the demo. After you finish your notebook, execute clean up notebook, it will clean all the db objects. 66 | 67 | pyspark_init_setup - this notebook will copy dataset from my github repo to dbfs. It will also generate used car parquet dataset. All the datasets will be avalable at 68 | 69 | **/FileStore/datasets** 70 | 71 | 72 | ![d5859667-databricks-logo](https://user-images.githubusercontent.com/32331579/174993501-dc93102a-ec36-4607-a3dc-ab67a54a341b.png) 73 | -------------------------------------------------------------------------------- /SETUP/_clean_up.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | print('Cleaning up all the database & tables....') 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %sql 7 | # MAGIC DROP DATABASE IF EXISTS DB_DEMO CASCADE; 8 | 9 | # COMMAND ---------- 10 | 11 | print('Tables & database deleted sucessfully.') 12 | -------------------------------------------------------------------------------- /SETUP/_initial_setup.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %run ./_setup_database 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %run ./_setup_demo_table 7 | -------------------------------------------------------------------------------- /SETUP/_pyspark_clean_up.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | dbutils.fs.rm('/FileStore/datasets/', True) 3 | 4 | # COMMAND ---------- 5 | 6 | 7 | 8 | # COMMAND ---------- 9 | 10 | 11 | -------------------------------------------------------------------------------- /SETUP/_pyspark_init_setup.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | #spark.conf.set("da", self.username) 3 | 4 | # Defining datafile paths 5 | CANCER_FILE_NAME='cancer.csv' 6 | UNECE_FILE_NAME='unece.json' 7 | USED_CAR_FILE_NAME='used_cars_nested.json' 8 | MALL_CUSTOMER_FILE_NAME = 'Mall_Customers.csv' 9 | DBFS_DATASET_LOCATION = '/FileStore/datasets/' 10 | CANCER_CSV_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + CANCER_FILE_NAME # CSV 11 | UNECE_JSON_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + UNECE_FILE_NAME # simple JSON 12 | USED_CAR_JSON_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + USED_CAR_FILE_NAME # complex JSON 13 | MALL_CUSTOMER_PATH='https://raw.githubusercontent.com/martandsingh/datasets/master/' + MALL_CUSTOMER_FILE_NAME 14 | DBFS_PARQUET_FILE = '/FileStore/datasets/USED_CAR_PARQUET/' 15 | HOUSE_PRICE_FILE = 'missing_val_dataset.csv' 16 | HOUSE_PRICE_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' 
+ HOUSE_PRICE_FILE 17 | GAME_STREAM_FILE = 'steam-200k.csv' 18 | GAME_STREAM_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/' + GAME_STREAM_FILE 19 | ORDER_DETAIL_FILE = 'orderdetails.csv' 20 | ORDER_DETAIL_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/Sales-Order/'+ORDER_DETAIL_FILE 21 | ORDER_LIST_FILE = 'orderlist.csv' 22 | ORDER_LIST_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/Sales-Order/' + ORDER_LIST_FILE 23 | SALES_TARGET_FILE = 'salestarget.csv' 24 | SALES_TARGET_PATH = 'https://raw.githubusercontent.com/martandsingh/datasets/master/Sales-Order/'+SALES_TARGET_FILE 25 | DA = { 26 | "ORDER_DETAIL_FILE": ORDER_DETAIL_FILE, 27 | "ORDER_DETAIL_PATH": ORDER_DETAIL_PATH, 28 | "ORDER_LIST_FILE":ORDER_LIST_FILE, 29 | "ORDER_LIST_PATH":ORDER_LIST_PATH, 30 | "SALES_TARGET_FILE": SALES_TARGET_FILE, 31 | "SALES_TARGET_PATH": SALES_TARGET_PATH, 32 | "CANCER_FILE_NAME": CANCER_FILE_NAME, 33 | "UNECE_FILE_NAME": UNECE_FILE_NAME, 34 | "USED_CAR_FILE_NAME": USED_CAR_FILE_NAME, 35 | "CANCER_CSV_PATH": CANCER_CSV_PATH, 36 | "UNECE_JSON_PATH": UNECE_JSON_PATH, 37 | "USED_CAR_JSON_PATH": USED_CAR_JSON_PATH, 38 | "DBFS_DATASET_LOCATION": DBFS_DATASET_LOCATION, 39 | "DBFS_PARQUET_FILE": DBFS_PARQUET_FILE, 40 | "MALL_CUSTOMER_FILE_NAME": MALL_CUSTOMER_FILE_NAME, 41 | "MALL_CUSTOMER_PATH": MALL_CUSTOMER_PATH, 42 | "HOUSE_PRICE_FILE": HOUSE_PRICE_FILE, 43 | "HOUSE_PRICE_PATH": HOUSE_PRICE_PATH, 44 | "GAME_STREAM_FILE": GAME_STREAM_FILE, 45 | "GAME_STREAM_PATH": GAME_STREAM_PATH 46 | } 47 | 48 | 49 | 50 | # COMMAND ---------- 51 | 52 | print('Loading data files...') 53 | 54 | # COMMAND ---------- 55 | 56 | dbutils.notebook.run(path="../SETUP/_pyspark_setup_files", timeout_seconds=60, arguments= DA) 57 | 58 | # COMMAND ---------- 59 | 60 | print('Data files loaded.') 61 | -------------------------------------------------------------------------------- /SETUP/_pyspark_setup_files.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### INTIAL SETUP 4 | # MAGIC This will copy files from github repository to your DBFS. 5 | # MAGIC 6 | # MAGIC Repo: https://github.com/martandsingh/datasets 7 | # MAGIC 8 | # MAGIC You can customize your DBFS location by changin DBFS_DATASET_LOCATION variable. 
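For example, a sketch of such an override (the folder name here is hypothetical): change the DBFS_DATASET_LOCATION entry in the DA dictionary built by _pyspark_init_setup before this notebook is invoked, and the widgets read below will pick up the new folder. Note that the later tutorials read from /FileStore/datasets/ directly, so they would need the same change.

```python
# Hypothetical override: copy the datasets into a different DBFS folder.
# DA is the arguments dictionary defined in _pyspark_init_setup; everything else stays the same.
DA["DBFS_DATASET_LOCATION"] = "/FileStore/my_datasets/"
dbutils.notebook.run(path="../SETUP/_pyspark_setup_files", timeout_seconds=60, arguments=DA)
```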
9 | 10 | # COMMAND ---------- 11 | 12 | 13 | 14 | # COMMAND ---------- 15 | 16 | dbutils.fs.mkdirs('/FileStore/datasets') 17 | 18 | # COMMAND ---------- 19 | 20 | cancer_file= dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("CANCER_FILE_NAME") 21 | unece_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("UNECE_FILE_NAME") 22 | used_car_file=dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("USED_CAR_FILE_NAME") 23 | mall_customer_file=dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("MALL_CUSTOMER_FILE_NAME") 24 | house_price_file=dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("HOUSE_PRICE_FILE") 25 | game_stream_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+dbutils.widgets.get("GAME_STREAM_FILE") 26 | order_list_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+'/sales/'+dbutils.widgets.get("ORDER_LIST_FILE") 27 | order_details_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+'/sales/'+dbutils.widgets.get("ORDER_DETAIL_FILE") 28 | sales_target_file = dbutils.widgets.get("DBFS_DATASET_LOCATION")+'/sales/'+dbutils.widgets.get("SALES_TARGET_FILE") 29 | print(cancer_file) 30 | print(unece_file) 31 | print(used_car_file) 32 | 33 | 34 | # COMMAND ---------- 35 | 36 | 37 | 38 | # COMMAND ---------- 39 | 40 | dbutils.fs.cp(dbutils.widgets.get("CANCER_CSV_PATH"), cancer_file) 41 | 42 | dbutils.fs.cp(dbutils.widgets.get("UNECE_JSON_PATH"), unece_file) 43 | 44 | dbutils.fs.cp(dbutils.widgets.get("USED_CAR_JSON_PATH"), used_car_file) 45 | 46 | dbutils.fs.cp(dbutils.widgets.get("MALL_CUSTOMER_PATH"), mall_customer_file) 47 | 48 | dbutils.fs.cp(dbutils.widgets.get("HOUSE_PRICE_PATH"), house_price_file) 49 | 50 | dbutils.fs.cp(dbutils.widgets.get("GAME_STREAM_PATH"), game_stream_file) 51 | 52 | dbutils.fs.cp(dbutils.widgets.get("ORDER_LIST_PATH"), order_list_file) 53 | 54 | dbutils.fs.cp(dbutils.widgets.get("ORDER_DETAIL_PATH"), order_details_file) 55 | 56 | dbutils.fs.cp(dbutils.widgets.get("SALES_TARGET_PATH"), sales_target_file) 57 | 58 | 59 | # COMMAND ---------- 60 | 61 | from pyspark.sql.functions import explode, col 62 | 63 | # COMMAND ---------- 64 | 65 | parquet_path = dbutils.widgets.get("DBFS_PARQUET_FILE") 66 | print("Writing parquet file to "+ parquet_path) 67 | df = spark \ 68 | .read \ 69 | .option("multiline", "true")\ 70 | .json(used_car_file) 71 | 72 | df_exploded = df \ 73 | .withColumn("usedCars", explode(df["usedCars"])) 74 | 75 | df_clean = df_exploded \ 76 | .withColumn("vehicle_type", col("usedCars")["@type"])\ 77 | .withColumn("body_type", col("usedCars")["bodyType"])\ 78 | .withColumn("brand_name", col("usedCars")["brand"]["name"])\ 79 | .withColumn("color", col("usedCars")["color"])\ 80 | .withColumn("description", col("usedCars")["description"])\ 81 | .withColumn("model", col("usedCars")["model"])\ 82 | .withColumn("manufacturer", col("usedCars")["manufacturer"])\ 83 | .withColumn("ad_title", col("usedCars")["name"])\ 84 | .withColumn("currency", col("usedCars")["priceCurrency"])\ 85 | .withColumn("seller_location", col("usedCars")["sellerLocation"])\ 86 | .withColumn("displacement", col("usedCars")["vehicleEngine"]["engineDisplacement"])\ 87 | .withColumn("transmission", col("usedCars")["vehicleTransmission"])\ 88 | .withColumn("price", col("usedCars")["price"]) \ 89 | .drop("usedCars") 90 | df_clean.write.mode("overwrite").parquet(parquet_path) 91 | print("Parquet file is read.") 92 | 93 | # COMMAND ---------- 94 | 95 | print('File loaded to DBFS ' + 
dbutils.widgets.get("DBFS_DATASET_LOCATION")) 96 | -------------------------------------------------------------------------------- /SETUP/_setup_database.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | print('Creating database DB_DEMO...') 3 | 4 | # COMMAND ---------- 5 | 6 | # MAGIC %sql 7 | # MAGIC CREATE DATABASE IF NOT EXISTS DB_DEMO; 8 | # MAGIC USE DB_DEMO; 9 | 10 | # COMMAND ---------- 11 | 12 | print('Database DB_DEMO created successfully.') 13 | -------------------------------------------------------------------------------- /SETUP/_setup_demo_table.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %sql 3 | # MAGIC CREATE TABLE IF NOT EXISTS club ( 4 | # MAGIC club_id VARCHAR(10), 5 | # MAGIC club_name VARCHAR(50) 6 | # MAGIC ); 7 | # MAGIC 8 | # MAGIC 9 | # MAGIC CREATE TABLE IF NOT EXISTS department ( 10 | # MAGIC dept_id VARCHAR(10), 11 | # MAGIC dept_name VARCHAR(50) 12 | # MAGIC ); 13 | # MAGIC 14 | # MAGIC CREATE TABLE IF NOT EXISTS employee 15 | # MAGIC ( 16 | # MAGIC empcode VARCHAR(10), 17 | # MAGIC firstname VARCHAR(50), 18 | # MAGIC lastname VARCHAR(50), 19 | # MAGIC dept_id VARCHAR(10), 20 | # MAGIC club_id VARCHAR(10) 21 | # MAGIC ); 22 | # MAGIC 23 | # MAGIC CREATE TABLE meal 24 | # MAGIC ( 25 | # MAGIC meal_id VARCHAR(10), 26 | # MAGIC meal_name VARCHAR(50) 27 | # MAGIC ); 28 | # MAGIC CREATE TABLE drink 29 | # MAGIC ( 30 | # MAGIC drink_id VARCHAR(10), 31 | # MAGIC drink_name VARCHAR(50) 32 | # MAGIC ); 33 | # MAGIC 34 | # MAGIC CREATE TABLE emp_salary 35 | # MAGIC ( 36 | # MAGIC empcode VARCHAR(10), 37 | # MAGIC basic_salary DECIMAL(10, 2), 38 | # MAGIC transport DECIMAL(10, 2), 39 | # MAGIC accomodation DECIMAL(10, 2), 40 | # MAGIC food DECIMAL(10, 2), 41 | # MAGIC extra DECIMAL(10, 2) 42 | # MAGIC ); 43 | # MAGIC CREATE TABLE supplier_india 44 | # MAGIC ( 45 | # MAGIC supp_id VARCHAR(10), 46 | # MAGIC supp_name VARCHAR(50), 47 | # MAGIC city VARCHAR(50) 48 | # MAGIC ); 49 | # MAGIC CREATE TABLE supplier_nepal 50 | # MAGIC ( 51 | # MAGIC supp_id VARCHAR(10), 52 | # MAGIC supp_name VARCHAR(50), 53 | # MAGIC city VARCHAR(50) 54 | # MAGIC ); 55 | 56 | # COMMAND ---------- 57 | 58 | # MAGIC %python 59 | # MAGIC print('Preparing tables...') 60 | # MAGIC print('Success: table club created') 61 | # MAGIC print('Success: table department created') 62 | # MAGIC print('Success: table employee created') 63 | # MAGIC print('Success: table meal created') 64 | # MAGIC print('Success: table drink created') 65 | # MAGIC print('Success: table emp_salary created') 66 | # MAGIC print('Success: table supplier_india created') 67 | # MAGIC print('Success: table supplier_nepal created') 68 | 69 | # COMMAND ---------- 70 | 71 | # MAGIC %sql 72 | # MAGIC TRUNCATE TABLE club; 73 | # MAGIC TRUNCATE TABLE department; 74 | # MAGIC TRUNCATE TABLE employee; 75 | # MAGIC TRUNCATE TABLE meal; 76 | # MAGIC TRUNCATE TABLE drink; 77 | # MAGIC TRUNCATE TABLE emp_salary; 78 | # MAGIC TRUNCATE TABLE supplier_india; 79 | # MAGIC TRUNCATE TABLE supplier_nepal; 80 | 81 | # COMMAND ---------- 82 | 83 | # MAGIC %python 84 | # MAGIC print('Success: table club truncated') 85 | # MAGIC print('Success: table department truncated') 86 | # MAGIC print('Success: table employee truncated') 87 | # MAGIC print('Success: table meal truncated') 88 | # MAGIC print('Success: table drink truncated') 89 | # MAGIC print('Success: table emp_salary truncated') 90 | # MAGIC 
print('Success: table supplier_india truncated') 91 | # MAGIC print('Success: table supplier_nepal truncated') 92 | 93 | # COMMAND ---------- 94 | 95 | # MAGIC %sql 96 | # MAGIC INSERT INTO club 97 | # MAGIC (club_id, club_name) 98 | # MAGIC VALUES 99 | # MAGIC ('C1', 'Cricket'), 100 | # MAGIC ('C2', 'Football'), 101 | # MAGIC ('C3', 'Golf'), 102 | # MAGIC ('C4', 'Wildlife & Nature'), 103 | # MAGIC ('C5', 'Photography'), 104 | # MAGIC ('C6', 'Art & Music'); 105 | # MAGIC 106 | # MAGIC INSERT INTO department 107 | # MAGIC (dept_id, dept_name) 108 | # MAGIC VALUES 109 | # MAGIC ('DEP001', 'IT'), 110 | # MAGIC ('DEP002', 'Marketing'), 111 | # MAGIC ('DEP003', 'Finance'), 112 | # MAGIC ('DEP004', 'BI'), 113 | # MAGIC ('DEP005', 'Admin'), 114 | # MAGIC ('DEP006', 'HR'); 115 | # MAGIC 116 | # MAGIC INSERT INTO employee 117 | # MAGIC (empcode, firstname, lastname, dept_id, club_id) 118 | # MAGIC VALUES 119 | # MAGIC ('EMP001', 'Albert', 'Einstein', 'DEP001', 'C1'), 120 | # MAGIC ('EMP002', 'Isaac', 'Newton', 'DEP001', 'C1'), 121 | # MAGIC ('EMP003', 'Elvis', 'Bose', 'DEP001', 'C2'), 122 | # MAGIC ('EMP004', 'Jose', 'Baldwin', 'DEP001', 'C3'), 123 | # MAGIC ('EMP005', 'Christian', 'Baldwin', 'DEP002', 'C1'), 124 | # MAGIC ('EMP006', 'Stephenie', 'Margarete', 'DEP002', 'C3'), 125 | # MAGIC ('EMP007', 'P.K', 'Chand', 'DEP003', 'C1'), 126 | # MAGIC ('EMP008', 'Eric', 'Clapton', 'DEP004', 'C6'), 127 | # MAGIC ('EMP009', 'Eric', 'Jhonson', 'DEP001', 'C9'), 128 | # MAGIC ('EMP010', 'Martand', 'Singh', 'DEP010', 'C3'), 129 | # MAGIC ('EMP011', 'Rajiv', 'Singh', 'DEP0010', 'C31'), 130 | # MAGIC ('EMP012', 'Jose', 'Peter', 'DEP0011', 'C1'); 131 | # MAGIC 132 | # MAGIC INSERT INTO meal 133 | # MAGIC (meal_id, meal_name) 134 | # MAGIC VALUES 135 | # MAGIC ('M001', 'Pizza'), 136 | # MAGIC ('M002', 'Burger'), 137 | # MAGIC ('M003', 'Sandwich'), 138 | # MAGIC ('M004', 'Pasta'); 139 | # MAGIC 140 | # MAGIC INSERT INTO drink 141 | # MAGIC (drink_id, drink_name) 142 | # MAGIC VALUES 143 | # MAGIC ('D001', 'Coke'), 144 | # MAGIC ('D002', 'Pepsi'), 145 | # MAGIC ('D003', 'Beer'), 146 | # MAGIC ('D004', 'Water'); 147 | # MAGIC 148 | # MAGIC INSERT INTO emp_salary 149 | # MAGIC (empcode, basic_salary, transport, accomodation, food , extra) 150 | # MAGIC VALUES 151 | # MAGIC ('EMP001', 25000.99, 2000, 3000.99, 4500, 4500), 152 | # MAGIC ('EMP002', 35000.99, 1200, 3000.99, 3500, 4500), 153 | # MAGIC ('EMP003', 45000.99, 2500.99, 3000, 3560, 4500), 154 | # MAGIC ('EMP004', 15000.99, 2670.50, 3500, 3580, 7500), 155 | # MAGIC ('EMP005', 25600.99, 2120.50, 3000, 3589, 6500), 156 | # MAGIC ('EMP006', 67000.99, 2760, 4000, 3590, 5500), 157 | # MAGIC ('EMP007', 89000.99, 2000, 4000, 3511, 5500); 158 | # MAGIC 159 | # MAGIC INSERT INTO supplier_india 160 | # MAGIC (supp_id, supp_name, city) 161 | # MAGIC VALUES 162 | # MAGIC ('SI001', 'Martand Singh', 'New Delhi'), 163 | # MAGIC ('SI002', 'Gaurav Chandawani', 'Mumbai'), 164 | # MAGIC ('SI003', 'Shweta Gupta', 'U.P'), 165 | # MAGIC ('SI004', 'Naresh Chawla', 'Punjab'); 166 | # MAGIC 167 | # MAGIC INSERT INTO supplier_nepal 168 | # MAGIC (supp_id, supp_name, city) 169 | # MAGIC VALUES 170 | # MAGIC ('SN001', 'Himal Gurung', 'Kathmandu'), 171 | # MAGIC ('SN002', 'Naina Shah', 'Pokhara'), 172 | # MAGIC ('SN003', 'Vicky Magar', 'Gorkha'), 173 | # MAGIC ('SN004', 'Martand Singh', 'Surkhet'), 174 | # MAGIC ('SN005', 'Gaurav Chandawani', 'Thankot'), 175 | # MAGIC ('SN006', 'Barkha Tiwari', 'Butwal'); 176 | 177 | # COMMAND ---------- 178 | 179 | 180 | 181 | # COMMAND ---------- 182 | 
183 | print('Success: demo tables are ready.') 184 | -------------------------------------------------------------------------------- /SQL Refresher/PS000-INTRODUCTION.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | 3 | -------------------------------------------------------------------------------- /SQL Refresher/SR000-Introduction.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Introduction 4 | # MAGIC This course is the first installment of databricks data engineering course. In this course you will learn basic SQL concept which include: 5 | # MAGIC 1. Create, Select, Update, Delete tables 6 | # MAGIC 1. Create database 7 | # MAGIC 1. Filtering data 8 | # MAGIC 1. Group by & aggregation 9 | # MAGIC 1. Ordering 10 | # MAGIC 1. SQL joins 11 | # MAGIC 1. Common table expression (CTE) 12 | # MAGIC 1. External tables 13 | # MAGIC 1. Sub queries 14 | # MAGIC 1. Views & temp views 15 | # MAGIC 1. UNION, INTERSECT, EXCEPT keywords 16 | # MAGIC 17 | # MAGIC you can download all the notebook from our 18 | # MAGIC 19 | # MAGIC github repo: https://github.com/martandsingh/ApacheSpark 20 | # MAGIC 21 | # MAGIC facebook: https://www.facebook.com/codemakerz 22 | # MAGIC 23 | # MAGIC email: martandsays@gmail.com 24 | # MAGIC 25 | # MAGIC ### SETUP folder 26 | # MAGIC you will see initial_setup & clean_up notebooks called in every notebooks. It is mandatory to run both the scripts in defined order. initial script will create all the mandatory tables & database for the demo. After you finish your notebook, execute clean up notebook, it will clean all the db objects. 27 | # MAGIC 28 | # MAGIC ![SQL](https://raw.githubusercontent.com/martandsingh/images/master/sql.png) 29 | 30 | # COMMAND ---------- 31 | 32 | # MAGIC % 33 | -------------------------------------------------------------------------------- /SQL Refresher/SR001-Basic CRUD.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### What is CRUD? 4 | # MAGIC CRUD stands for CREATE/RETRIEVE/UPDATE/DELETE. In this demo we will see how can we create a table and perform basic operations on it. 5 | # MAGIC 6 | # MAGIC ![CRUD](https://raw.githubusercontent.com/martandsingh/images/master/crud.png) 7 | 8 | # COMMAND ---------- 9 | 10 | # MAGIC %sql 11 | # MAGIC -- CREATE A TABLE. Below command will create student table with three columns. IF NOT EXISTS clause will create table only if it does not exists, if you have table then it will ignore the statement. 12 | # MAGIC CREATE TABLE IF NOT EXISTS students( 13 | # MAGIC student_id VARCHAR(10), 14 | # MAGIC student_name VARCHAR(50), 15 | # MAGIC course VARCHAR(50) 16 | # MAGIC ); 17 | 18 | # COMMAND ---------- 19 | 20 | # MAGIC %sql 21 | # MAGIC -- check table details. it will return you table information like column name, datatypes 22 | # MAGIC DESCRIBE students 23 | 24 | # COMMAND ---------- 25 | 26 | # MAGIC %sql 27 | # MAGIC -- Let's add some data. 
28 | # MAGIC INSERT INTO students 29 | # MAGIC (student_id, student_name, course) 30 | # MAGIC VALUES 31 | # MAGIC ('ST001', 'ABC', 'BBA'), 32 | # MAGIC ('ST002', 'XYZ', 'MBA'), 33 | # MAGIC ('ST003', 'PQR', 'BCA'); 34 | # MAGIC --above statement will insert three records to student tables 35 | 36 | # COMMAND ---------- 37 | 38 | # MAGIC %sql 39 | # MAGIC -- query students table 40 | # MAGIC SELECT * FROM students; 41 | 42 | # COMMAND ---------- 43 | 44 | # MAGIC %sql 45 | # MAGIC -- now lets update cours for student_id ST003. The student wants to change his/her course in the middle of semester. Now we have to update course in the student table. 46 | # MAGIC UPDATE students 47 | # MAGIC SET course = 'B.Tech' 48 | # MAGIC WHERE student_id = 'ST003' 49 | 50 | # COMMAND ---------- 51 | 52 | # MAGIC %sql 53 | # MAGIC SELECT * FROM students 54 | # MAGIC -- course changed. 55 | 56 | # COMMAND ---------- 57 | 58 | # MAGIC %sql 59 | # MAGIC -- lets delete ST003 as he was not happy with the college, he decided to move to another college. So we have to remove his name from the table. 60 | # MAGIC DELETE FROM students WHERE student_id = 'ST003' 61 | 62 | # COMMAND ---------- 63 | 64 | # MAGIC %sql 65 | # MAGIC SELECT * FROM students 66 | 67 | # COMMAND ---------- 68 | 69 | # This was a very basic CRUD demo to give you a quick start. You will more detailed queries in further demos. So keep going... You can do it. 70 | 71 | # COMMAND ---------- 72 | 73 | 74 | -------------------------------------------------------------------------------- /SQL Refresher/SR002-Select & Filtering.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### What is filtering? 4 | # MAGIC Selecting a subset of data based on some business logic is called filtering. 5 | # MAGIC e.g. You have data for multiple countries, then you may want to select data only for one particular country or city or both. 6 | # MAGIC 7 | # MAGIC ![Filtering](https://raw.githubusercontent.com/martandsingh/images/master/filtering.png) 8 | 9 | # COMMAND ---------- 10 | 11 | # MAGIC %run ../SETUP/_initial_setup 12 | 13 | # COMMAND ---------- 14 | 15 | # MAGIC %sql 16 | # MAGIC -- * astrix will select all the columns and rows. As a big data engineer, you should avoid this because in real life scenario your table will have billions of records which you do not want to fetch frequently. 17 | # MAGIC SELECT * FROM club; 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %sql 22 | # MAGIC SELECT * FROM department; 23 | 24 | # COMMAND ---------- 25 | 26 | # MAGIC %sql 27 | # MAGIC SELECT * FROM employee; 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %sql -- Projection: select only few columns. this is a prefferd practice. Only select the columns which you need. It will optimize your query. 32 | # MAGIC SELECT 33 | # MAGIC firstname, 34 | # MAGIC lastname, 35 | # MAGIC dept_id 36 | # MAGIC FROM 37 | # MAGIC employee; 38 | 39 | # COMMAND ---------- 40 | 41 | # MAGIC %sql 42 | # MAGIC -- SELECT top 5 records. 
43 | # MAGIC SELECT 44 | # MAGIC firstname, 45 | # MAGIC lastname, 46 | # MAGIC dept_id 47 | # MAGIC FROM 48 | # MAGIC employee 49 | # MAGIC LIMIT 50 | # MAGIC 5; 51 | 52 | # COMMAND ---------- 53 | 54 | # MAGIC %sql 55 | # MAGIC -- apply filters using WHERE keyword choose all the employee of department DEP001 56 | # MAGIC SELECT 57 | # MAGIC * 58 | # MAGIC FROM 59 | # MAGIC employee 60 | # MAGIC WHERE 61 | # MAGIC dept_id = 'DEP001' 62 | 63 | # COMMAND ---------- 64 | 65 | # MAGIC %sql 66 | # MAGIC -- apply filters using WHERE keyword choose all the employee of department DEP001 & club C1 67 | # MAGIC SELECT 68 | # MAGIC * 69 | # MAGIC FROM 70 | # MAGIC employee 71 | # MAGIC WHERE 72 | # MAGIC dept_id = 'DEP001' 73 | # MAGIC AND club_id = 'C1' 74 | 75 | # COMMAND ---------- 76 | 77 | # MAGIC %sql 78 | # MAGIC -- Find all the employees from club C1, C2 & C3 79 | # MAGIC SELECT * FROM employee 80 | # MAGIC WHERE club_id IN ('C1', 'C2', 'C3') 81 | 82 | # COMMAND ---------- 83 | 84 | # MAGIC %sql 85 | # MAGIC -- Find all the employees which are not in club C1, C2 & C3 86 | # MAGIC SELECT * FROM employee 87 | # MAGIC WHERE club_id NOT IN ('C1', 'C2', 'C3') 88 | 89 | # COMMAND ---------- 90 | 91 | # MAGIC %run ../SETUP/_clean_up 92 | 93 | # COMMAND ---------- 94 | 95 | 96 | -------------------------------------------------------------------------------- /SQL Refresher/SR003-JOINS.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### What are Joins? 4 | # MAGIC Joins are used to combine two or more tables based on one or more column. This is used to select data from multiple table. 5 | # MAGIC 6 | # MAGIC ### Types of join 7 | # MAGIC There are 4 major kind of joins: 8 | # MAGIC 1. Inner join 9 | # MAGIC 1. Left outer join 10 | # MAGIC 1. Right outer join 11 | # MAGIC 1. Full outer join 12 | # MAGIC 1. Cross join 13 | # MAGIC ![SQL_JOIN](https://raw.githubusercontent.com/martandsingh/images/master/join_demo.png) 14 | 15 | # COMMAND ---------- 16 | 17 | # MAGIC %run ../SETUP/_initial_setup 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %md 22 | # MAGIC ### Database & table details 23 | # MAGIC _initial_setup command will run our setup notebook where we are creating DB_DEMO database & 3 table (employee, department, club). employee table include employee details including department id(dept_id) & club id (club_id) which are the Foreign key related with department & club table respectively. Below entity diagram shows the relation between tables. 24 | # MAGIC 25 | # MAGIC 26 | # MAGIC ![my_test_image](https://raw.githubusercontent.com/martandsingh/images/master/entity_diag.png) 27 | 28 | # COMMAND ---------- 29 | 30 | # MAGIC %md 31 | 32 | # COMMAND ---------- 33 | 34 | # MAGIC %md 35 | # MAGIC #### INNER JOIN 36 | # MAGIC Returns rows that have matching values in both the table (LEFT table, RIGHT table). Left table is the one mentioned before JOIN clause & RIGHT table is the one mentioned after JOIN clause. 37 | # MAGIC 38 | # MAGIC Syntax: 39 | # MAGIC 40 | # MAGIC SELECT A.col, B.col 41 | # MAGIC 42 | # MAGIC FROM {LEFT_TABLE} A 43 | # MAGIC 44 | # MAGIC INNER {JOIN RIGHT_TABLE} B 45 | # MAGIC 46 | # MAGIC ON A.{col} = B.{col} 47 | 48 | # COMMAND ---------- 49 | 50 | # MAGIC %sql 51 | # MAGIC select * from employee 52 | 53 | # COMMAND ---------- 54 | 55 | # MAGIC %sql -- We can see, we have department_id in our employee table which tell us about department of the employee. 
But what if we want department 56 | # MAGIC -- name instead? We have to put an inner join employee table with department table based on dept id 57 | # MAGIC SELECT 58 | # MAGIC E.firstname, E.lastname, D.dept_name AS department 59 | # MAGIC FROM 60 | # MAGIC employee E 61 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 62 | # MAGIC 63 | 64 | # COMMAND ---------- 65 | 66 | # MAGIC %sql 67 | # MAGIC SELECT 68 | # MAGIC E.firstname, E.lastname, D.dept_name AS department 69 | # MAGIC FROM 70 | # MAGIC employee E 71 | # MAGIC LEFT JOIN department D ON E.dept_id = D.dept_id 72 | 73 | # COMMAND ---------- 74 | 75 | # MAGIC %sql 76 | # MAGIC SELECT 77 | # MAGIC E.firstname, E.lastname, D.dept_name AS department 78 | # MAGIC FROM 79 | # MAGIC employee E 80 | # MAGIC RIGHT JOIN department D ON E.dept_id = D.dept_id 81 | 82 | # COMMAND ---------- 83 | 84 | # MAGIC %sql 85 | # MAGIC SELECT 86 | # MAGIC E.firstname, E.lastname, D.dept_name AS department 87 | # MAGIC FROM 88 | # MAGIC employee E 89 | # MAGIC FULL JOIN department D ON E.dept_id = D.dept_id 90 | 91 | # COMMAND ---------- 92 | 93 | # MAGIC %sql 94 | # MAGIC SELECT M.meal_name, D.drink_name 95 | # MAGIC FROM meal M CROSS JOIN drink D 96 | 97 | # COMMAND ---------- 98 | 99 | # MAGIC %run ../SETUP/_clean_up 100 | 101 | # COMMAND ---------- 102 | 103 | 104 | -------------------------------------------------------------------------------- /SQL Refresher/SR004-Order & Grouping.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### What is Grouping? 4 | # MAGIC The GROUP BY statement groups rows that have the same values into summary rows, like "find the number of customers in each country". The GROUP BY statement is often used with aggregate functions ( COUNT() , MAX() , MIN() , SUM() , AVG() ) to group the result-set by one or more columns. 5 | # MAGIC 6 | # MAGIC Keyword: GROUP BY 7 | # MAGIC 8 | # MAGIC ### What is ordering? 9 | # MAGIC The SQL ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. Some databases sort the query results in an ascending order by default. 
10 | # MAGIC 11 | # MAGIC Keyword: ORDER BY 12 | # MAGIC 13 | # MAGIC ![Grouping](https://raw.githubusercontent.com/martandsingh/images/master/grouping.png) 14 | 15 | # COMMAND ---------- 16 | 17 | # MAGIC %run ../SETUP/_initial_setup 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %sql 22 | # MAGIC SELECT * FROM club 23 | 24 | # COMMAND ---------- 25 | 26 | # MAGIC %sql -- Let's caculate number of employees for each club 27 | # MAGIC SELECT 28 | # MAGIC club_id, 29 | # MAGIC COUNT(1) AS total_members 30 | # MAGIC FROM 31 | # MAGIC employee 32 | # MAGIC GROUP BY 33 | # MAGIC club_id 34 | 35 | # COMMAND ---------- 36 | 37 | # MAGIC %sql -- Let's caculate number of employees for each club and sort the result in DECREASING order of total_members 38 | # MAGIC SELECT 39 | # MAGIC club_id, 40 | # MAGIC COUNT(1) AS total_members 41 | # MAGIC FROM 42 | # MAGIC employee 43 | # MAGIC GROUP BY 44 | # MAGIC club_id 45 | # MAGIC ORDER BY 46 | # MAGIC total_members DESC 47 | 48 | # COMMAND ---------- 49 | 50 | # MAGIC %sql -- Let's caculate number of employees for each club and sort the result in INCREASING order of total_members 51 | # MAGIC SELECT 52 | # MAGIC club_id, 53 | # MAGIC COUNT(1) AS total_members 54 | # MAGIC FROM 55 | # MAGIC employee 56 | # MAGIC GROUP BY 57 | # MAGIC club_id 58 | # MAGIC ORDER BY 59 | # MAGIC total_members 60 | 61 | # COMMAND ---------- 62 | 63 | # MAGIC %md 64 | # MAGIC Above query does not seems perfect, As our user does not know what is C1, C2... so on. We need to specify the name of the club. For that we have to perform an inner join with club table. 65 | # MAGIC Lets find out name of the club and total members. You have to select only club which has more than one member. Arrange them in decreasing order of total members. 66 | 67 | # COMMAND ---------- 68 | 69 | # MAGIC %sql 70 | # MAGIC SELECT 71 | # MAGIC C.club_name, 72 | # MAGIC COUNT(1) AS total_members 73 | # MAGIC FROM 74 | # MAGIC employee E 75 | # MAGIC INNER JOIN club C ON E.club_id = C.club_id 76 | # MAGIC GROUP BY 77 | # MAGIC C.club_name 78 | # MAGIC HAVING 79 | # MAGIC total_members > 1 80 | # MAGIC ORDER BY 81 | # MAGIC total_members DESC 82 | 83 | # COMMAND ---------- 84 | 85 | # MAGIC %md 86 | # MAGIC We can group & order our resultset based on more than one column. lets group our data based on club & department. 87 | 88 | # COMMAND ---------- 89 | 90 | # MAGIC %sql 91 | # MAGIC SELECT 92 | # MAGIC D.dept_name AS Department, 93 | # MAGIC C.club_name AS Club, 94 | # MAGIC COUNT(1) AS total_members 95 | # MAGIC FROM 96 | # MAGIC employee E 97 | # MAGIC INNER JOIN club C ON E.club_id = C.club_id 98 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 99 | # MAGIC GROUP BY 100 | # MAGIC D.dept_name, 101 | # MAGIC C.club_name 102 | # MAGIC ORDER BY 103 | # MAGIC total_members -- so now you can see, there are multiple rows for marketing & IT. As these two department has members from different club. 104 | 105 | # COMMAND ---------- 106 | 107 | # MAGIC %md 108 | # MAGIC Most of the time we use aggregation functions with grouped data. Lets calculate average basic salary of each department. 
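The next cells do this in SQL. The same department-wise average can also be written with the DataFrame API; a sketch is shown below, assuming the DB_DEMO tables created by the setup notebook are available in the current schema:

```python
from pyspark.sql.functions import avg, round as round_

emp = spark.table("employee")
sal = spark.table("emp_salary")
dept = spark.table("department")

avg_salary = (emp.join(sal, "empcode")      # INNER JOIN emp_salary ON empcode
                 .join(dept, "dept_id")     # INNER JOIN department ON dept_id
                 .groupBy("dept_name")
                 .agg(round_(avg("basic_salary"), 2).alias("AVG_BASIC_SALARY"))
                 .orderBy("AVG_BASIC_SALARY"))
display(avg_salary)
```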
109 | 110 | # COMMAND ---------- 111 | 112 | # MAGIC %sql 113 | # MAGIC SELECT 114 | # MAGIC * 115 | # MAGIC FROM 116 | # MAGIC employee 117 | # MAGIC order by 118 | # MAGIC empcode 119 | 120 | # COMMAND ---------- 121 | 122 | # MAGIC %sql 123 | # MAGIC SELECT 124 | # MAGIC D.dept_name, 125 | # MAGIC ROUND(AVG(ES.basic_salary), 2) AS AVG_BASIC_SALARY 126 | # MAGIC FROM 127 | # MAGIC employee E 128 | # MAGIC INNER JOIN emp_salary ES ON E.empcode = ES.empcode 129 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 130 | # MAGIC GROUP BY 131 | # MAGIC D.dept_name 132 | # MAGIC ORDER BY 133 | # MAGIC AVG_BASIC_SALARY 134 | 135 | # COMMAND ---------- 136 | 137 | # MAGIC %md 138 | # MAGIC Our finance team is now wants to downsize the company(not a good news for employees... :( ). They want you to calculate total salary distributed by each department. 139 | 140 | # COMMAND ---------- 141 | 142 | # MAGIC %sql 143 | # MAGIC SELECT 144 | # MAGIC D.dept_name, 145 | # MAGIC ROUND(SUM(ES.basic_salary), 2) AS TOTAL_BASIC_SALARY 146 | # MAGIC FROM 147 | # MAGIC employee E 148 | # MAGIC INNER JOIN emp_salary ES ON E.empcode = ES.empcode 149 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 150 | # MAGIC GROUP BY 151 | # MAGIC D.dept_name 152 | # MAGIC ORDER BY 153 | # MAGIC TOTAL_BASIC_SALARY 154 | 155 | # COMMAND ---------- 156 | 157 | # MAGIC %run ../SETUP/_clean_up 158 | 159 | # COMMAND ---------- 160 | 161 | 162 | -------------------------------------------------------------------------------- /SQL Refresher/SR005-Sub Queries.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### Whats is subquery? 4 | # MAGIC A subquery is a SQL query nested inside a larger query. The subquery can be nested inside a SELECT, INSERT, UPDATE, or DELETE statement or inside another subquery. A subquery is usually added within the WHERE Clause of another SQL SELECT statement. 5 | # MAGIC 6 | # MAGIC A subquery can be used anywhere an expression is allowed. A subquery is also called an inner query or inner select, while the statement containing a subquery is also called an outer query or outer select. 7 | # MAGIC 8 | # MAGIC In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. For architectural information on how SQL Server processes queries, see SQL statement processing.However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results. 9 | # MAGIC 10 | # MAGIC 11 | # MAGIC *Note: there are multiple ways to write same query. You have to select the best way. This notebook is specifically for discussing sub queries, so some queries may not make sense but that is just for example. I want to show you multiple ways of generating same result. Later in this series we will have a specific notebook to talk about query execution order & optimization. 
There we will talk in detail about performance.* 12 | # MAGIC 13 | # MAGIC ![Subquery](https://raw.githubusercontent.com/martandsingh/images/master/subquery.jpg) 14 | 15 | # COMMAND ---------- 16 | 17 | # MAGIC %run ../SETUP/_initial_setup 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %md 22 | # MAGIC -- lets say we want to get all the employees who are the member of existing club(the club exists in club table). There are two ways to achieve this: 23 | # MAGIC 1. Inner Join between employee & club based on club_id 24 | # MAGIC 1. Using sub query 25 | 26 | # COMMAND ---------- 27 | 28 | # MAGIC %sql -- Lets use join first. This will return full name of all the employees who are member of a valid club(existing club). 29 | # MAGIC SELECT 30 | # MAGIC concat(E.firstname, ' ', E.lastname) AS FullName 31 | # MAGIC FROM 32 | # MAGIC employee E 33 | # MAGIC INNER JOIN club C ON E.club_id = C.club_id 34 | # MAGIC ORDER BY 35 | # MAGIC FullName 36 | 37 | # COMMAND ---------- 38 | 39 | # MAGIC %sql -- other way to get same result using sub query or inner query. Below query will return exactly same result as above. The query inside the paranthesis (SELECT club_id FROM club) is your sub query. First this query is executing and providing a resultset which later will be used in WHERE condition for the parent query. 40 | # MAGIC SELECT 41 | # MAGIC concat(E.firstname, ' ', E.lastname) AS FullName 42 | # MAGIC FROM 43 | # MAGIC employee E 44 | # MAGIC WHERE 45 | # MAGIC club_id IN ( 46 | # MAGIC SELECT 47 | # MAGIC club_id 48 | # MAGIC FROM 49 | # MAGIC club 50 | # MAGIC ) 51 | # MAGIC ORDER BY 52 | # MAGIC FullName 53 | 54 | # COMMAND ---------- 55 | 56 | # MAGIC %md 57 | # MAGIC Do not use sub queries blindly, sometimes it is not efficient to use sub queries. As we mentioned earlier, In case of existence check we should prefer JOINS over sub queries. You can compare the execution time of both the queries. We have a very small set of data, which may not show you a significat difference between queries. In real life case where you deal with GB, TB of data, the difference can be huge. 58 | # MAGIC 59 | # MAGIC Let's take one more example. We have to find out average basic salary of IT department. The output resultset must return only one column which is avg salary for IT department. Let's do this task with inner join & sub query. 
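Before running them, here is a quick sketch of how the two variants can be compared: wrap each statement in spark.sql() and call .explain() to see whether the optimizer produces different physical plans. On this tiny demo dataset the timings themselves will not tell you much, but the plans show what Spark actually does.

```python
# Compare the plans Spark generates for the two variants (same queries as the cells below).
join_version = """
    SELECT ROUND(AVG(ES.basic_salary), 2) AS AVG_BASIC_SALARY
    FROM employee E
    INNER JOIN department D ON E.dept_id = D.dept_id
    INNER JOIN emp_salary ES ON E.empcode = ES.empcode
    GROUP BY D.dept_name
    HAVING D.dept_name = 'IT'
"""
subquery_version = """
    SELECT ROUND(AVG(ES.basic_salary), 2) AS AVG_BASIC_SALARY
    FROM employee E
    INNER JOIN emp_salary ES ON E.empcode = ES.empcode
    WHERE E.dept_id = (SELECT dept_id FROM department WHERE dept_name = 'IT')
"""
spark.sql(join_version).explain()
spark.sql(subquery_version).explain()
```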
60 | 61 | # COMMAND ---------- 62 | 63 | # MAGIC %sql 64 | # MAGIC SELECT 65 | # MAGIC ES.basic_salary 66 | # MAGIC FROM 67 | # MAGIC employee E 68 | # MAGIC INNER JOIN emp_salary ES ON E.empcode = ES.empcode 69 | # MAGIC WHERE 70 | # MAGIC dept_id = 'DEP001' 71 | 72 | # COMMAND ---------- 73 | 74 | # MAGIC %sql -- INNER JOIN 75 | # MAGIC SELECT 76 | # MAGIC ROUND(AVG(ES.basic_salary), 2) AS AVG_BASIC_SALARY 77 | # MAGIC from 78 | # MAGIC employee E 79 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 80 | # MAGIC INNER JOIN emp_salary ES ON E.empcode = ES.empcode 81 | # MAGIC GROUP BY 82 | # MAGIC D.dept_name 83 | # MAGIC HAVING 84 | # MAGIC D.dept_name = 'IT' 85 | 86 | # COMMAND ---------- 87 | 88 | # MAGIC %sql -- Above task using inner query or subquery 89 | # MAGIC SELECT 90 | # MAGIC ROUND(AVG(ES.basic_salary), 2) AS AVG_BASIC_SALARY 91 | # MAGIC from 92 | # MAGIC employee E 93 | # MAGIC INNER JOIN emp_salary ES ON E.empcode = ES.empcode 94 | # MAGIC WHERE 95 | # MAGIC E.dept_id = ( 96 | # MAGIC SELECT 97 | # MAGIC dept_id 98 | # MAGIC FROM 99 | # MAGIC department 100 | # MAGIC WHERE 101 | # MAGIC dept_name = 'IT' 102 | # MAGIC ) 103 | 104 | # COMMAND ---------- 105 | 106 | 107 | 108 | # COMMAND ---------- 109 | 110 | # MAGIC %run ../SETUP/_clean_up 111 | 112 | # COMMAND ---------- 113 | 114 | 115 | -------------------------------------------------------------------------------- /SQL Refresher/SR006-Views & Temp Views.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### What is View? 4 | # MAGIC View is a virtual table based on resultset of a query. A view always gives you latest resultset. In real scenario, if you have a complex query with many joins & sub queries which you want to reuse, you can create a view for that query and use it like a physical table. 5 | # MAGIC 6 | # MAGIC SYNTAX: 7 | # MAGIC 8 | # MAGIC CREATE VIEW IF NOT EXISTS {ViewName} AS 9 | # MAGIC 10 | # MAGIC 11 | # MAGIC SELECT * FROM TABLE 12 | # MAGIC 13 | # MAGIC ### What is Temp View? 14 | # MAGIC TEMPORARY views are session-scoped and is dropped when session ends because it skips persisting the definition in the underlying metastore, if any. GLOBAL TEMPORARY views are tied to a system preserved temporary schema global_temp. 15 | # MAGIC 16 | # MAGIC SYNTAX: 17 | # MAGIC 18 | # MAGIC CREATE TEMP VIEW IF NOT EXISTS {tempviewname} AS 19 | # MAGIC 20 | # MAGIC SELECT * FROM TABLE 21 | # MAGIC 22 | # MAGIC 23 | # MAGIC ![SQL_JOINS](https://raw.githubusercontent.com/martandsingh/images/master/view-demo.png) 24 | 25 | # COMMAND ---------- 26 | 27 | # MAGIC %run ../SETUP/_initial_setup 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %sql --Lets say we want to find all the the employee with their department & club name. For that we have to write a complex query with multiple join. 32 | # MAGIC SELECT 33 | # MAGIC E.firstname, 34 | # MAGIC E.lastname, 35 | # MAGIC D.dept_name AS Department, 36 | # MAGIC C.club_name AS Club 37 | # MAGIC FROM 38 | # MAGIC employee E 39 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 40 | # MAGIC INNER JOIN club C ON E.club_id = C.club_id 41 | 42 | # COMMAND ---------- 43 | 44 | # MAGIC %md 45 | # MAGIC Now we have a complex query which we may want to reuse that query in multiple procedure. You are in your scrum meeting and find out, you are using the wrong logic. Now you have to make changes in multiple procedure. It will: 46 | # MAGIC 1. 
Waste your time & effort 47 | # MAGIC 1. You may miss some procedures 48 | # MAGIC 49 | # MAGIC So best way to peform this task is to create a view. You can use that view in your procedures. In case of any logic change, now you only have to update your view. As view gives you the latest result, your changes will immediately reflect to all the procedures. tada!!!! 50 | 51 | # COMMAND ---------- 52 | 53 | # MAGIC %sql CREATE VIEW IF NOT EXISTS VW_GET_EMPLOYEES AS 54 | # MAGIC SELECT 55 | # MAGIC E.firstname, 56 | # MAGIC E.lastname, 57 | # MAGIC D.dept_name AS Department, 58 | # MAGIC C.club_name AS Club 59 | # MAGIC FROM 60 | # MAGIC employee E 61 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 62 | # MAGIC INNER JOIN club C ON E.club_id = C.club_id 63 | 64 | # COMMAND ---------- 65 | 66 | # MAGIC %md 67 | # MAGIC Now we have our complex query available as a view VW_GET_EMPLOYEES, which we can use as a regular table. 68 | 69 | # COMMAND ---------- 70 | 71 | # MAGIC %sql 72 | # MAGIC -- it will give you exact same result as our complex query 73 | # MAGIC SELECT * FROM VW_GET_EMPLOYEES 74 | 75 | # COMMAND ---------- 76 | 77 | # MAGIC %sql 78 | # MAGIC -- We can list views using SHOW VIEWS 79 | # MAGIC SHOW VIEWS 80 | 81 | # COMMAND ---------- 82 | 83 | 84 | 85 | # COMMAND ---------- 86 | 87 | # MAGIC %md 88 | # MAGIC Now let's create a temp view. How is it different than a regular view? 89 | # MAGIC 90 | # MAGIC Well a temp view will be available only for your current session. If you restart your session, you will loose your temp view. It will not affect any physical table or data. 91 | 92 | # COMMAND ---------- 93 | 94 | # MAGIC %sql 95 | # MAGIC CREATE OR REPLACE TEMP VIEW TEMP_VW_GET_EMPLOYEES AS 96 | # MAGIC SELECT 97 | # MAGIC E.firstname, 98 | # MAGIC E.lastname, 99 | # MAGIC D.dept_name AS Department, 100 | # MAGIC C.club_name AS Club 101 | # MAGIC FROM 102 | # MAGIC employee E 103 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 104 | # MAGIC INNER JOIN club C ON E.club_id = C.club_id 105 | 106 | # COMMAND ---------- 107 | 108 | # MAGIC %sql 109 | # MAGIC SELECT * FROM TEMP_VW_GET_EMPLOYEES 110 | 111 | # COMMAND ---------- 112 | 113 | # MAGIC %sql 114 | # MAGIC -- You can drop views using DROP VIEW 115 | # MAGIC DROP VIEW VW_GET_EMPLOYEES 116 | 117 | # COMMAND ---------- 118 | 119 | # MAGIC %run ../SETUP/_clean_up 120 | 121 | # COMMAND ---------- 122 | 123 | 124 | -------------------------------------------------------------------------------- /SQL Refresher/SR007-Common Table Expressions.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### What is common table expression (CTE)? 4 | # MAGIC A Common Table Expression (or CTE) is a feature in several SQL versions to improve the maintainability and readability of an SQL query. 5 | # MAGIC 6 | # MAGIC It is also known as: 7 | # MAGIC 1. Common Table Expression 8 | # MAGIC 1. Subquery Factoring 9 | # MAGIC 1. SQL WITH Clause 10 | # MAGIC 11 | # MAGIC A Common Table Expression (or CTE) is a query you can define within another SQL query. 12 | # MAGIC 13 | # MAGIC ### What is the difference between Subquery & CTE? 14 | # MAGIC If you are new to SQL concept, then it may sound like a subquery (believe me you are totally normal, it happened to me too :p). A CTE also generates a result that contains rows and columns of data. 
The difference is that you can give this result a name, and you can refer to it multiple times within your main query. So in other words, CTE provides you a named resultset. 15 | # MAGIC 16 | # MAGIC You can use a CTE in: 17 | # MAGIC 1. SELECT 18 | # MAGIC 1. INSERT 19 | # MAGIC 1. UPDATE 20 | # MAGIC 1. DELETE 21 | # MAGIC 22 | # MAGIC 23 | # MAGIC ![cte](https://raw.githubusercontent.com/martandsingh/images/master/cte.jpg) 24 | 25 | # COMMAND ---------- 26 | 27 | # MAGIC %run ../SETUP/_initial_setup 28 | 29 | # COMMAND ---------- 30 | 31 | # MAGIC %sql WITH cte_department_employee_count AS( 32 | # MAGIC SELECT 33 | # MAGIC D.dept_name AS Department, 34 | # MAGIC COUNT(1) AS `Total Members` 35 | # MAGIC FROM 36 | # MAGIC employee E 37 | # MAGIC INNER JOIN department D ON E.dept_id = D.dept_id 38 | # MAGIC GROUP BY 39 | # MAGIC D.dept_name 40 | # MAGIC ORDER BY 41 | # MAGIC `Total Members` DESC 42 | # MAGIC ) -- so here we can use cte_department_employee_count as our named result set immediately after CTE is defined. 43 | # MAGIC SELECT 44 | # MAGIC * 45 | # MAGIC FROM 46 | # MAGIC cte_department_employee_count; 47 | # MAGIC -- Keep in mind once you reed CTE, it will dissappear & throw error if you try to run it again. 48 | 49 | # COMMAND ---------- 50 | 51 | # MAGIC %md 52 | # MAGIC ### Nested CTE 53 | # MAGIC You can use nested CTE if you want to reuse your CTE in another. 54 | # MAGIC 55 | # MAGIC Syntax: 56 | # MAGIC 57 | # MAGIC WITH CTE1 AS ( 58 | # MAGIC 59 | # MAGIC {Query1} 60 | # MAGIC 61 | # MAGIC ), 62 | # MAGIC 63 | # MAGIC CTE2 AS ( 64 | # MAGIC 65 | # MAGIC {QUERY2} 66 | # MAGIC 67 | # MAGIC ) 68 | # MAGIC 69 | # MAGIC SELECT * FROM CTE1; 70 | # MAGIC 71 | # MAGIC SELECT * FROM CTE2; 72 | # MAGIC 73 | # MAGIC let's try it out. 74 | 75 | # COMMAND ---------- 76 | 77 | # MAGIC %sql -- let's create one CTE which will calculate department wise salary. We will use full join so that we can include department which does not exists, this will generate NULL values for AVG salary field. In the second expression we will remove those invalid rows (basic salary with null values). This can be done in a simpler way using a single query but for the sake of tutorial, I am using 2 different CTE to achevie this task, but in real life scenario CTE is not a good way to acheive this. You can simple do this by one group statement & filter (WHERE). You can see in the below query we applied outer join (just for the sake of tutorial, do not think about the business logic here). You can see many invalid rows with dept_name and AVG_BASIC_SALARY as null. Now we will change this query to CTE & using another CTE we will clean these rows. 
78 | # MAGIC --WITH cte_dept_salary AS( 79 | # MAGIC SELECT 80 | # MAGIC D.dept_name, 81 | # MAGIC AVG(ES.basic_salary) AS AVG_BASIC_SALARY 82 | # MAGIC FROM 83 | # MAGIC employee E FULL 84 | # MAGIC JOIN emp_salary ES ON E.empcode = ES.empcode FULL 85 | # MAGIC JOIN department D ON E.dept_id = D.dept_id 86 | # MAGIC GROUP BY 87 | # MAGIC D.dept_name --) 88 | 89 | # COMMAND ---------- 90 | 91 | # MAGIC %sql WITH cte_avg_salary AS ( 92 | # MAGIC SELECT 93 | # MAGIC D.dept_name, 94 | # MAGIC AVG(ES.basic_salary) AS AVG_BASIC_SALARY 95 | # MAGIC FROM 96 | # MAGIC employee E FULL 97 | # MAGIC JOIN emp_salary ES ON E.empcode = ES.empcode FULL 98 | # MAGIC JOIN department D ON E.dept_id = D.dept_id 99 | # MAGIC GROUP BY 100 | # MAGIC D.dept_name 101 | # MAGIC ), 102 | # MAGIC cte_avg_salary_clean AS ( 103 | # MAGIC SELECT 104 | # MAGIC * 105 | # MAGIC FROM 106 | # MAGIC cte_avg_salary 107 | # MAGIC WHERE 108 | # MAGIC dept_name IS NOT NULL 109 | # MAGIC AND AVG_BASIC_SALARY IS NOT NULL 110 | # MAGIC ) 111 | # MAGIC SELECT 112 | # MAGIC * 113 | # MAGIC FROM 114 | # MAGIC cte_avg_salary_clean; 115 | 116 | # COMMAND ---------- 117 | 118 | # MAGIC %md 119 | # MAGIC ### CTE with View 120 | # MAGIC You can also create view using your CTE. 121 | # MAGIC 122 | # MAGIC Syntax: 123 | # MAGIC 124 | # MAGIC CREATE OR REPLACE VIEW {ViewName} AS 125 | # MAGIC 126 | # MAGIC WITH CTE AS ( 127 | # MAGIC 128 | # MAGIC {COMPLEX QUERY} 129 | # MAGIC 130 | # MAGIC ) 131 | # MAGIC 132 | # MAGIC SELECT * FROM CTE 133 | 134 | # COMMAND ---------- 135 | 136 | # MAGIC %sql -- We can use CTE with views also. Let's use above nested cte in a view 137 | # MAGIC CREATE 138 | # MAGIC OR REPLACE VIEW VW_DEPT_SALARY AS WITH cte_avg_salary AS ( 139 | # MAGIC SELECT 140 | # MAGIC D.dept_name, 141 | # MAGIC AVG(ES.basic_salary) AS AVG_BASIC_SALARY 142 | # MAGIC FROM 143 | # MAGIC employee E FULL 144 | # MAGIC JOIN emp_salary ES ON E.empcode = ES.empcode FULL 145 | # MAGIC JOIN department D ON E.dept_id = D.dept_id 146 | # MAGIC GROUP BY 147 | # MAGIC D.dept_name 148 | # MAGIC ), 149 | # MAGIC cte_avg_salary_clean AS ( 150 | # MAGIC SELECT 151 | # MAGIC dept_name AS Department, 152 | # MAGIC ROUND(AVG_BASIC_SALARY, 2) AS AVG_BASIC_SALARY 153 | # MAGIC FROM 154 | # MAGIC cte_avg_salary 155 | # MAGIC WHERE 156 | # MAGIC dept_name IS NOT NULL 157 | # MAGIC AND AVG_BASIC_SALARY IS NOT NULL 158 | # MAGIC ) 159 | # MAGIC SELECT 160 | # MAGIC * 161 | # MAGIC FROM 162 | # MAGIC cte_avg_salary_clean; 163 | 164 | # COMMAND ---------- 165 | 166 | # MAGIC %sql 167 | # MAGIC SELECT 168 | # MAGIC * 169 | # MAGIC FROM 170 | # MAGIC VW_DEPT_SALARY; 171 | 172 | # COMMAND ---------- 173 | 174 | # MAGIC %run ../SETUP/_clean_up 175 | 176 | # COMMAND ---------- 177 | 178 | 179 | -------------------------------------------------------------------------------- /SQL Refresher/SR008 - EXCEPT, UNION, UNION ALL, INTERSECTION.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | # MAGIC %md 3 | # MAGIC ### UNION 4 | # MAGIC The UNION operator is used to combine the result-set of two or more SELECT statements. It will remove duplicate rows from the final resultset. 5 | # MAGIC 6 | # MAGIC ### UNION ALL 7 | # MAGIC The UNION operator is used to combine the result-set of two or more SELECT statements. It will include duplicate rows in the final resultset. 
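The same distinction exists in the DataFrame API; as a sketch using the two supplier tables queried later in this notebook, DataFrame.union() behaves like UNION ALL, and adding .distinct() gives UNION semantics:

```python
ind = spark.table("supplier_india").select("supp_name")
nep = spark.table("supplier_nepal").select("supp_name")

display(ind.union(nep))             # like UNION ALL: duplicates are kept
display(ind.union(nep).distinct())  # like UNION: duplicates are removed
```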
8 | # MAGIC 9 | # MAGIC ### INTERSECT 10 | # MAGIC The INTERSECT clause in SQL is used to combine two SELECT statements but the dataset returned by the INTERSECT statement will be the intersection of the data-sets of the two SELECT statements. In simple words, the INTERSECT statement will return only those rows which will be common to both of the SELECT statements. 11 | # MAGIC 12 | # MAGIC ### EXCEPT 13 | # MAGIC The SQL EXCEPT operator is used to return all rows in the first SELECT statement that are not returned by the second SELECT statement. 14 | # MAGIC 15 | # MAGIC *to use all these operations, both the table should have same number of columns & same types of columns* 16 | # MAGIC 17 | # MAGIC ![Union_Intersection](https://raw.githubusercontent.com/martandsingh/images/master/union.jpg) 18 | 19 | # COMMAND ---------- 20 | 21 | # MAGIC %run ../SETUP/_initial_setup 22 | 23 | # COMMAND ---------- 24 | 25 | # MAGIC %sql 26 | # MAGIC SELECT * FROM supplier_india 27 | 28 | # COMMAND ---------- 29 | 30 | # MAGIC %sql 31 | # MAGIC SELECT * FROM supplier_nepal 32 | 33 | # COMMAND ---------- 34 | 35 | # MAGIC %md 36 | # MAGIC ### UNION DEMO 37 | 38 | # COMMAND ---------- 39 | 40 | # MAGIC %sql 41 | # MAGIC -- here union will combine our both the dataset excluding duplicates. But you must be wondering why Martand Singh & Gaurav Chadwani are showing twice? 42 | # MAGIC -- The reason behind this is, as you are selecing supp_id & city in your resultset, which makes your row unique than other row. UNION check for duplicate value for the combination of the columns in the resultset. If you remove supp_id & city from query then it will remove duplicate suppliers. Let's try it in the next cell. 43 | # MAGIC 44 | # MAGIC SELECT supp_id, supp_name, city FROM supplier_nepal 45 | # MAGIC UNION 46 | # MAGIC SELECT supp_id, supp_name, city FROM supplier_india 47 | 48 | # COMMAND ---------- 49 | 50 | # MAGIC %sql 51 | # MAGIC -- now you can see thoe tw duplicate suppliers are gone now. This is because we are only selecting supplier_name 52 | # MAGIC SELECT supp_name FROM supplier_nepal 53 | # MAGIC UNION 54 | # MAGIC SELECT supp_name FROM supplier_india 55 | 56 | # COMMAND ---------- 57 | 58 | # MAGIC %md 59 | # MAGIC ### UNION ALL DEMO 60 | # MAGIC UNION ALL perform same as UNION except union all will include duplicate rows also in resultset. 61 | 62 | # COMMAND ---------- 63 | 64 | # MAGIC %sql 65 | # MAGIC -- Now you will see duplicate records as we are using UNION ALL. You can see Martand Singh & Gaurav Chandawani are duplicate. If all the records are unique then union and union all behave same. UNION ALL is faster than UNION as it does not have to perform checks for duplicate rows. So until you dont need duplicate check, try to use UNION ALL. 66 | # MAGIC SELECT supp_name FROM supplier_nepal 67 | # MAGIC UNION ALL 68 | # MAGIC SELECT supp_name FROM supplier_india 69 | 70 | # COMMAND ---------- 71 | 72 | # MAGIC %md 73 | # MAGIC ###INTERSECT DEMO 74 | # MAGIC Let's say we want to find common supplier names. Intersection will give you records which are available in both the tables. Again it will find out common rows based on all the columns in your select query. So if we include city or supplier id, it will not give you any result as all three column makes your record unique. 
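The same column dependence is easy to see with the DataFrame API; a sketch with the two supplier tables:

```python
ind = spark.table("supplier_india")
nep = spark.table("supplier_nepal")

# all three columns make each row unique, so the intersection is empty
display(ind.select("supp_id", "supp_name", "city")
           .intersect(nep.select("supp_id", "supp_name", "city")))

# on supplier name alone, the two shared suppliers appear
display(ind.select("supp_name").intersect(nep.select("supp_name")))
```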
75 |
76 | # COMMAND ----------
77 |
78 | # MAGIC %sql
79 | # MAGIC -- this will return zero rows
80 | # MAGIC SELECT supp_id, supp_name, city FROM supplier_india
81 | # MAGIC INTERSECT
82 | # MAGIC SELECT supp_id, supp_name, city FROM supplier_nepal
83 |
84 | # COMMAND ----------
85 |
86 | # MAGIC %sql
87 | # MAGIC -- this will return 2 rows, as these two supplier names are common to both tables.
88 | # MAGIC SELECT supp_name FROM supplier_india
89 | # MAGIC INTERSECT
90 | # MAGIC SELECT supp_name FROM supplier_nepal
91 |
92 | # COMMAND ----------
93 |
94 | # MAGIC %md
95 | # MAGIC ### EXCEPT DEMO
96 | # MAGIC The SQL EXCEPT operator is used to return all rows in the first SELECT statement that are not returned by the second SELECT statement.
97 | # MAGIC
98 | # MAGIC Let's find all the suppliers from India that serve only the Indian market.
99 | # MAGIC Or in other words, let's find all the Indian suppliers that are not operating in Nepal.
100 |
101 | # COMMAND ----------
102 |
103 | # MAGIC %sql
104 | # MAGIC -- To answer this, we select all the Indian suppliers that are not present in the Nepali supplier table.
105 | # MAGIC SELECT supp_name FROM supplier_india
106 | # MAGIC EXCEPT
107 | # MAGIC SELECT supp_name FROM supplier_nepal
108 | # MAGIC
109 | # MAGIC -- The query above says: get all supplier names from India which are not available in Nepal.
110 | # MAGIC -- So there are only two suppliers which operate only in India; the other two (Martand & Gaurav) operate in Nepal as well, as they appear in the Nepal supplier list too.
111 |
112 | # COMMAND ----------
113 |
114 | # MAGIC %sql
115 | # MAGIC -- If you reverse the query, it says: get all the suppliers from Nepal which operate only in Nepal (i.e. not in India).
116 | # MAGIC SELECT supp_name FROM supplier_nepal
117 | # MAGIC EXCEPT
118 | # MAGIC SELECT supp_name FROM supplier_india
119 |
120 | # COMMAND ----------
121 |
122 | # MAGIC %run ../SETUP/_clean_up
123 |
124 | # COMMAND ----------
125 |
126 |
127 |
-------------------------------------------------------------------------------- /SQL Refresher/SR009-External Tables.py: --------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC ### What is an external table?
4 | # MAGIC In a typical (managed) table, the data is stored in the database; in an external table, the data is stored in files in an external location. External tables store file-level metadata about the data files, such as the filename, a version identifier and related properties, but the data itself is not stored in the database system. These tables are slower to query because the data is loaded from the external source every time you run a query on them.
5 | # MAGIC
6 | # MAGIC SYNTAX:
7 | # MAGIC We use the "EXTERNAL" keyword to create an external table.
8 | # MAGIC
9 | # MAGIC Some characteristics:
10 | # MAGIC 1. Data lives in external storage
11 | # MAGIC 1. Slower to load
12 | # MAGIC 1. Dropping the table deletes only the table metadata. Your external data stays safe.
13 | # MAGIC 1. It gives you the latest results from the file. If you delete the external file, your query will throw an exception.
14 | # MAGIC
15 | # MAGIC Use case:
16 | # MAGIC 1. As a data engineer, I use an external table when I want to perform some ad hoc analysis on an external file that we use infrequently.
17 | # MAGIC 1. Sometimes you have data in CSV or flat files which a business user provides you & which gets updated frequently (my personal experience); in that case you can use an external table (assuming the file size is small) to load that data, so that you always get the latest results.
18 | # MAGIC
19 | # MAGIC ![External_Table](https://raw.githubusercontent.com/martandsingh/images/master/external.png)
20 |
21 | # COMMAND ----------
22 |
23 | # MAGIC %run ../SETUP/_initial_setup
24 |
25 | # COMMAND ----------
26 |
27 | # MAGIC %sql -- Here we are creating an external table from a CSV file stored in my Databricks storage. The command below may not run on your system, as you may not have the same file at the same location. You can upload a CSV file (in my case it is semicolon-delimited, matching the delimiter option below) to your Databricks storage and update the path and columns accordingly. The CSV file I used is available at: https://raw.githubusercontent.com/martandsingh/datasets/master/bank-full.csv
28 | # MAGIC DROP TABLE IF EXISTS bank_report;
29 | # MAGIC CREATE EXTERNAL TABLE bank_report (
30 | # MAGIC age STRING,
31 | # MAGIC job STRING,
32 | # MAGIC marital STRING,
33 | # MAGIC education STRING,
34 | # MAGIC default STRING,
35 | # MAGIC balance STRING,
36 | # MAGIC housing STRING,
37 | # MAGIC loan STRING,
38 | # MAGIC contact STRING,
39 | # MAGIC day STRING,
40 | # MAGIC month STRING,
41 | # MAGIC duration STRING,
42 | # MAGIC campaign STRING,
43 | # MAGIC pdays STRING,
44 | # MAGIC previous STRING,
45 | # MAGIC poutcome STRING,
46 | # MAGIC y STRING
47 | # MAGIC ) USING CSV OPTIONS (
48 | # MAGIC path "/FileStore/tables/dataset/*.csv",
49 | # MAGIC delimiter ";",
50 | # MAGIC header "true"
51 | # MAGIC );
52 |
53 | # COMMAND ----------
54 |
55 | # MAGIC %sql -- now let's select the top 100 records from our external table
56 | # MAGIC SELECT
57 | # MAGIC *
58 | # MAGIC FROM
59 | # MAGIC bank_report
60 | # MAGIC LIMIT
61 | # MAGIC 100;
62 |
63 | # COMMAND ----------
64 |
65 | # MAGIC %sql -- We can also create an external table without defining a schema.
66 | # MAGIC DROP TABLE IF EXISTS bank_report_nc;
67 | # MAGIC CREATE EXTERNAL TABLE bank_report_nc USING CSV OPTIONS (
68 | # MAGIC path "/FileStore/tables/dataset/*.csv",
69 | # MAGIC delimiter ";",
70 | # MAGIC header "true"
71 | # MAGIC );
72 |
73 | # COMMAND ----------
74 |
75 | # MAGIC %sql
76 | # MAGIC SELECT * FROM bank_report_nc LIMIT 10;
77 |
78 | # COMMAND ----------
79 |
80 | # MAGIC %sql -- Creating a Delta table from our external table: we take a subset of bank_report and save the output to a new Delta table, bank_report_del.
81 | # MAGIC DROP TABLE IF EXISTS bank_report_del;
82 | # MAGIC CREATE TABLE bank_report_del USING DELTA AS
83 | # MAGIC SELECT
84 | # MAGIC *
85 | # MAGIC FROM
86 | # MAGIC bank_report
87 | # MAGIC WHERE
88 | # MAGIC balance > 1500
89 |
90 | # COMMAND ----------
91 |
92 | # MAGIC %sql
93 | # MAGIC SELECT * FROM bank_report_del
94 |
95 | # COMMAND ----------
96 |
97 | # DO NOT WORRY ABOUT THIS CODE. I am using it to save a Delta table (parquet data files plus a transaction log) in your DBFS, for the external table demo in the next cell. We will understand this code in the future. For now, just run this cell.
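# (For contrast -- a hypothetical sketch, not required for this demo -- the same DataFrame could be
#  written as plain parquet files with df.write.format("parquet").mode("overwrite").save("/tmp/bank_users_parquet"),
#  and an external table could then be created over that folder with USING PARQUET LOCATION instead of USING DELTA LOCATION.)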
98 | df = spark.sql("SELECT * FROM bank_report WHERE balance > 1500")
99 | #display(df)
100 |
101 | df.write.format("delta").mode("overwrite").save("/delta/bank_users_1500")
102 |
103 | # COMMAND ----------
104 |
105 | # You can check the files using dbutils
106 | display(dbutils.fs.ls('/delta/bank_users_1500'))
107 |
108 | # COMMAND ----------
109 |
110 | # MAGIC %sql -- We are creating an external table using a Delta location. We saved the Delta files to this location in the previous cells.
111 | # MAGIC DROP TABLE IF EXISTS bank_report_parq;
112 | # MAGIC CREATE EXTERNAL TABLE bank_report_parq USING DELTA LOCATION "/delta/bank_users_1500/"
113 |
114 | # COMMAND ----------
115 |
116 | # MAGIC %sql
117 | # MAGIC SELECT * FROM bank_report_parq
118 |
119 | # COMMAND ----------
120 |
121 | # MAGIC %run ../SETUP/_clean_up
122 |
123 | # COMMAND ----------
124 |
125 |
126 |
-------------------------------------------------------------------------------- /SQL Refresher/SR010-Drop database & tables.py: --------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC ### Drop database
4 | # MAGIC DROP DATABASE {DB_NAME}
5 | # MAGIC
6 | # MAGIC ### DROP table
7 | # MAGIC DROP TABLE {TABLE_NAME}
8 | # MAGIC
9 | # MAGIC Our initial setup script creates the DB_DEMO database & the employee table. Let's walk through a demo.
10 |
11 | # COMMAND ----------
12 |
13 | # MAGIC %run ../SETUP/_initial_setup
14 |
15 | # COMMAND ----------
16 |
17 | # MAGIC %sql
18 | # MAGIC --list databases
19 | # MAGIC SHOW DATABASES;
20 |
21 | # COMMAND ----------
22 |
23 | # MAGIC %sql
24 | # MAGIC SHOW TABLES;
25 |
26 | # COMMAND ----------
27 |
28 | # MAGIC %sql
29 | # MAGIC -- We can see there is a database named db_demo & a few tables. Let's drop the employee table & then the whole database.
30 | # MAGIC DROP TABLE employee;
31 |
32 | # COMMAND ----------
33 |
34 | # MAGIC %sql
35 | # MAGIC SHOW TABLES;
36 | # MAGIC -- now you will not see the employee table in the list
37 |
38 | # COMMAND ----------
39 |
40 | # MAGIC %sql
41 | # MAGIC -- Let's delete the whole database. If the database contains tables, we have to use the CASCADE keyword: it deletes all the tables and other objects inside the database and then drops the database itself.
42 | # MAGIC DROP DATABASE DB_DEMO CASCADE
43 |
44 | # COMMAND ----------
45 |
46 | # MAGIC %sql
47 | # MAGIC SHOW DATABASES
48 | # MAGIC -- db_demo will not be in the list.
49 |
50 | # COMMAND ----------
51 |
52 | # MAGIC %run ../SETUP/_clean_up
53 |
54 | # COMMAND ----------
55 |
56 |
57 |
-------------------------------------------------------------------------------- /SQL Refresher/SR011-Check Table & Database Details.py: --------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC ### What is metadata?
4 | # MAGIC Metadata is descriptive information. Whenever you create a database or table, metadata is generated in the backend. During your data engineering work, you may need to see details or metadata about your tables & databases. We can do this using the DESCRIBE keyword.
5 | # MAGIC
6 | # MAGIC Let's have a demo.
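# MAGIC
# MAGIC The commands demonstrated in the cells below are:
# MAGIC * DESCRIBE DATABASE [EXTENDED] {db} - basic and extended database details
# MAGIC * DESCRIBE [TABLE] [EXTENDED] {table} - columns, data types and table details
# MAGIC * DESCRIBE DETAIL {table} - file-level details such as numFiles and the table location
# MAGIC * DESCRIBE HISTORY {table} - the changes recorded in the transaction log
# MAGIC * SHOW DATABASES / SHOW TABLES - list databases and tables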
7 | # MAGIC
8 | # MAGIC ![METADATA](https://raw.githubusercontent.com/martandsingh/images/master/metadat.png)
9 |
10 | # COMMAND ----------
11 |
12 | # MAGIC %run ../SETUP/_initial_setup
13 |
14 | # COMMAND ----------
15 |
16 | # MAGIC %sql
17 | # MAGIC -- check database details
18 | # MAGIC DESCRIBE DATABASE DB_DEMO;
19 | # MAGIC
20 | # MAGIC -- You can see it gives you basic information about the database, including the location where the data is stored. You can change this location when creating the database.
21 |
22 | # COMMAND ----------
23 |
24 | # MAGIC %sql
25 | # MAGIC DESCRIBE DATABASE EXTENDED DB_DEMO;
26 | # MAGIC -- When you add EXTENDED, it gives you more details about the database. We have not defined any properties, so that column is shown empty.
27 |
28 | # COMMAND ----------
29 |
30 | # MAGIC %sql
31 | # MAGIC DESCRIBE TABLE employee
32 | # MAGIC
33 | # MAGIC -- or you can simply write DESCRIBE employee. You can see all the columns and data types. We do not have any partitioning for now.
34 |
35 | # COMMAND ----------
36 |
37 | # MAGIC %sql
38 | # MAGIC DESCRIBE EXTENDED employee
39 | # MAGIC -- this will show detailed information about the table.
40 |
41 | # COMMAND ----------
42 |
43 | # In the output above you can see the location of the table. The location shows where all the data files for that particular table are saved. Let's explore that folder.
44 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/employee/"))
45 |
46 | # COMMAND ----------
47 |
48 | # All the transaction logs are saved in the _delta_log folder
49 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/employee/_delta_log/"))
50 |
51 | # COMMAND ----------
52 |
53 | # Transaction logs are saved as JSON files. Let's explore one.
54 | df_trans = spark.sql("SELECT * FROM json.`dbfs:/user/hive/warehouse/db_demo.db/employee/_delta_log/00000000000000000001.json`")
55 | display(df_trans)
56 |
57 | # the add column shows the files added for that particular transaction & the remove column shows which files were removed in that transaction
58 |
59 | # COMMAND ----------
60 |
61 | # MAGIC %sql
62 | # MAGIC DESCRIBE DETAIL employee
63 | # MAGIC -- Using this you can see more details about the table. We are interested in numFiles, which shows the data files currently used by this table. But we saw earlier that there are more than 8 files in the table location. What are those?
64 | # MAGIC -- Delta Lake keeps older versions of the data for its time travel functionality. When you run a query, it checks the metadata and only includes the files which are valid, ignoring all the others. We do not have to go too deep into this; it is just good to know how it works.
65 |
66 | # COMMAND ----------
67 |
68 | # MAGIC %sql
69 | # MAGIC -- You can check the history of a table. It will show you all the changes saved in the transaction logs.
70 | # MAGIC DESCRIBE HISTORY db_demo.employee
71 |
72 | # COMMAND ----------
73 |
74 | # MAGIC %sql
75 | # MAGIC --Check all the databases in the system
76 | # MAGIC SHOW DATABASES;
77 |
78 | # COMMAND ----------
79 |
80 | # MAGIC %sql
81 | # MAGIC -- Check all the tables in the current database. Before running this command you have to select a database; we selected the DB_DEMO database in our initial_setup script.
82 | # MAGIC SHOW TABLES
83 |
84 | # COMMAND ----------
85 |
86 | # MAGIC %sql
87 | # MAGIC -- SHOW PARTITIONS DB_DEMO.employee;
88 | # MAGIC -- We can use the command above to check partitions. As our table is not partitioned, we cannot run it, so it is left commented out.
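# MAGIC -- (A hypothetical sketch of what you would see if the table were partitioned: for a table created
# MAGIC --  with PARTITIONED BY (dept_id), SHOW PARTITIONS would return one row per partition value,
# MAGIC --  e.g. dept_id=101, dept_id=102, and DESCRIBE TABLE would list dept_id under "# Partition Information".)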
89 |
90 | # COMMAND ----------
91 |
92 |
93 |
-------------------------------------------------------------------------------- /SQL Refresher/SR012-Versioning, Time Travel & Optimization.py: --------------------------------------------------------------------------------
1 | # Databricks notebook source
2 | # MAGIC %md
3 | # MAGIC ### Versioning, Time Travel & Optimization
4 | # MAGIC
5 | # MAGIC ### OPTIMIZE
6 | # MAGIC Delta Lake on Databricks can improve the speed of read queries from a table. One way to improve this speed is to coalesce small files into larger ones. You trigger compaction by running the OPTIMIZE command.
7 | # MAGIC
8 | # MAGIC ### Z-ORDER
9 | # MAGIC Data skipping is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses).
10 | # MAGIC
11 | # MAGIC As new data is inserted into a Databricks Delta table, file-level min/max statistics are collected for all columns (including nested ones) of supported types. Then, when there's a lookup query against the table, Databricks Delta first consults these statistics to determine which files can safely be skipped. This is done automatically and no specific commands are required to be run for this.
12 | # MAGIC
13 | # MAGIC * Z-Ordering is a technique to co-locate related information in the same set of files.
14 | # MAGIC * Z-Ordering maps multidimensional data to one dimension while preserving the locality of the data points.
15 | # MAGIC
16 | # MAGIC
17 | # MAGIC ### Z-Order Vs Partition
18 | # MAGIC Partitioning physically splits the data into different files/directories, each holding only one specific value, while Z-Ordering clusters related data inside files that may contain multiple possible values for a given column.
19 | # MAGIC
20 | # MAGIC Partitioning is useful when you have a low-cardinality column - when there are not so many different possible values - for example, you can easily partition by year & month (maybe by day), but if you partition in addition by hour, then you'll have too many partitions with too many files, and it will lead to big performance problems.
21 | # MAGIC
22 | # MAGIC Z-Ordering allows Delta to create bigger files that are more efficient to read than many small files.
23 |
24 | # COMMAND ----------
25 |
26 | # MAGIC %run ../SETUP/_pyspark_init_setup
27 |
28 | # COMMAND ----------
29 |
30 | # MAGIC %run ../SETUP/_initial_setup
31 |
32 | # COMMAND ----------
33 |
34 | from pyspark.sql.types import StructField, StructType, StringType, DecimalType, IntegerType
35 |
36 | # COMMAND ----------
37 |
38 | # MAGIC %md
39 | # MAGIC ### Load Data
40 |
41 | # COMMAND ----------
42 |
43 | # We are using a Steam gaming dataset.
44 | custom_schema = StructType(
45 | [
46 | StructField("gamer_id", IntegerType(), True),
47 | StructField("game", StringType(), True),
48 | StructField("behaviour", StringType(), True),
49 | StructField("play_hours", DecimalType(), True),
50 | StructField("rating", IntegerType(), True)
51 | ])
52 | df = spark.read.option("header", "true").schema(custom_schema).csv('/FileStore/datasets/steam-200k.csv')
53 | display(df)
54 |
55 | # COMMAND ----------
56 |
57 | df.write.format("delta").saveAsTable("DB_DEMO.game_stats")
58 |
59 | # COMMAND ----------
60 |
61 | # MAGIC %sql
62 | # MAGIC DESCRIBE EXTENDED DB_DEMO.game_stats
63 |
64 | # COMMAND ----------
65 |
66 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/game_stats"))
67 |
68 | # COMMAND ----------
69 |
70 | # MAGIC %sql
71 | # MAGIC -- Check the number of files before OPTIMIZE. At this point the command shows 3 as the number of files.
72 | # MAGIC DESCRIBE DETAIL db_demo.game_stats
73 |
74 | # COMMAND ----------
75 |
76 | # MAGIC %md
77 | # MAGIC ### OPTIMIZE combines smaller files into bigger ones
78 |
79 | # COMMAND ----------
80 |
81 | # MAGIC %sql
82 | # MAGIC -- let's optimize the table and then check the number of files again
83 | # MAGIC OPTIMIZE db_demo.game_stats
84 |
85 | # COMMAND ----------
86 |
87 | # MAGIC %sql
88 | # MAGIC -- Check the number of files after OPTIMIZE. After optimizing it shows 1 file.
89 | # MAGIC DESCRIBE DETAIL db_demo.game_stats
90 |
91 | # COMMAND ----------
92 |
93 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/game_stats"))
94 |
95 | # COMMAND ----------
96 |
97 | # The _delta_log folder contains all the transaction files
98 | display(dbutils.fs.ls("dbfs:/user/hive/warehouse/db_demo.db/game_stats/_delta_log/"))
99 |
100 | # COMMAND ----------
101 |
102 | display(spark.sql("SELECT * FROM json.`dbfs:/user/hive/warehouse/db_demo.db/game_stats/_delta_log/00000000000000000001.json`"))
103 |
104 | # This was the last transaction, which is our OPTIMIZE command. The add column lists the new files added (in our case only one new file), while the remove column shows the 3 files that were removed. We saw the same result using the DESCRIBE DETAIL command earlier.
105 |
106 | # COMMAND ----------
107 |
108 | display(spark.read.format("delta").load('dbfs:/user/hive/warehouse/db_demo.db/game_stats/'))
109 |
110 | # COMMAND ----------
111 |
112 | df.write.format("delta").saveAsTable("DB_DEMO.game_stats_new")
113 |
114 | # COMMAND ----------
115 |
116 | # MAGIC %md
117 | # MAGIC ### Z-ORDER
118 | # MAGIC It keeps similar data close together to optimize queries.
119 |
120 | # COMMAND ----------
121 |
122 | # MAGIC %sql
123 | # MAGIC OPTIMIZE db_demo.game_stats_new
124 | # MAGIC ZORDER BY (game)
125 |
126 | # COMMAND ----------
127 |
128 | display(spark.read.format("delta").load('dbfs:/user/hive/warehouse/db_demo.db/game_stats_new/'))
129 | # If you compare this result with the game_stats result above, you will see that rows of the same kind are grouped together here, because we applied ZORDER on the game column. It keeps data for the same game together.
130 |
131 | # COMMAND ----------
132 |
133 | # MAGIC %md
134 | # MAGIC ### Versioning & Time Travel
135 |
136 | # COMMAND ----------
137 |
138 | # MAGIC %md
139 | # MAGIC Let's update the table; each transaction will create a new version of the table. Does that mean we can still access the old version of the data?
140 | # MAGIC
141 | # MAGIC Let's try it out.
142 |
143 | # COMMAND ----------
144 |
145 | # MAGIC %sql
146 | # MAGIC UPDATE DB_DEMO.game_stats
147 | # MAGIC SET rating = CASE WHEN play_hours <10 THEN 3 ELSE 4.5 END
148 |
149 | # COMMAND ----------
150 |
151 | # MAGIC %sql
152 | # MAGIC DESCRIBE HISTORY DB_DEMO.game_stats
153 | # MAGIC -- In our case we have three versions. Let's access the version before the update, i.e. version 1, since version 2 is the latest version, created by our UPDATE command.
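# MAGIC -- (Equivalent PySpark sketch, assuming the same table path shown earlier in this notebook:
# MAGIC --  spark.read.format("delta").option("versionAsOf", 1).load("dbfs:/user/hive/warehouse/db_demo.db/game_stats/")
# MAGIC --  reads version 1 of the table; option("timestampAsOf", "<timestamp>") works the same way.)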
154 |
155 | # COMMAND ----------
156 |
157 | # MAGIC %sql
158 | # MAGIC -- In this version we should see our original rating of 0
159 | # MAGIC SELECT * FROM DB_DEMO.game_stats VERSION AS OF 1
160 |
161 | # COMMAND ----------
162 |
163 | # MAGIC %sql
164 | # MAGIC -- You can also use a timestamp to access an older version of the data
165 | # MAGIC SELECT * FROM DB_DEMO.game_stats TIMESTAMP AS OF "2022-06-17 07:40:14.000+0000"
166 |
167 | # COMMAND ----------
168 |
169 | # MAGIC %sql
170 | # MAGIC -- This will give you the latest version, with non-zero ratings
171 | # MAGIC SELECT * FROM DB_DEMO.game_stats
172 |
173 | # COMMAND ----------
174 |
175 | # MAGIC %md
176 | # MAGIC ### ROLLBACK
177 | # MAGIC Let's say you update or delete your table by mistake & now you want to roll back to the previous version. Is it possible?
178 | # MAGIC Let's try it out.
179 |
180 | # COMMAND ----------
181 |
182 | # MAGIC %sql
183 | # MAGIC -- let's delete all the data from the game_stats table
184 | # MAGIC DELETE FROM DB_DEMO.game_stats
185 |
186 | # COMMAND ----------
187 |
188 | # MAGIC %sql
189 | # MAGIC -- now we can see there are no records
190 | # MAGIC SELECT * FROM DB_DEMO.game_stats
191 |
192 | # COMMAND ----------
193 |
194 | # MAGIC %sql
195 | # MAGIC DESCRIBE HISTORY DB_DEMO.game_stats
196 |
197 | # COMMAND ----------
198 |
199 | # MAGIC %sql
200 | # MAGIC -- Remember, version 2 was the version created after we updated the rating; version 3 was created by the DELETE command. So we want to restore to the version before our delete.
201 | # MAGIC RESTORE TABLE DB_DEMO.game_stats TO VERSION AS OF 2
202 |
203 | # COMMAND ----------
204 |
205 | # MAGIC %sql
206 | # MAGIC --- TADA!! We have restored our data.
207 | # MAGIC SELECT * FROM DB_DEMO.game_stats
208 |
209 | # COMMAND ----------
210 |
211 | # MAGIC %md
212 | # MAGIC ### Purging data using VACUUM
213 | # MAGIC Delta Lake versioning is very helpful for restoring or accessing old data, but it is not practical to keep every version of a large dataset. It would be very expensive if you have GBs or TBs of data. So we should purge the outdated files. The VACUUM command helps us achieve this.
214 | # MAGIC
215 | # MAGIC __This will delete data files permanently.__
216 |
217 | # COMMAND ----------
218 |
219 | # MAGIC %sql
220 | # MAGIC -- This will give you an error, because deleting all your old versions is not a best practice. The command below would delete all the old versions because we are using a retention of 0 hours; the default is 168 hours (7 days).
221 | # MAGIC
222 | # MAGIC -- UNCOMMENT BELOW COMMAND AND RUN
223 | # MAGIC --VACUUM DB_DEMO.game_stats retain 0 hours
224 |
225 | # COMMAND ----------
226 |
227 | # MAGIC %sql
228 | # MAGIC -- If you want to use 0 hours as the retention period, you have to change the Spark configuration that checks the retention duration.
229 | # MAGIC
230 | # MAGIC SET spark.databricks.delta.retentionDurationCheck.enabled=false;
231 | # MAGIC SET spark.databricks.delta.vacuum.logging.enabled=true;
232 |
233 | # COMMAND ----------
234 |
235 | # MAGIC %md
236 | # MAGIC You can use the DRY RUN version of VACUUM to print out the files that would be deleted; it will not actually delete them.
237 |
238 | # COMMAND ----------
239 |
240 | # MAGIC %sql
241 | # MAGIC VACUUM DB_DEMO.game_stats retain 1 HOURS DRY RUN
242 | # MAGIC -- the files listed in the output would be deleted if you ran this query without DRY RUN.
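# MAGIC -- (Note: DRY RUN reports the data files that would be removed. VACUUM does not delete the _delta_log
# MAGIC --  history itself, but once the underlying data files are gone, time travel to those older versions will fail.)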
243 |
244 | # COMMAND ----------
245 |
246 | # Before deleting the data files, let's check the folder
247 | display(dbutils.fs.ls('dbfs:/user/hive/warehouse/db_demo.db/game_stats/'))
248 |
249 | # COMMAND ----------
250 |
251 | # MAGIC %sql
252 | # MAGIC VACUUM DB_DEMO.game_stats retain 1 HOURS
253 |
254 | # COMMAND ----------
255 |
256 | # Check the folder after purging. It deleted the data files from versions older than 1 hour.
257 | display(dbutils.fs.ls('dbfs:/user/hive/warehouse/db_demo.db/game_stats/'))
258 |
259 | # COMMAND ----------
260 |
261 | # MAGIC %md
262 | # MAGIC __Sometimes you can still query the older versions because of caching, so it is always better to restart your cluster.__
263 |
264 | # COMMAND ----------
265 |
266 | # MAGIC %run ../SETUP/_clean_up
267 |
268 | # COMMAND ----------
269 |
270 | # MAGIC %run ../SETUP/_pyspark_clean_up
271 |
272 | # COMMAND ----------
273 |
274 |
275 |
--------------------------------------------------------------------------------