├── .github
│   └── workflows
│       └── jekyll-gh-pages.yml
├── README.md
├── _config.yml
└── files
    ├── ADEWD_Knowledge_Checks.pdf
    ├── ade-mod-1-incremental-processing-with-spark-structured-streaming.pdf
    ├── ade-mod-2-streaming-etl-patterns-with-dlt.pdf
    ├── ade-mod-3-data-privacy-patterns.pdf
    ├── ade-mod-4-performance-optimization-with-spark-and-delta-lake.pdf
    ├── ade-mod-5-swe-practices-with-dlt.pdf
    ├── ade-mod-6-automate-production-workflows.pdf
    ├── advanced-data-engineering-with-databricks.dbc
    └── advanced-data-engineering-with-databricks.pdf

/.github/workflows/jekyll-gh-pages.yml:
--------------------------------------------------------------------------------
# Sample workflow for building and deploying a Jekyll site to GitHub Pages
name: Deploy Jekyll with GitHub Pages dependencies preinstalled

on:
  # Runs on pushes targeting the default branch
  push:
    branches: ["main"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the in-progress run and the latest queued one.
# However, do NOT cancel in-progress runs, as we want these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Build job
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Pages
        uses: actions/configure-pages@v4
      - name: Build with Jekyll
        uses: actions/jekyll-build-pages@v1
        with:
          source: ./
          destination: ./_site
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3

  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Databricks Certified Data Engineer Professional Questions

## Suggestions

- [Practice from these notebooks thoroughly](/files/advanced-data-engineering-with-databricks.dbc) along with the PDFs below:
  * [Incremental processing](/files/ade-mod-1-incremental-processing-with-spark-structured-streaming.pdf)
  * [ETL patterns](/files/ade-mod-2-streaming-etl-patterns-with-dlt.pdf)
  * [Data Privacy](/files/ade-mod-3-data-privacy-patterns.pdf)
  * [Performance Optimization](/files/ade-mod-4-performance-optimization-with-spark-and-delta-lake.pdf)
  * [DLT practices](/files/ade-mod-5-swe-practices-with-dlt.pdf)
  * [Automate prod workflows](/files/ade-mod-6-automate-production-workflows.pdf)
  * [Knowledge check](/files/ADEWD_Knowledge_Checks.pdf)
  * [Udemy Practice](https://www.udemy.com/course/practice-exams-databricks-data-engineer-professional-k/) (I personally opted for this and consider it a must; it is free with Udemy for Business)

Repo [link](https://github.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions)

## Topics

I was able to note down these topics from memory.

1. How can you read parameters using `dbutils.widgets.text` and retrieve their values?
    * **Hint**: Focus on using `dbutils.widgets.get` to retrieve the values.
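    A minimal sketch of this pattern; the widget name `env`, its default value, and its label are made up for illustration:

    ```python
    # Create a text widget: (name, default value, label). All values here are hypothetical.
    dbutils.widgets.text("env", "dev", "Environment")

    # Retrieve the widget's current value (always returned as a string).
    env = dbutils.widgets.get("env")
    print(f"Running against environment: {env}")
    ```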
2. How do you provide read access for a production notebook to a new data engineer for review?
    * **Hint**: The answer involves setting the notebook's permissions to "Can Read."
3. When attaching a notebook to a cluster, which permission allows you to run the notebook?
    * **Hint**: The user needs "Can Restart" permission.
4. Should production DLT pipelines be run on a job cluster or an all-purpose cluster?
5. Does a CTAS (CREATE TABLE AS SELECT) operation execute the load every time or only during table creation?
6. How can you control access to read production secrets using scope access control?
    * **Hint**: The answer involves setting "Read" permissions on the scope or secret.
7. Where does the `%sh` command run in Databricks?
    * **Hint**: It runs on the driver node.
8. If a query contains a filter, how does Databricks use file statistics in the transaction log?
9. What happens when you run a `VACUUM` command on a shallow clone table?
    * **Hint**: Running `VACUUM` on a shallow clone table will result in an error.
10. Which type of join (left, inner, right) is not possible when performing a join between a static DataFrame and a streaming DataFrame?
    * **Hint**: Consider the limitations of streaming joins.
11. When the source is a CDC (Change Data Capture) feed, should you use `MERGE INTO` or leverage the Change Data Feed (CDF) feature?
12. How can you find the difference between the previous and present commit in a Delta table?
13. What is the best approach for nightly jobs that overwrite a table for the business team with the least latency?
    * **Hint**: Should you write to the table nightly or create a view?
14. What does the `OPTIMIZE TABLE` command do, and what is the target file size?
    * **Hint**: Focus on the target file size of 1 GB.
15. In a streaming scenario, what does the `.withWatermark` function do with a delay of 10 minutes?
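    A minimal sketch of the behavior, assuming a hypothetical `events` Delta table with an `event_time` timestamp column:

    ```python
    from pyspark.sql import functions as F

    windowed = (
        spark.readStream.table("events")            # hypothetical streaming source
            # Track event time and tolerate records arriving up to 10 minutes late.
            .withWatermark("event_time", "10 minutes")
            # Aggregate into 5-minute tumbling windows keyed on event time.
            .groupBy(F.window("event_time", "5 minutes"))
            .count()
    )
    ```

    Once the watermark (the max observed `event_time` minus 10 minutes) passes the end of a window, that window's state is finalized and dropped, and records arriving later than that for the window are ignored.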
16. How does aggregating on the source and then overwriting/appending to the target impact the data load?
17. Why did you receive three email notifications when a job was set to trigger an email if `mean(temp) > 120`?
    * **Hint**: Investigate multiple triggers for the email alert.
18. Why should the checkpoint directory be unique for each stream in a streaming job?
19. How would you set up an Auto Loader scenario to load data into a bronze table with history and update the target table?
20. How can you handle streaming deduplication based on a given code scenario?
21. For batch loading, what happens if the load is set to overwrite or append?
    * **Hint**: Consider the impact on the target table.
22. In a Change Data Feed (CDF) scenario, if `readChangeFeed` starts at version 0 and append is used, will there be deduplication?
23. How can you identify whether a table is SCD Type 1 or 2 based on an upsert operation?
24. To avoid performance issues, should you decrease the trigger time or not?
25. How does Delta Lake decide on file skipping based on the columns in a query, and what are the implications for nested columns?
26. What does granting "Usage" and "Select" permissions on a Delta table allow a user to do?
27. How do you create an unmanaged table in Databricks?
28. What makes a date column a good candidate for partitioning in a Delta table?
29. What happens in the transaction log when you rename a Delta table using `ALTER TABLE xx RENAME xx`?
30. How would you handle an error with a check constraint, and what would you recommend?
31. When using `DESCRIBE` commands, how can you retrieve table properties, comments, and partition details?
    * **Hint**: Use `DESCRIBE HISTORY`, `DESCRIBE EXTENDED`, or `DESCRIBE DETAIL`.
32. How are file statistics used in Delta Lake, and why are they important?
33. In the Ganglia UI, how can you detect a spill during query execution?
34. If a repo branch is missing locally, how can you retrieve that branch with the latest code changes?
35. After deleting records with a query like `DELETE FROM A WHERE id IN (SELECT id FROM B)`, can you time travel to see the deleted records, and how can you prevent their permanent deletion?
36. What are the differences between DBFS and mounts in Databricks?
37. If the API `2.0/jobs/create` is executed three times with the same JSON, what will happen? Will it create one job or three?
38. What is DBFS in Databricks?
39. How do you install a Python library using `%pip` in a Databricks notebook?
40. If Task 1 has downstream Task 2 and Task 3 running in parallel, and Task 1 and Task 2 succeed while Task 3 fails, what will be the final job status?
    * **Hint**: The job may show as partially completed.
41. How do you handle streaming job retries in production, specifically with job clusters, unlimited retries, and a maximum of one concurrent run?
42. How can you clone an existing job and version it using the Databricks CLI?
43. When converting a large JSON file (1 TB) to Parquet with a partition size of 512 MB, what is the correct order of steps? Should you read, perform narrow transformations, repartition (2048 partitions), then convert to Parquet?
44. What happens in the target table when duplicates are dropped during a batch read and append operation?
45. If a column was missed during profiling from Kafka, how can you ensure that the data is fully replayable in the future?
    * **Hint**: Consider writing to a bronze table.
46. How do you handle access control for users in Databricks?
47. What is the use of the `pyspark.sql.functions.broadcast` function in a Spark job?
    * **Hint**: It replicates a small DataFrame to all worker nodes so the join can avoid a shuffle.
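    A minimal sketch, with hypothetical `orders` (large) and `dim_region` (small) tables:

    ```python
    from pyspark.sql import functions as F

    orders = spark.read.table("orders")        # large fact table (hypothetical)
    regions = spark.read.table("dim_region")   # small dimension table (hypothetical)

    # broadcast() hints Spark to replicate the small DataFrame to every executor,
    # so the join executes as a broadcast hash join and the large side avoids a shuffle.
    joined = orders.join(F.broadcast(regions), on="region_id", how="left")
    ```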
48. What happens when performing a `MERGE` on `orders_id` with a 'when not matched, insert *' clause?
    * **Hint**: The operation will insert source records that don't have a match in the target.
49. Given a function definition for loading bronze data, how would you write a silver load function to transform and update downstream tables?
50. If the code includes `CASE WHEN is_member("group") THEN email ELSE 'redacted' END AS email`, what will be the output if the user is not a member of the group?
51. How can you use the Ganglia UI to view logs and troubleshoot a Databricks job?
52. When working with multi-task jobs, how do you list or get the tasks using the API `2.0/jobs/list` or `2.0/jobs/runs/list`?
53. What is unit testing, and how is it applied in a Databricks environment?
54. What happens when multiple `display()` commands are executed repeatedly in development, and what is the impact in production?
55. Will `option("readChangeFeed")` work on a source Delta table that does not have the change data feed enabled?
56. How can you identify whether a tumbling or sliding window is being used based on the code provided?
57. What performance tuning considerations are involved with `spark.sql.files.maxPartitionBytes` and `spark.sql.shuffle.partitions`?
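    A minimal sketch of setting these two knobs; the values are illustrative defaults, not recommendations:

    ```python
    # Maximum bytes packed into one input partition when reading files
    # (controls read-side parallelism; 134217728 bytes = 128 MB, the default).
    spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")

    # Number of partitions produced by shuffles for joins and aggregations
    # (default is 200; tune relative to data volume and total core count).
    spark.conf.set("spark.sql.shuffle.partitions", "200")
    ```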

## Must-Read Hyperlinks

No matter what, please read these Databricks docs. Note the "Important" callouts on these pages and the questions at the end of some pages.

1. [Data skipping with Z-order indexes for Delta Lake](https://docs.databricks.com/en/delta/data-skipping.html)
2. [Clone a table on Databricks](https://docs.databricks.com/en/delta/clone.html)
3. [Delta table streaming reads and writes](https://docs.databricks.com/en/structured-streaming/delta-lake.html)
4. [Structured Streaming Programming Guide - Spark 3.5.0 Documentation](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
5. [Configure Structured Streaming trigger intervals](https://docs.databricks.com/en/structured-streaming/triggers.html)
6. [Configure Delta Lake to control data file size](https://docs.databricks.com/en/delta/tune-file-size.html)
7. [Introducing Stream-Stream Joins in Apache Spark 2.3](https://www.databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html)
8. [Best Practices for Using Structured Streaming in Production - The Databricks Blog](https://www.databricks.com/blog/streaming-production-collected-best-practices)
9. [What is Auto Loader?](https://docs.databricks.com/en/ingestion/auto-loader/index.html)
10. [Upsert into a Delta Lake table using merge](https://docs.databricks.com/en/delta/merge.html)
11. [Use Delta Lake change data feed on Databricks](https://docs.databricks.com/en/delta/delta-change-data-feed.html)
12. [Apply watermarks to control data processing thresholds](https://docs.databricks.com/en/structured-streaming/watermarks.html)
13. [Use foreachBatch to write to arbitrary data sinks](https://docs.databricks.com/en/structured-streaming/foreach.html)
14. [How to Simplify CDC With Delta Lake's Change Data Feed](https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html)
15. [VACUUM](https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html)
16. [Jobs access control](https://docs.databricks.com/en/security/auth-authz/access-control/jobs-acl.html)
17. [Cluster access control](https://docs.databricks.com/en/security/auth-authz/access-control/cluster-acl.html)
18. [Secret access control](https://docs.databricks.com/en/security/auth-authz/access-control/secret-acl.html)
19. [Hive metastore privileges and securable objects (legacy)](https://docs.databricks.com/en/data-governance/table-acls/object-privileges.html)
20. [Data objects in the Databricks lakehouse](https://docs.databricks.com/en/lakehouse/data-objects.html)
21. [Constraints on Databricks](https://docs.databricks.com/en/tables/constraints.html)
22. [When to partition tables on Databricks](https://docs.databricks.com/en/tables/partitions.html)
23. [Manage clusters](https://docs.databricks.com/en/compute/clusters-manage.html)
24. [Export and import Databricks notebooks](https://docs.databricks.com/en/notebooks/notebook-export-import.html)
25. [Unit testing for notebooks](https://docs.databricks.com/en/notebooks/testing.html)
26. [Databricks SQL Statement Execution API – Announcing the Public Preview](https://www.databricks.com/blog/2023/03/07/databricks-sql-statement-execution-api-announcing-public-preview.html)
27. [Transform data with Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/transform.html)
28. [Manage data quality with Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/expectations.html)
29. [Simplified change data capture with the APPLY CHANGES API in Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/cdc.html)
30. [Monitor Delta Live Tables pipelines](https://docs.databricks.com/en/delta-live-tables/observability.html)
31. [Load data with Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/load.html)
32. [What is Delta Live Tables?](https://docs.databricks.com/en/delta-live-tables/index.html)
33. [What is the difference between streaming live tables and live tables? (Databricks Community)](https://community.databricks.com/t5/data-engineering/what-is-the-difference-between-streaming-live-table-and-live/m-p/17122#M11172)
34. [What are all the Delta things in Databricks?](https://docs.databricks.com/en/introduction/delta-comparison.html)
35. [Parameterized queries with PySpark](https://www.databricks.com/blog/parameterized-queries-pyspark)
36. [Recover from Structured Streaming query failures with workflows](https://docs.gcp.databricks.com/en/structured-streaming/query-recovery.html)
37. [Jobs API 2.0](https://docs.databricks.com/en/workflows/jobs/jobs-2.0-api.html)
38. [OPTIMIZE](https://docs.databricks.com/en/sql/language-manual/delta-optimize.html)
39. [Adding and Deleting Partitions in Delta Lake tables](https://delta.io/blog/2023-01-18-add-remove-partition-delta-lake/)
40. [What is the Databricks File System (DBFS)?](https://docs.databricks.com/en/dbfs/index.html)
41. [Mounting cloud object storage on Databricks](https://docs.databricks.com/en/dbfs/mounts.html)
42. [Databricks widgets](https://docs.databricks.com/en/notebooks/widgets.html)
43. [Performance Tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html)

--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
title: Databricks Certified Data Engineer Professional Questions
description: These are memory-based topics from the assessment I took in Jan 2024. If you found this useful, leave a ⭐ on the repo.
theme: jekyll-theme-cayman

--------------------------------------------------------------------------------
/files/ADEWD_Knowledge_Checks.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ADEWD_Knowledge_Checks.pdf
--------------------------------------------------------------------------------
/files/ade-mod-1-incremental-processing-with-spark-structured-streaming.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-1-incremental-processing-with-spark-structured-streaming.pdf
--------------------------------------------------------------------------------
/files/ade-mod-2-streaming-etl-patterns-with-dlt.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-2-streaming-etl-patterns-with-dlt.pdf
--------------------------------------------------------------------------------
/files/ade-mod-3-data-privacy-patterns.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-3-data-privacy-patterns.pdf
--------------------------------------------------------------------------------
/files/ade-mod-4-performance-optimization-with-spark-and-delta-lake.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-4-performance-optimization-with-spark-and-delta-lake.pdf
--------------------------------------------------------------------------------
/files/ade-mod-5-swe-practices-with-dlt.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-5-swe-practices-with-dlt.pdf
--------------------------------------------------------------------------------
/files/ade-mod-6-automate-production-workflows.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-6-automate-production-workflows.pdf
--------------------------------------------------------------------------------
/files/advanced-data-engineering-with-databricks.dbc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/advanced-data-engineering-with-databricks.dbc
--------------------------------------------------------------------------------
/files/advanced-data-engineering-with-databricks.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/advanced-data-engineering-with-databricks.pdf --------------------------------------------------------------------------------