├── .github
│   └── workflows
│       └── jekyll-gh-pages.yml
├── README.md
├── _config.yml
└── files
    ├── ADEWD_Knowledge_Checks.pdf
    ├── ade-mod-1-incremental-processing-with-spark-structured-streaming.pdf
    ├── ade-mod-2-streaming-etl-patterns-with-dlt.pdf
    ├── ade-mod-3-data-privacy-patterns.pdf
    ├── ade-mod-4-performance-optimization-with-spark-and-delta-lake.pdf
    ├── ade-mod-5-swe-practices-with-dlt.pdf
    ├── ade-mod-6-automate-production-workflows.pdf
    ├── advanced-data-engineering-with-databricks.dbc
    └── advanced-data-engineering-with-databricks.pdf

/.github/workflows/jekyll-gh-pages.yml:
--------------------------------------------------------------------------------
# Sample workflow for building and deploying a Jekyll site to GitHub Pages
name: Deploy Jekyll with GitHub Pages dependencies preinstalled

on:
  # Runs on pushes targeting the default branch
  push:
    branches: ["main"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the in-progress run and the latest queued one.
# However, do NOT cancel in-progress runs, as we want these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Build job
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Pages
        uses: actions/configure-pages@v4
      - name: Build with Jekyll
        uses: actions/jekyll-build-pages@v1
        with:
          source: ./
          destination: ./_site
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3

  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Databricks Certified Data Engineer Professional Questions

## Suggestions

- [Practice from these notebooks thoroughly](/files/advanced-data-engineering-with-databricks.dbc) along with the PDFs below:
  * [Incremental processing](/files/ade-mod-1-incremental-processing-with-spark-structured-streaming.pdf)
  * [ETL patterns](/files/ade-mod-2-streaming-etl-patterns-with-dlt.pdf)
  * [Data Privacy](/files/ade-mod-3-data-privacy-patterns.pdf)
  * [Performance Optimization](/files/ade-mod-4-performance-optimization-with-spark-and-delta-lake.pdf)
  * [DLT practices](/files/ade-mod-5-swe-practices-with-dlt.pdf)
  * [Automate prod workflows](/files/ade-mod-6-automate-production-workflows.pdf)
  * [Knowledge check](/files/ADEWD_Knowledge_Checks.pdf)
  * [Udemy Practice](https://www.udemy.com/course/practice-exams-databricks-data-engineer-professional-k/) (I personally opted for this and consider it a must; it is free with Udemy for Business)

Repo [link](https://github.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions)

## Topics

I was able to note down these topics from memory.

1. How can you read parameters using `dbutils.widgets.text` and retrieve their values?
    * **Hint**: Focus on using `dbutils.widgets.get` to retrieve the values.
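    A minimal sketch of this pattern; the widget name `env`, its default value, and its label are made up for illustration:

    ```python
    # Create a text widget: (name, default value, label). All values here are hypothetical.
    dbutils.widgets.text("env", "dev", "Environment")

    # Retrieve the widget's current value (always returned as a string).
    env = dbutils.widgets.get("env")
    print(f"Running against environment: {env}")
    ```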
2. How do you provide read access for a production notebook to a new data engineer for review?
    * **Hint**: The answer involves setting the notebook's permissions to "Can Read."
3. When attaching a notebook to a cluster, which permission allows you to run the notebook?
    * **Hint**: The user needs "Can Restart" permission.
4. Should production DLT pipelines be run on a job cluster or an all-purpose cluster?
5. Does a CTAS (CREATE TABLE AS SELECT) operation execute the load every time or only during table creation?
6. How can you control access to read production secrets using scope access control?
    * **Hint**: The answer involves setting "Read" permissions on the scope or secret.
7. Where does the `%sh` command run in Databricks?
    * **Hint**: It runs on the driver node.
8. If a query contains a filter, how does Databricks use file statistics in the transaction log?
9. What happens when you run a `VACUUM` command on a shallow clone table?
    * **Hint**: Running `VACUUM` on a shallow clone table will result in an error.
10. Which type of join (left, inner, right) is not possible when performing a join between a static DataFrame and a streaming DataFrame?
    * **Hint**: Consider the limitations of streaming joins.
11. When the source is a CDC (Change Data Capture) feed, should you use `MERGE INTO` or leverage the Change Data Feed (CDF) feature?
12. How can you find the difference between the previous and present commit in a Delta table?
13. What is the best approach for nightly jobs that overwrite a table for the business team with the least latency?
    * **Hint**: Should you write to the table nightly or create a view?
14. What does the `OPTIMIZE TABLE` command do, and what is the target file size?
    * **Hint**: Focus on the target file size of 1 GB.
15. In a streaming scenario, what does the `.withWatermark` function do with a delay of 10 minutes?
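    A minimal sketch of the behavior, assuming a hypothetical `events` Delta table with an `event_time` timestamp column:

    ```python
    from pyspark.sql import functions as F

    windowed = (
        spark.readStream.table("events")            # hypothetical streaming source
            # Track event time and tolerate records arriving up to 10 minutes late.
            .withWatermark("event_time", "10 minutes")
            # Aggregate into 5-minute tumbling windows keyed on event time.
            .groupBy(F.window("event_time", "5 minutes"))
            .count()
    )
    ```

    Once the watermark (the max observed `event_time` minus 10 minutes) passes the end of a window, that window's state is finalized and dropped, and records arriving later than that for the window are ignored.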
16. How does aggregating on the source and then overwriting/appending to the target impact the data load?
17. Why did you receive three email notifications when a job was set to trigger an email if `mean(temp) > 120`?
    * **Hint**: Investigate multiple triggers for the email alert.
18. Why should the checkpoint directory be unique for each stream in a streaming job?
19. How would you set up an Auto Loader scenario to load data into a bronze table with history and update the target table?
20. How can you handle streaming deduplication based on a given code scenario?
21. For batch loading, what happens if the load is set to overwrite or append?
    * **Hint**: Consider the impact on the target table.
22. In a Change Data Feed (CDF) scenario, if `readChangeFeed` starts at version 0 and append is used, will there be deduplication?
23. How can you identify whether a table is SCD Type 1 or 2 based on an upsert operation?
24. To avoid performance issues, should you decrease the trigger time or not?
25. How does Delta Lake decide on file skipping based on the columns in a query, and what are the implications for nested columns?
26. What does granting "Usage" and "Select" permissions on a Delta table allow a user to do?
27. How do you create an unmanaged table in Databricks?
28. What makes a date column a good candidate for partitioning in a Delta table?
29. What happens in the transaction log when you rename a Delta table using `ALTER TABLE xx RENAME xx`?
30. How would you handle an error with a check constraint, and what would you recommend?
31. When using `DESCRIBE` commands, how can you retrieve table properties, comments, and partition details?
    * **Hint**: Use `DESCRIBE HISTORY`, `DESCRIBE EXTENDED`, or `DESCRIBE DETAIL`.
32. How are file statistics used in Delta Lake, and why are they important?
33. In the Ganglia UI, how can you detect a spill during query execution?
34. If a repo branch is missing locally, how can you retrieve that branch with the latest code changes?
35. After deleting records with a query like `DELETE FROM A WHERE id IN (SELECT id FROM B)`, can you time travel to see the deleted records, and how can you prevent their permanent deletion?
36. What are the differences between DBFS and mounts in Databricks?
37. If the API `2.0/jobs/create` is executed three times with the same JSON, what will happen? Will it create one job or three?
38. What is DBFS in Databricks?
39. How do you install a Python library using `%pip` in a Databricks notebook?
40. If Task 1 has downstream Task 2 and Task 3 running in parallel, and Task 1 and Task 2 succeed while Task 3 fails, what will be the final job status?
    * **Hint**: The job may show as partially completed.
41. How do you handle streaming job retries in production, specifically with job clusters, unlimited retries, and a maximum of one concurrent run?
42. How can you clone an existing job and version it using the Databricks CLI?
43. When converting a large JSON file (1 TB) to Parquet with a partition size of 512 MB, what is the correct order of steps? Should you read, perform narrow transformations, repartition (2048 partitions), then convert to Parquet?
44. What happens in the target table when duplicates are dropped during a batch read and append operation?
45. If a column was missed during profiling from Kafka, how can you ensure that the data is fully replayable in the future?
    * **Hint**: Consider writing to a bronze table.
46. How do you handle access control for users in Databricks?
47. What is the use of the `pyspark.sql.functions.broadcast` function in a Spark job?
    * **Hint**: It replicates a small DataFrame to all worker nodes so the join can avoid a shuffle.
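    A minimal sketch, with hypothetical `orders` (large) and `dim_region` (small) tables:

    ```python
    from pyspark.sql import functions as F

    orders = spark.read.table("orders")        # large fact table (hypothetical)
    regions = spark.read.table("dim_region")   # small dimension table (hypothetical)

    # broadcast() hints Spark to replicate the small DataFrame to every executor,
    # so the join executes as a broadcast hash join and the large side avoids a shuffle.
    joined = orders.join(F.broadcast(regions), on="region_id", how="left")
    ```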
48. What happens when performing a `MERGE` on `orders_id` with a 'when not matched, insert *' clause?
    * **Hint**: The operation will insert source records that don't have a match in the target.
49. Given a function definition for loading bronze data, how would you write a silver load function to transform and update downstream tables?
50. If the code includes `CASE WHEN is_member("group") THEN email ELSE 'redacted' END AS email`, what will be the output if the user is not a member of the group?
51. How can you use the Ganglia UI to view logs and troubleshoot a Databricks job?
52. When working with multi-task jobs, how do you list or get the tasks using the API `2.0/jobs/list` or `2.0/jobs/runs/list`?
53. What is unit testing, and how is it applied in a Databricks environment?
54. What happens when multiple `display()` commands are executed repeatedly in development, and what is the impact in production?
55. Will `option("readChangeFeed")` work on a source Delta table that does not have the change data feed enabled?
56. How can you identify whether a tumbling or sliding window is being used based on the code provided?
57. What performance tuning considerations are involved with `spark.sql.files.maxPartitionBytes` and `spark.sql.shuffle.partitions`?
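    A minimal sketch of setting these two knobs; the values are illustrative defaults, not recommendations:

    ```python
    # Maximum bytes packed into one input partition when reading files
    # (controls read-side parallelism; 134217728 bytes = 128 MB, the default).
    spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")

    # Number of partitions produced by shuffles for joins and aggregations
    # (default is 200; tune relative to data volume and total core count).
    spark.conf.set("spark.sql.shuffle.partitions", "200")
    ```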

## Must-Read Hyperlinks

No matter what, please read these Databricks docs. Note the "Important" callouts on these pages and the questions at the end of some pages.

1. [Data skipping with Z-order indexes for Delta Lake](https://docs.databricks.com/en/delta/data-skipping.html)
2. [Clone a table on Databricks](https://docs.databricks.com/en/delta/clone.html)
3. [Delta table streaming reads and writes](https://docs.databricks.com/en/structured-streaming/delta-lake.html)
4. [Structured Streaming Programming Guide - Spark 3.5.0 Documentation](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
5. [Configure Structured Streaming trigger intervals](https://docs.databricks.com/en/structured-streaming/triggers.html)
6. [Configure Delta Lake to control data file size](https://docs.databricks.com/en/delta/tune-file-size.html)
7. [Introducing Stream-Stream Joins in Apache Spark 2.3](https://www.databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html)
8. [Best Practices for Using Structured Streaming in Production - The Databricks Blog](https://www.databricks.com/blog/streaming-production-collected-best-practices)
9. [What is Auto Loader?](https://docs.databricks.com/en/ingestion/auto-loader/index.html)
10. [Upsert into a Delta Lake table using merge](https://docs.databricks.com/en/delta/merge.html)
11. [Use Delta Lake change data feed on Databricks](https://docs.databricks.com/en/delta/delta-change-data-feed.html)
12. [Apply watermarks to control data processing thresholds](https://docs.databricks.com/en/structured-streaming/watermarks.html)
13. [Use foreachBatch to write to arbitrary data sinks](https://docs.databricks.com/en/structured-streaming/foreach.html)
14. [How to Simplify CDC With Delta Lake's Change Data Feed](https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html)
15. [VACUUM](https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html)
16. [Jobs access control](https://docs.databricks.com/en/security/auth-authz/access-control/jobs-acl.html)
17. [Cluster access control](https://docs.databricks.com/en/security/auth-authz/access-control/cluster-acl.html)
18. [Secret access control](https://docs.databricks.com/en/security/auth-authz/access-control/secret-acl.html)
19. [Hive metastore privileges and securable objects (legacy)](https://docs.databricks.com/en/data-governance/table-acls/object-privileges.html)
20. [Data objects in the Databricks lakehouse](https://docs.databricks.com/en/lakehouse/data-objects.html)
21. [Constraints on Databricks](https://docs.databricks.com/en/tables/constraints.html)
22. [When to partition tables on Databricks](https://docs.databricks.com/en/tables/partitions.html)
23. [Manage clusters](https://docs.databricks.com/en/compute/clusters-manage.html)
24. [Export and import Databricks notebooks](https://docs.databricks.com/en/notebooks/notebook-export-import.html)
25. [Unit testing for notebooks](https://docs.databricks.com/en/notebooks/testing.html)
26. [Databricks SQL Statement Execution API – Announcing the Public Preview](https://www.databricks.com/blog/2023/03/07/databricks-sql-statement-execution-api-announcing-public-preview.html)
27. [Transform data with Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/transform.html)
28. [Manage data quality with Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/expectations.html)
29. [Simplified change data capture with the APPLY CHANGES API in Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/cdc.html)
30. [Monitor Delta Live Tables pipelines](https://docs.databricks.com/en/delta-live-tables/observability.html)
31. [Load data with Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/load.html)
32. [What is Delta Live Tables?](https://docs.databricks.com/en/delta-live-tables/index.html)
33. [What is the difference between streaming live tables and live tables? (Databricks Community)](https://community.databricks.com/t5/data-engineering/what-is-the-difference-between-streaming-live-table-and-live/m-p/17122#M11172)
34. [What are all the Delta things in Databricks?](https://docs.databricks.com/en/introduction/delta-comparison.html)
35. [Parameterized queries with PySpark](https://www.databricks.com/blog/parameterized-queries-pyspark)
36. [Recover from Structured Streaming query failures with workflows](https://docs.gcp.databricks.com/en/structured-streaming/query-recovery.html)
37. [Jobs API 2.0](https://docs.databricks.com/en/workflows/jobs/jobs-2.0-api.html)
38. [OPTIMIZE](https://docs.databricks.com/en/sql/language-manual/delta-optimize.html)
39. [Adding and Deleting Partitions in Delta Lake tables](https://delta.io/blog/2023-01-18-add-remove-partition-delta-lake/)
40. [What is the Databricks File System (DBFS)?](https://docs.databricks.com/en/dbfs/index.html)
41. [Mounting cloud object storage on Databricks](https://docs.databricks.com/en/dbfs/mounts.html)
42. [Databricks widgets](https://docs.databricks.com/en/notebooks/widgets.html)
43. [Performance Tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html)

--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
title: Databricks Certified Data Engineer Professional Questions
description: These are memory-based topics from the assessment I took in Jan 2024. If you found this useful, leave a ⭐ on the repo.
theme: jekyll-theme-cayman

--------------------------------------------------------------------------------
/files/ADEWD_Knowledge_Checks.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ADEWD_Knowledge_Checks.pdf
--------------------------------------------------------------------------------
/files/ade-mod-1-incremental-processing-with-spark-structured-streaming.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-1-incremental-processing-with-spark-structured-streaming.pdf
--------------------------------------------------------------------------------
/files/ade-mod-2-streaming-etl-patterns-with-dlt.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-2-streaming-etl-patterns-with-dlt.pdf
--------------------------------------------------------------------------------
/files/ade-mod-3-data-privacy-patterns.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-3-data-privacy-patterns.pdf
--------------------------------------------------------------------------------
/files/ade-mod-4-performance-optimization-with-spark-and-delta-lake.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-4-performance-optimization-with-spark-and-delta-lake.pdf
--------------------------------------------------------------------------------
/files/ade-mod-5-swe-practices-with-dlt.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-5-swe-practices-with-dlt.pdf
--------------------------------------------------------------------------------
/files/ade-mod-6-automate-production-workflows.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/ade-mod-6-automate-production-workflows.pdf
--------------------------------------------------------------------------------
/files/advanced-data-engineering-with-databricks.dbc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/advanced-data-engineering-with-databricks.dbc
--------------------------------------------------------------------------------
/files/advanced-data-engineering-with-databricks.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Amrit-Hub/Databricks-Certified-Data-Engineer-Professional-Questions/723f496aa5256f3a039f6adba43b02eed6460084/files/advanced-data-engineering-with-databricks.pdf --------------------------------------------------------------------------------