Latest Databricks Databricks-Certified-Professional-Data-Engineer Dumps PDF Questions Answers 2025

Databricks Certified Data Engineer Professional Exam Questions and Answers

Question 1

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?

Options:

Create a separate history table for each pk_id; resolve the current state of the table by running a union all, filtering the history tables for the most recent state.

Use merge into to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.

Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.

Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.

Ingest all log information into a bronze table; use merge into to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.
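As a rough illustration of the "most recent entry for each pk_id" idea that several options mention, the sketch below keeps every change record and derives the current state from the latest change per key. It is plain Python, not Delta Lake code. The fields change_type, change_seq, and values, and both helper names, are assumptions made for the example and are not part of the question.

```python
def latest_change_per_key(batch):
    """Return the most recent change record for each pk_id in the batch."""
    latest = {}
    for rec in batch:
        key = rec["pk_id"]
        # Keep the record with the highest (assumed) sequence number.
        if key not in latest or rec["change_seq"] > latest[key]["change_seq"]:
            latest[key] = rec
    return latest


def apply_changes(current, latest):
    """Apply insert/update/delete records to a copy of the current state."""
    result = dict(current)
    for key, rec in latest.items():
        if rec["change_type"] == "delete":
            result.pop(key, None)
        else:
            result[key] = rec["values"]
    return result


# The full log is retained unchanged (audit history); only the derived
# current state is collapsed to one row per key.
history = [
    {"pk_id": 1, "change_type": "insert", "change_seq": 1, "values": {"name": "a"}},
    {"pk_id": 1, "change_type": "update", "change_seq": 2, "values": {"name": "b"}},
    {"pk_id": 2, "change_type": "insert", "change_seq": 3, "values": {"name": "c"}},
    {"pk_id": 2, "change_type": "delete", "change_seq": 4, "values": {"name": "c"}},
]
current_state = apply_changes({}, latest_change_per_key(history))
print(current_state)
```

Here pk_id 1 was changed twice and pk_id 2 was inserted and then deleted within the same batch, so only the final state of pk_id 1 survives in current_state while history still holds all four records.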


Question 2

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

date

post_id

user_id

post_time

Question 3

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.

The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.

Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

Options:

Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.

Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.

Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.

Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.

Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.

Question 4

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Options:

Three columns will be returned, but one column will be named "redacted" and contain only null values.

Only the email and ltv columns will be returned; the email column will contain all null values.

The email and ltv columns will be returned with the values in user_ltv.

The email, age, and ltv columns will be returned with the values in user_ltv.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

Question 5

A Delta table of weather records is partitioned by date and has the below schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

Options:

All records are cached to an operational database and then the filter is applied

The Parquet file footers are scanned for min and max statistics for the latitude column

All records are cached to attached storage and then the filter is applied

The Delta log is scanned for min and max statistics for the latitude column

The Hive metastore is scanned for min and max statistics for the latitude column
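The general idea of skipping files by per-file min and max statistics, which several of the options above refer to, can be sketched in plain Python. The FileStats class, the file names, and the statistics values are all hypothetical; the sketch says nothing about where a real engine stores these statistics.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_latitude: float
    max_latitude: float


def files_to_read(files, threshold):
    """Select files that could contain rows with latitude > threshold.

    A file can be skipped safely when its maximum latitude is not above
    the threshold, because no row in it can satisfy the filter.
    """
    return [f.path for f in files if f.max_latitude > threshold]


files = [
    FileStats("part-0.parquet", -10.0, 35.5),
    FileStats("part-1.parquet", 40.0, 70.2),
    FileStats("part-2.parquet", 66.5, 89.0),
]
selected = files_to_read(files, 66.3)
print(selected)
```

Only the first file is skipped; the second must still be read even though most of its rows fall below the threshold, because statistics describe the whole file rather than individual rows.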

Question 6

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG

post_text STRING

post_id STRING

longitude FLOAT

latitude FLOAT

post_time TIMESTAMP

date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

date

user_id

post_id

post_time

Answer: date

Explanation:

Partitioning a Delta Lake table is a strategy used to improve query performance by dividing the table into distinct segments based on the values of a specific column. This approach allows queries to scan only the relevant partitions, thereby reducing the amount of data read and enhancing performance.

Considerations for Choosing a Partition Column:

Cardinality: Columns with high cardinality (i.e., a large number of unique values) are generally poor choices for partitioning. High cardinality can lead to a large number of small partitions, which can degrade performance.

Query Patterns: The partition column should align with common query filters. If queries frequently filter data based on a particular column, partitioning by that column can be beneficial.

Partition Size: Each partition should ideally contain at least 1 GB of data. Smaller partitions produce many small files and a large number of partitions, which adds overhead instead of reducing the amount of data scanned.

Evaluation of Columns:

date:

Cardinality: Typically low relative to the number of rows, since many posts share the same calendar date.

Query Patterns: Many analytical queries filter data based on date ranges.

Partition Size: Likely to meet the 1 GB threshold per partition, depending on data volume.

user_id:

Cardinality: High, as each user has a unique ID.

Query Patterns: While some queries might filter by user_id, the high cardinality makes it unsuitable for partitioning.

Partition Size: Partitions could be too small, leading to inefficiencies.

post_id:

Cardinality: Extremely high, with each post having a unique ID.

Query Patterns: Unlikely to be used for filtering large datasets.

Partition Size: Each partition would be very small, resulting in a large number of partitions.

post_time:

Cardinality: High, especially if it includes exact timestamps.

Query Patterns: Queries might filter by time, but the high cardinality poses challenges.

Partition Size: Similar to user_id, partitions could be too small.

Conclusion:

Given the considerations, the date column is the most suitable candidate for partitioning. It has low cardinality, aligns with common query patterns, and is likely to result in appropriately sized partitions.
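The cardinality and partition-size reasoning above can be checked with simple arithmetic. The sketch below is plain Python; the table size and the distinct-value counts are invented for illustration, and the helper names are assumptions. It also assumes data is spread evenly across partition values, which real tables rarely satisfy exactly.

```python
GB = 1024 ** 3


def avg_partition_bytes(total_bytes, distinct_values):
    """Average partition size if data were spread evenly across values."""
    return total_bytes / distinct_values


def is_reasonable_partition_column(total_bytes, distinct_values, min_partition_bytes=GB):
    """Apply the 'at least 1 GB per partition' guideline from the explanation."""
    return avg_partition_bytes(total_bytes, distinct_values) >= min_partition_bytes


# Hypothetical numbers: a 2 TiB table covering two years of posts.
table_bytes = 2 * 1024 * GB
distinct_counts = {
    "date": 730,
    "user_id": 5_000_000,
    "post_id": 400_000_000,
    "post_time": 390_000_000,
}

suitable = [
    column
    for column, n in distinct_counts.items()
    if is_reasonable_partition_column(table_bytes, n)
]
print(suitable)
```

With these assumed figures, date yields partitions of roughly 2.8 GB each, while the other three columns would produce partitions far below the 1 GB guideline.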

References: Delta Lake Best Practices; Partitioning in Delta Lake

Question 7

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

Options:

userLookup.join(streamingDF, ["userid"], how="inner")

streamingDF.join(userLookup, ["user_id"], how="outer")

streamingDF.join(userLookup, ["user_id"], how="left")

streamingDF.join(userLookup, ["userid"], how="inner")

userLookup.join(streamingDF, ["user_id"], how="right")
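The options above differ in which side of the join is the stream and in the join type. The sketch below encodes, in plain Python, the stream-static rules as described in the support matrix of the Spark Structured Streaming programming guide: inner joins are allowed, an outer join is allowed only when it preserves the streaming side, and a full outer join is not allowed. The function name is hypothetical, and semi, anti, and cross joins are deliberately left out.

```python
def is_supported_stream_static_join(left_is_stream, how):
    """Return True if a stream-static join of this shape is supported.

    left_is_stream: True for streamingDF.join(staticDF, ...),
                    False for staticDF.join(streamingDF, ...).
    how: join type string as passed to DataFrame.join.
    """
    how = how.lower()
    if how == "inner":
        return True
    if how in ("left", "leftouter", "left_outer"):
        # A left outer join keeps every row of the left side,
        # so the left side must be the stream.
        return left_is_stream
    if how in ("right", "rightouter", "right_outer"):
        # Mirror case: the right side must be the stream.
        return not left_is_stream
    if how in ("outer", "full", "fullouter", "full_outer"):
        # A full outer join would need to track unmatched static rows
        # against an unbounded stream.
        return False
    raise ValueError(f"join type not modeled in this sketch: {how}")


# (left_is_stream, how) for each shape that appears in the options.
shapes = [
    (False, "inner"),
    (True, "outer"),
    (True, "left"),
    (True, "inner"),
    (False, "right"),
]
results = [is_supported_stream_static_join(s, h) for s, h in shapes]
print(results)
```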

Question 8

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id.

Which statement describes what the number alongside this field represents?

Options:

The job_id is returned in this field.

The job_id and number of times the job has been run are concatenated and returned.

The number of times the job definition has been run in the workspace.

The globally unique ID of the newly triggered run.

Question 9

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

Options:

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Question 10

Which is a key benefit of an end-to-end test?

Options:

It closely simulates real world usage of your application.

It pinpoints errors in the building blocks of your application.

It provides testing coverage for all code paths and branches.

It makes it easier to automate your test suite.

Question 11

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

Delta Lake statistics are not optimized for free text fields with high cardinality.

Text data cannot be stored with Delta Lake.

ZORDER ON review will need to be run to see performance gains.

The Delta log creates a term matrix for free text fields to support selective filtering.

Delta Lake statistics are only collected on the first 4 columns in a table.

Question 12

Which statement characterizes the general programming model used by Spark Structured Streaming?

Options:

Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.

Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.

Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.

Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.

Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.

Question 13

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?

Options:

The job_id and number of times the job has been run are concatenated and returned.

The globally unique ID of the newly triggered run.

The job_id is returned in this field.

The number of times the job definition has been run in this workspace.

Question 14

A DLT pipeline includes the following streaming tables:

raw_iot ingests raw device measurement data from a heart rate tracking device.

bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to be able to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?

Options:

Set the skipChangeCommits flag to true on bpm_stats

Set the skipChangeCommits flag to true on raw_iot

Set the pipelines.reset.allowed property to false on bpm_stats

Set the pipelines.reset.allowed property to false on raw_iot

Question 15

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

MERGE INTO customers

USING (

SELECT updates.customer_id as merge_key, updates.*

FROM updates

UNION ALL

SELECT NULL as merge_key, updates.*

FROM updates JOIN customers

ON updates.customer_id = customers.customer_id

WHERE customers.current = true AND updates.address <> customers.address

) staged_updates

ON customers.customer_id = merge_key

WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN

UPDATE SET current = false, end_date = staged_updates.effective_date

WHEN NOT MATCHED THEN

INSERT (customer_id, address, current, effective_date, end_date)

VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?

Options:

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.
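To trace what the MERGE statement in this question does to individual rows, its effect can be mimicked on in-memory rows. The sketch below is plain Python and only approximates the SQL: the function name apply_updates is hypothetical, the sample rows are invented, and it assumes at most one update per customer_id in a batch.

```python
def apply_updates(customers, updates):
    """Mimic the MERGE: close changed current rows and insert new ones."""
    result = [dict(row) for row in customers]  # leave the input untouched
    for upd in updates:
        same_customer = [r for r in result if r["customer_id"] == upd["customer_id"]]
        current = next((r for r in same_customer if r["current"]), None)

        if same_customer and (current is None or current["address"] == upd["address"]):
            # Matched on merge_key but the extra MATCHED condition fails,
            # and no NULL-keyed row was staged: nothing happens.
            continue

        if current is not None:
            # WHEN MATCHED ... THEN UPDATE: mark the old row as no longer current.
            current["current"] = False
            current["end_date"] = upd["effective_date"]

        # WHEN NOT MATCHED THEN INSERT: new customers and NULL-keyed rows.
        result.append({
            "customer_id": upd["customer_id"],
            "address": upd["address"],
            "current": True,
            "effective_date": upd["effective_date"],
            "end_date": None,
        })
    return result


customers = [
    {"customer_id": 1, "address": "1 Old Rd", "current": True,
     "effective_date": "2024-01-01", "end_date": None},
]
updates = [
    {"customer_id": 1, "address": "9 New St", "effective_date": "2024-02-01"},
    {"customer_id": 2, "address": "5 Elm Ave", "effective_date": "2024-02-01"},
]
merged = apply_updates(customers, updates)
for row in merged:
    print(row)
```

After the call, customer 1 has two rows (the original one closed with an end_date, plus a new current row) and customer 2 has a single current row.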

Question 16

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

Options:

Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.

Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.

Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.

Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.

Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.
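The timing in this scenario can be worked through with date arithmetic. The sketch below is plain Python using the standard datetime module; the calendar dates are arbitrary examples, the function name is hypothetical, and the 7-day retention value is an assumption taken from the option text rather than read from any table configuration.

```python
from datetime import datetime, timedelta


def first_vacuum_removing(delete_time, first_vacuum, vacuum_interval, retention):
    """Return the first scheduled VACUUM run at which files invalidated at
    delete_time are at least `retention` old and can be removed."""
    run = first_vacuum
    while run - delete_time < retention:
        run += vacuum_interval
    return run


# Sunday deletion job starts at 1am and finishes within the hour.
delete_done = datetime(2024, 1, 7, 2, 0)
# The first VACUUM after the deletion runs Monday at 3am, then weekly.
monday_vacuum = datetime(2024, 1, 8, 3, 0)

purge_time = first_vacuum_removing(
    delete_time=delete_done,
    first_vacuum=monday_vacuum,
    vacuum_interval=timedelta(days=7),
    retention=timedelta(days=7),  # assumed retention threshold
)
print(purge_time, purge_time - delete_done)
```

Under these assumptions, the Monday run one day after the deletion leaves the files in place because they are only about a day old, and they are removed at the following Monday's run, about 8 days after the delete job.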

Question 17

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table

Which of the following describes how results are generated each time the dashboard is updated?

Options:

The total count of rows is calculated by scanning all data files

The total count of rows will be returned from cached results unless REFRESH is run

The total count of records is calculated from the Delta transaction logs

The total count of records is calculated from the Parquet file metadata

The total count of records is calculated from the Hive metastore

Question 18

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Options:

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Question 19

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:

The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.

Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.

Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.

Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

Question 20

A Delta Lake table was created with the below query:

Consider the following query:

DROP TABLE prod.sales_by_store

If this statement is executed by a workspace admin, which result will occur?

Options:

Nothing will occur until a COMMIT command is executed.

The table will be removed from the catalog but the data will remain in storage.

The table will be removed from the catalog and the data will be deleted.

An error will occur because Delta Lake prevents the deletion of production data.

Data will be marked as deleted but still recoverable with Time Travel.

Question 21

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

The Five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Total Disk Space remains constant

Network I/O never spikes

Overall cluster CPU utilization is around 25%

Question 22

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

Options:

Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.

Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.

Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.

Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.

Question 23

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 & longitude > -20

Which statement describes how data will be filtered?

Options:

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.

The Delta Engine will scan the Parquet file footers to identify each row that meets the filter criteria.

Question 24

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:

key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personal Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.

Which of the following solutions meets the requirements?

Options:

All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.

Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.

Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.

Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.

Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.

Question 25

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?

Options:

A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.

The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.

An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.

An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.

No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.

Question 26

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

Options:

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Question 27

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

Options:

Use %pip install in a notebook cell

Run source env/bin/activate in a notebook setup script

Install libraries from PyPI using the cluster UI

Use %sh install in a notebook cell

Question 28

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

Options:

Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account_current table.

Overwrite the account_current table with each batch using the results of a query against the account_history table, grouping by user_id and filtering for the max value of last_updated.

Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.

Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.

Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.

Question 29

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

Options:

The Five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Network I/O never spikes

Total Disk Space remains constant

CPU Utilization is around 75%

Question 30

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Options:

return spark.readStream.table("bronze")

return spark.readStream.load("bronze")

return spark.read.option("readChangeFeed", "true").table("bronze")

Question 31

What is the first line of a Databricks Python notebook when viewed in a text editor?

Options:

%python

# Databricks notebook source

-- Databricks notebook source

// Databricks notebook source

Question 32

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

Options:

Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

Z-order indices calculated on the table are preventing file compaction

Bloom filter indices calculated on the table are preventing file compaction

Databricks has autotuned to a smaller target file size based on the overall size of data in the table

Databricks has autotuned to a smaller target file size based on the amount of data in each partition

Question 33

Which statement describes integration testing?

Options:

Validates interactions between subsystems of your application

Requires an automated testing framework

Requires manual intervention

Validates an application use case

Validates behavior of individual elements of your application

Question 34

Which statement describes the correct use of pyspark.sql.functions.broadcast?

Options:

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
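
The idea behind a broadcast join can be sketched without Spark: every worker receives a full copy of the small side, so the large side is joined in place and never has to be shuffled. The data and the `join_partition` helper below are hypothetical, and plain lists stand in for partitions.

```python
from typing import Dict, List, Tuple

# Hypothetical data: a large fact table split into partitions and a small lookup table.
sales_partitions: List[List[Tuple[str, int]]] = [
    [("US", 10), ("DE", 5)],
    [("US", 7), ("FR", 3)],
]
country_names: Dict[str, str] = {"US": "United States", "DE": "Germany", "FR": "France"}


def join_partition(
    partition: List[Tuple[str, int]], lookup: Dict[str, str]
) -> List[Tuple[str, int, str]]:
    """Join one partition against a full local copy of the small table."""
    return [(code, amount, lookup[code]) for code, amount in partition if code in lookup]


# Each simulated executor gets its own copy of the small table.
joined = [
    row
    for part in sales_partitions
    for row in join_partition(part, dict(country_names))
]
```

In PySpark the hint is applied to a whole DataFrame, for example `large_df.join(broadcast(small_df), "key")`, where `large_df`, `small_df`, and `"key"` are placeholder names.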

Question 35

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impressions led to monetizable clicks.

Which solution would improve the performance?

Options:

Option A

Option B

Option C

Option D

Question 36

Which distribution does Databricks support for installing custom Python code packages?

Options:

sbt

CRAN

CRAM

npm

Wheels

jars

Question 37

What statement is true regarding the retention of job run history?

Options:

It is retained until you export or delete job run logs

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3

It is retained for 60 days, during which you can export notebook run results to HTML

It is retained for 60 days, after which logs are archived

It is retained for 90 days or until the run-id is re-used through custom run configuration

Question 38

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

withWatermark("event_time", "10 minutes")

awaitArrival("event_time", "10 minutes")

await("event_time + '10 minutes'")

slidingWindow("event_time", "10 minutes")

delayWrite("event_time", "10 minutes")
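
The interaction between non-overlapping five-minute windows and a ten-minute lateness threshold can be simulated in plain Python. This is a simplified sketch, not Spark's implementation: the threshold is advanced per event rather than per micro-batch, times are plain seconds, and the function name and sample readings are hypothetical.

```python
from collections import defaultdict

WINDOW = 5 * 60      # non-overlapping five-minute windows, in seconds
LATENESS = 10 * 60   # how long state is kept for late-arriving data


def windowed_averages(events):
    """events: iterable of (event_time_seconds, temp, humidity) in arrival order."""
    # window start -> [temp sum, humidity sum, count]
    state = defaultdict(lambda: [0.0, 0.0, 0])
    max_seen = 0
    for event_time, temp, humidity in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - LATENESS
        window_start = event_time - event_time % WINDOW
        if window_start + WINDOW <= watermark:
            continue  # the window is already closed, so the late event is dropped
        acc = state[window_start]
        acc[0] += temp
        acc[1] += humidity
        acc[2] += 1
    return {start: (t / n, h / n) for start, (t, h, n) in sorted(state.items())}


# Hypothetical readings; the last one arrives far too late and is ignored.
result = windowed_averages([
    (10, 20.0, 40.0),
    (200, 22.0, 44.0),
    (1300, 25.0, 50.0),
    (250, 99.0, 99.0),
])
```

In PySpark the corresponding pieces are `withWatermark("event_time", "10 minutes")` on the streaming DataFrame and `window("event_time", "5 minutes")` inside the `groupBy`.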

Question 39

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Options:

Both commands will succeed. Executing show tables will show that countries_af and sales_af have been registered as views.

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.

Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a PySpark DataFrame.

Both commands will fail. No new variables, tables, or views will be created.

Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.

Question 40

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

Options:

"Can Manage" privileges on the required cluster

Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster

Cluster creation allowed. "Can Attach To" privileges on the required cluster

"Can Restart" privileges on the required cluster

Cluster creation allowed. "Can Restart" privileges on the required cluster

Exam Detail

Vendor: Databricks

Certification: Databricks Certification

Exam Code: Databricks-Certified-Professional-Data-Engineer

Exam Name: Databricks Certified Data Engineer Professional Exam

Last Update: Nov 3, 2025
