Databricks Certified Data Engineer Professional Exam Questions and Answers

Question 5

A data engineer is implementing Unity Catalog governance for a multi-team environment. Data scientists need interactive clusters for basic data exploration tasks, while automated ETL jobs require dedicated processing.

How should the data engineer configure cluster isolation policies to enforce least privilege and ensure Unity Catalog compliance?

Options:

A. Use only DEDICATED access mode for both interactive workloads and automated jobs to maximize security isolation.

B. Allow all users to create any cluster type and rely on manual configuration to enable Unity Catalog access modes.

C. Configure all clusters with NO ISOLATION SHARED access mode since Unity Catalog works with any cluster configuration.

D. Create compute policies with STANDARD access mode for interactive workloads and DEDICATED access mode for automated jobs.
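As a rough illustration of what option D describes, a compute policy can pin the access mode through the `data_security_mode` attribute. The sketch below is an assumption-laden example, not a definitive configuration: it assumes the legacy API value names `USER_ISOLATION` (Standard) and `SINGLE_USER` (Dedicated) and a `cluster_type` restriction, and the variable names are hypothetical. Check the accepted values against the workspace's policy reference before using them.

```python
import json

# Hypothetical policy for interactive exploration clusters.
# Assumption: "USER_ISOLATION" is the API value behind Standard access mode.
interactive_policy = {
    "data_security_mode": {"type": "fixed", "value": "USER_ISOLATION"},
    "cluster_type": {"type": "fixed", "value": "all-purpose"},
}

# Hypothetical policy for automated ETL job clusters.
# Assumption: "SINGLE_USER" is the API value behind Dedicated access mode.
jobs_policy = {
    "data_security_mode": {"type": "fixed", "value": "SINGLE_USER"},
    "cluster_type": {"type": "fixed", "value": "job"},
}

# Policy definitions are submitted as JSON documents.
interactive_json = json.dumps(interactive_policy, indent=2)
jobs_json = json.dumps(jobs_policy, indent=2)
print(interactive_json)
print(jobs_json)
```

Because both attributes use the `fixed` type, a user creating a cluster under either policy could not switch it to a different access mode.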

Question 6

Which statement describes the correct use of pyspark.sql.functions.broadcast?

Options:

A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

C. It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

E. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
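For background on the term "broadcast join" used in these options, the following plain-Python sketch mimics the general idea: the small side is turned into an in-memory lookup that every partition of the large side can probe locally, so the large side is never shuffled. This is a conceptual model only, not Spark's implementation; the function name and the sample rows are hypothetical.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Join each partition of the large side against one shared lookup.

    In PySpark the equivalent request is written as
    large_df.join(broadcast(small_df), "key"), with broadcast imported
    from pyspark.sql.functions.
    """
    # Build the lookup once from the small side; in a cluster, every
    # executor would hold its own in-memory copy of this structure.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined = []
    for partition in large_partitions:
        # Each partition probes the lookup locally, with no data exchange
        # between partitions of the large side.
        for key, payload in partition:
            for value in lookup.get(key, []):
                joined.append((key, payload, value))
    return joined


# Hypothetical sample data: two partitions of a large table, one small table.
large_partitions = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
small_rows = [(1, "US"), (2, "DE")]
result = broadcast_hash_join(large_partitions, small_rows)
print(result)
```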

Question 7

A security analytics pipeline must enrich billions of raw connection logs with geolocation data. The join hinges on finding which IPv4 range each event’s address falls into.

Table 1: network_events (≈ 5 billion rows)

event_id    ip_int
42          3232235777

Table 2: ip_ranges (≈ 2 million rows)

start_ip_int    end_ip_int    country
3232235520      3232236031    US

The query is currently very slow:

SELECT n.event_id, n.ip_int, r.country
FROM network_events n
JOIN ip_ranges r
  ON n.ip_int BETWEEN r.start_ip_int AND r.end_ip_int;

Question:

Which change will most dramatically accelerate the query while preserving its logic?

Options:

A. Increase spark.sql.shuffle.partitions from 200 to 10000.

B. Add a range-join hint /*+ RANGE_JOIN(r, 65536) */.

C. Force a sort-merge join with /*+ MERGE(r) */.

D. Add a broadcast hint: /*+ BROADCAST(r) */ for ip_ranges.
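To make the bin-size argument in option B concrete, here is a minimal plain-Python sketch of the binning idea behind a range join: each interval is registered in every bin it overlaps, each point is assigned to a single bin, and the exact BETWEEN check runs only within matching bins. It is a simplified model under the assumption that bins are computed by integer division; the function names are hypothetical and the engine's actual algorithm may differ.

```python
from collections import defaultdict

BIN_SIZE = 65536  # same numeric value as the hint in option B


def build_bin_index(ranges, bin_size=BIN_SIZE):
    """Register every (start, end, country) interval in each bin it overlaps."""
    index = defaultdict(list)
    for start, end, country in ranges:
        for b in range(start // bin_size, end // bin_size + 1):
            index[b].append((start, end, country))
    return index


def range_join(events, ranges, bin_size=BIN_SIZE):
    """Match each event to the ranges containing its ip_int."""
    index = build_bin_index(ranges, bin_size)
    matches = []
    for event_id, ip_int in events:
        # Only intervals sharing the event's bin are candidates, so most
        # of the ranges table is never compared against this event.
        for start, end, country in index.get(ip_int // bin_size, ()):
            if start <= ip_int <= end:  # the original BETWEEN predicate
                matches.append((event_id, ip_int, country))
    return matches


# Sample rows taken from the two tables in the question.
events = [(42, 3232235777)]
ranges = [(3232235520, 3232236031, "US")]
result = range_join(events, ranges)
print(result)
```

A smaller bin size narrows the candidate list per event but copies wide intervals into more bins, which is the trade-off the numeric parameter controls in this model.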

Question 8

The data engineering team has been tasked with configuring connections to an external database that does not have a natively supported connector in Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

Options:

A. “Read” permissions should be set on a secret key mapped to those credentials that will be used by a given team.

B. No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

C. “Read” permissions should be set on a secret scope containing only those credentials that will be used by a given team.

D. “Manage” permission should be set on a secret scope containing only those credentials that will be used by a given team.