Explanation:
To mitigate skew, Spark automatically disregards null values in keys when joining.
This statement is incorrect, and thus the correct answer to the question. Join keys that contain null values are of particular concern with regard to data skew.
In real-world applications, a table may contain a great number of records that have no value assigned to the column used as a join key. During the join, the data is at risk of becoming heavily
skewed: all records with a null join key are evaluated as a single large partition, in stark contrast to the potentially diverse key values (and therefore small
partitions) of the records with non-null keys.
Spark specifically does not handle this automatically. However, there are several strategies to mitigate the problem, such as temporarily discarding the null-key rows and merging them back in after the join, as sketched below (see also the last link
under "More info").
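A minimal PySpark sketch of this strategy; the facts and dims DataFrames, their column names, and their values are illustrative assumptions, not part of the original question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("null-key-join").getOrCreate()

# Illustrative DataFrames: facts contains rows whose join key is null.
facts = spark.createDataFrame(
    [(1, "a"), (None, "b"), (None, "c"), (2, "d")], ["id", "payload"]
)
dims = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "attr"])

# Join only the rows that actually have a key ...
joined = facts.where("id IS NOT NULL").join(dims, "id", "left")

# ... then merge the null-key rows back in, adding the joined column as
# null so both sides share the same schema.
null_rows = facts.where("id IS NULL").withColumn("attr", lit(None).cast("string"))
result = joined.unionByName(null_rows)
result.show()
```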
In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
This statement is correct. In fact, having very differently sized partitions is the very definition of skew. Skew can degrade Spark performance because the largest partition occupies a single executor for
a long time. This stalls the Spark job and wastes resources, since the executors that finished processing smaller partitions sit idle until the large partition is done.
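One simple way to spot such skew is to count the rows per partition. A minimal sketch; the DataFrame here is a stand-in, not from the original question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Stand-in for a potentially skewed DataFrame.
df = spark.range(0, 1000000)

# Count the rows in each partition; a few very large counts next to many
# small ones point to skew.
(df.groupBy(spark_partition_id().alias("partition"))
   .count()
   .orderBy("count", ascending=False)
   .show())
```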
Salting can resolve data skew.
This statement is correct. The purpose of salting is to give Spark an opportunity to repartition data into partitions of similar size, based on a salted partitioning key.
A salted partitioning key is typically a column of uniformly distributed random numbers. The number of unique values in that column should match your
desired number of partitions. After repartitioning by the salted key, all partitions should be roughly the same size, as the sketch below shows.
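A minimal salting sketch in PySpark; the partition count and the DataFrame are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor, rand

spark = SparkSession.builder.appName("salting").getOrCreate()

num_partitions = 32  # assumed target number of partitions

# Stand-in for a skewed DataFrame.
df = spark.range(0, 1000000)

# Add a salt column of uniformly distributed integers in [0, num_partitions)
# and repartition by it; the resulting partitions are roughly equal in size.
salted = df.withColumn("salt", floor(rand() * num_partitions).cast("int"))
balanced = salted.repartition(num_partitions, col("salt"))
```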
Spark does not automatically optimize skew joins by default.
This statement is correct. Automatic skew join optimization is a feature of Adaptive Query Execution (AQE). As of Spark 3.0, AQE is disabled by default. To enable it, Spark's spark.sql.adaptive.enabled
configuration option needs to be set to true instead of being left at its default of false.
For Spark to automatically optimize skew joins, the spark.sql.adaptive.skewJoin.enabled option also needs to be set to true, which it is by default.
When skew join optimization is enabled, Spark recognizes skew joins and optimizes them by splitting oversized partitions into smaller ones, which improves performance.
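The corresponding configuration might look like the following sketch, set at runtime on an existing session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-skew-join").getOrCreate()

# Skew join optimization only takes effect once AQE itself is enabled.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Already true by default; set explicitly here for clarity.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```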
Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
This statement is correct. Broadcast joins can indeed help increase join performance for skewed data, under certain conditions. One of the DataFrames to be joined needs to be small enough to fit
into each executor's memory, alongside a partition of the other DataFrame. If that is the case, a broadcast join increases join performance over a sort-merge join.
The reason is that a sort-merge join on skewed data involves excessive shuffling. During a shuffle, data is sent across the cluster, ultimately slowing down the Spark application; with skewed data,
the amount of data moved, and thus the slowdown, is particularly large.
Broadcast joins, however, reduce shuffling: the smaller table is copied to every executor, eliminating a great amount of network traffic and thereby increasing join performance relative
to the sort-merge join.
It is worth noting that, to optimize skew join behavior, it may make sense to manually adjust Spark's spark.sql.autoBroadcastJoinThreshold configuration property if the smaller DataFrame is bigger
than the default of 10 MB, as in the sketch below.
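A minimal sketch combining both options; the 50 MB threshold and the two DataFrames are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

# Raise the automatic broadcast threshold from the default 10 MB to 50 MB
# so a somewhat larger dimension table is still broadcast automatically.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

large = spark.range(0, 10000000).withColumnRenamed("id", "key")
small = spark.range(0, 1000).withColumnRenamed("id", "key")

# Alternatively, force a broadcast of the small side with an explicit hint.
joined = large.join(broadcast(small), "key")
```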
More info:
- Performance Tuning - Spark 3.0.0 Documentation
- Data Skew and Garbage Collection to Improve Spark Performance
- Section 1.2 - Joins on Skewed Data • GitBook