Explanation:
To mitigate skew, Spark automatically disregards null values in keys when joining.
This statement is incorrect, and thus the correct answer to the question. Join keys that contain null values are of particular concern with regard to data skew.
In real-world applications, a table may contain a great number of records that have no value assigned to the column used as a join key. During the join, the data is at risk of becoming heavily
skewed: all records with a null join key are evaluated as a single large partition, in stark contrast to the potentially diverse key values (and therefore small
partitions) of the records with non-null keys.
Spark specifically does not handle this automatically. However, there are several strategies to mitigate the problem, such as temporarily discarding the null-key rows and merging them back in after the join, as sketched below (see also the last link
under "More info").
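A minimal PySpark sketch of this strategy; the facts and dims DataFrames, their column names, and their values are illustrative assumptions, not part of the original question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("null-key-join").getOrCreate()

# Illustrative DataFrames: facts contains rows whose join key is null.
facts = spark.createDataFrame(
    [(1, "a"), (None, "b"), (None, "c"), (2, "d")], ["id", "payload"]
)
dims = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "attr"])

# Join only the rows that actually have a key ...
joined = facts.where("id IS NOT NULL").join(dims, "id", "left")

# ... then merge the null-key rows back in, adding the joined column as
# null so both sides share the same schema.
null_rows = facts.where("id IS NULL").withColumn("attr", lit(None).cast("string"))
result = joined.unionByName(null_rows)
result.show()
```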
In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
This statement is correct. In fact, having very differently sized partitions is the very definition of skew. Skew can degrade Spark performance because the largest partition occupies a single executor for
a long time. This stalls the Spark job and wastes resources, since the executors that finished processing smaller partitions sit idle until the large partition is done.
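One simple way to spot such skew is to count the rows per partition. A minimal sketch; the DataFrame here is a stand-in, not from the original question:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Stand-in for a potentially skewed DataFrame.
df = spark.range(0, 1000000)

# Count the rows in each partition; a few very large counts next to many
# small ones point to skew.
(df.groupBy(spark_partition_id().alias("partition"))
   .count()
   .orderBy("count", ascending=False)
   .show())
```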
Salting can resolve data skew.
This statement is correct. The purpose of salting is to give Spark an opportunity to repartition data into partitions of similar size, based on a salted partitioning key.
A salted partitioning key is typically a column of uniformly distributed random numbers. The number of unique values in that column should match your
desired number of partitions. After repartitioning by the salted key, all partitions should be roughly the same size, as the sketch below shows.
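A minimal salting sketch in PySpark; the partition count and the DataFrame are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, floor, rand

spark = SparkSession.builder.appName("salting").getOrCreate()

num_partitions = 32  # assumed target number of partitions

# Stand-in for a skewed DataFrame.
df = spark.range(0, 1000000)

# Add a salt column of uniformly distributed integers in [0, num_partitions)
# and repartition by it; the resulting partitions are roughly equal in size.
salted = df.withColumn("salt", floor(rand() * num_partitions).cast("int"))
balanced = salted.repartition(num_partitions, col("salt"))
```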
Spark does not automatically optimize skew joins by default.
This statement is correct. Automatic skew join optimization is a feature of Adaptive Query Execution (AQE). As of Spark 3.0, AQE is disabled by default. To enable it, Spark's spark.sql.adaptive.enabled
configuration option needs to be set to true instead of being left at its default of false.
For Spark to automatically optimize skew joins, the spark.sql.adaptive.skewJoin.enabled option also needs to be set to true, which it is by default.
When skew join optimization is enabled, Spark recognizes skew joins and optimizes them by splitting oversized partitions into smaller ones, which improves performance.
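The corresponding configuration might look like the following sketch, set at runtime on an existing session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-skew-join").getOrCreate()

# Skew join optimization only takes effect once AQE itself is enabled.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Already true by default; set explicitly here for clarity.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```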
Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
This statement is correct. Broadcast joins can indeed help increase join performance for skewed data, under certain conditions. One of the DataFrames to be joined needs to be small enough to fit
into each executor's memory, alongside a partition of the other DataFrame. If that is the case, a broadcast join increases join performance over a sort-merge join.
The reason is that a sort-merge join on skewed data involves excessive shuffling. During a shuffle, data is sent across the cluster, ultimately slowing down the Spark application; with skewed data,
the amount of data moved, and thus the slowdown, is particularly large.
Broadcast joins, however, reduce shuffling: the smaller table is copied to every executor, eliminating a great amount of network traffic and thereby increasing join performance relative
to the sort-merge join.
It is worth noting that, to optimize skew join behavior, it may make sense to manually adjust Spark's spark.sql.autoBroadcastJoinThreshold configuration property if the smaller DataFrame is bigger
than the default of 10 MB, as in the sketch below.
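A minimal sketch combining both options; the 50 MB threshold and the two DataFrames are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

# Raise the automatic broadcast threshold from the default 10 MB to 50 MB
# so a somewhat larger dimension table is still broadcast automatically.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

large = spark.range(0, 10000000).withColumnRenamed("id", "key")
small = spark.range(0, 1000).withColumnRenamed("id", "key")

# Alternatively, force a broadcast of the small side with an explicit hint.
joined = large.join(broadcast(small), "key")
```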
More info:
- Performance Tuning - Spark 3.0.0 Documentation
- Data Skew and Garbage Collection to Improve Spark Performance
- Section 1.2 - Joins on Skewed Data • GitBook