Last Attempt Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 25

Which of the following code blocks shuffles DataFrame transactionsDf, which has 8 partitions, so that it has 10 partitions?

Options:

transactionsDf.repartition(transactionsDf.getNumPartitions()+2)

transactionsDf.repartition(transactionsDf.rdd.getNumPartitions()+2)

transactionsDf.coalesce(10)

transactionsDf.coalesce(transactionsDf.getNumPartitions()+2)

transactionsDf.repartition(transactionsDf._partitions+2)

Question 26

Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code

block is run twice?

Options:

itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

itemsDf.sample(fraction=0.1, seed=87238)

itemsDf.sample(fraction=1000, seed=98263)

itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

itemsDf.sample(fraction=0.1)

Answer:

Explanation:

Explanation

itemsDf.sample(fraction=0.1, seed=87238)

Correct. If itemsDf has 10,000 rows, this code block returns about 1,000, since DataFrame.sample() is never guaranteed to return an exact amount of rows. To ensure you are not returning

duplicates, you should leave the withReplacement parameter at False, which is the default. Since the QUESTION NO: specifies that the same rows should be returned even if the code block is run

twice,

you need to specify a seed. The number passed in the seed does not matter as long as it is an integer.

itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

Incorrect. While this code block fulfills almost all requirements, it may return duplicates. This is because withReplacement is set to True.

Here is how to understand what replacement means: Imagine you have a bucket of 10,000 numbered balls and you need to take 1,000 balls at random from the bucket (similar to the problem in the

question). Now, if you would take those balls with replacement, you would take a ball, note its number, and put it back into the bucket, meaning the next time you take a ball from the bucket there

would be a chance you could take the exact same ball again. If you took the balls without replacement, you would leave the ball outside the bucket and not put it back in as you take the next 999

balls.

itemsDf.sample(fraction=1000, seed=98263)

Wrong. The fraction parameter needs to have a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.

itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

No, DataFrame.sampleBy() is meant for stratified sampling. This means that based on the values in a column in a DataFrame, you can draw a certain fraction of rows containing those values from

the DataFrame (more details linked below). In the scenario at hand, sampleBy is not the right operator to use because you do not have any information about any column that the sampling should

depend on.

itemsDf.sample(fraction=0.1)

Incorrect. This code block checks all the boxes except that it does not ensure that when you run it a second time, the exact same rows will be returned. In order to achieve this, you would have to

specify a seed.

More info:

- pyspark.sql.DataFrame.sample — PySpark 3.1.2 documentation

- pyspark.sql.DataFrame.sampleBy — PySpark 3.1.2 documentation

- Types of Samplings in PySpark 3. The explanations of the sampling… | by Pinar Ersoy | Towards Data Science

Question 27

Which of the following describes the conversion of a computational query into an execution plan in Spark?

Options:

Spark uses the catalog to resolve the optimized logical plan.

The catalog assigns specific resources to the optimized memory plan.

The executed physical plan depends on a cost optimization from a previous stage.

Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

The catalog assigns specific resources to the physical plan.

Month End Sale 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: save70

Last Attempt Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

CompTIA

Fortinet

Microsoft

Salesforce