Helping Hand Questions for Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 17

Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms where column season is of data type string and column wind_speed_ms is of data type

double?

Options:

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

1. from pyspark.sql import types as T

2. spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))

spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

Answer:

Explanation:

Explanation

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

Correct. This command uses the Spark Session's createDataFrame method to create a new DataFrame. Notice how rows, columns, and column names are passed in here: The rows are specified

as a Python list. Every entry in the list is a new row. Columns are specified as Python tuples (for example ("summer", 4.5)). Every column is one entry in the tuple.

The column names are specified as the second argument to createDataFrame(). The documentation (link below) shows that "when schema is a list of column names, the type of each column will be

inferred from data" (the first argument). Since values 4.5 and 7.5 are both float variables, Spark will correctly infer the double type for column wind_speed_ms. Given that all values in column

"season" contain only strings, Spark will cast the column appropriately as string.

Find out more about SparkSession.createDataFrame() via the link below.

spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

No, the SparkSession does not have a newDataFrame method.

from pyspark.sql import types as T

spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))

No. pyspark.sql.types does not have a CharType type. See link below for available data types in Spark.

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

No, this is not correct Spark syntax. If you have considered this option to be correct, you may have some experience with Python's pandas package, in which this would be correct syntax. To create

a Spark DataFrame from a Pandas DataFrame, you can simply use spark.createDataFrame(pandasDf) where pandasDf is the Pandas DataFrame.

Find out more about Spark syntax options using the examples in the documentation for SparkSession.createDataFrame linked below.

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

No, the Spark Session (indicated by spark in the code above) does not have a DataFrame method.

More info: pyspark.sql.SparkSession.createDataFrame — PySpark 3.1.1 documentation and Data Types - Spark 3.1.2 Documentation

Static notebook | Dynamic notebook: See test 1, QUESTION NO: 41 (Databricks import instructions)

Question 18

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively.

Find the error.

Code block:

1.spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

Instead of calling spark.createDataFrame, just DataFrame should be called.

Options:

The commas in the tuples with the colors should be eliminated.

The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.

Instead of color, a data type should be specified.

The "color" expression needs to be wrapped in brackets, so it reads ["color"].

Question 19

The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame

transactionsDf. Find the errors.

Code block:

transactionsDf.select([col(productId), col(f)])

Sample of transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.+-------------+---------+-----+-------+---------+----+

Options:

The column names should be listed directly as arguments to the operator and not as a list.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed

as strings without being wrapped in a col() operator.

The select operator should be replaced by a drop operator.

The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and

f should be replaced by transactionId, predError, value and storeId.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.

Answer:

Explanation:

Explanation

Correct code block: transactionsDf.drop("productId", "f")

This QUESTION NO: requires a lot of thinking to get right. For solving it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code

block

includes multiple errors. In the test, you are usually confronted with a code block that only contains a single error. However, since you are practicing here, this challenging multi-error QUESTION

NO: will

make it easier for you to deal with single-error questions in the real exam.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as

strings without being wrapped in a col() operator.

Correct! Here, you need to figure out the many, many things that are wrong with the initial code block. While the QUESTION NO: can be solved by using a select statement, a drop statement, given

the

answer options, is the correct one. Then, you can read in the documentation that drop does not take a list as an argument, but just the column names that should be dropped. Finally, the column

names should be expressed as strings and not as Python variable names as in the original code block.

The column names should be listed directly as arguments to the operator and not as a list.

Incorrect. While this is a good first step and part of the correct solution (see above), this modification is insufficient to solve the question.

The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f

should be replaced by transactionId, predError, value and storeId.

Wrong. If you use the same pattern as in the original code block (col(productId), col(f)), you are still making a mistake. col(productId) will trigger Python to search for the content of a variable named

productId instead of telling Spark to use the column productId - for that, you need to express it as a string.

The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.

No. This still leaves you with Python trying to interpret the column names as Python variables (see above).

The select operator should be replaced by a drop operator.

Wrong, this is not enough to solve the question. If you do this, you will still face problems since you are passing a Python list to drop and the column names are still interpreted as Python variables

(see above).

More info: pyspark.sql.DataFrame.drop — PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, QUESTION NO: 30 (Databricks import instructions)

Question 20

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column

predError in DataFrame transactionsDf?

Options:

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

transactionsDf.withColumn("predErrorSquared", "predError"**2)

Weekend Sale 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: save70

Helping Hand Questions for Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

CompTIA

Fortinet

Microsoft

Salesforce