CCA175 Premium Exam Questions

CCA Spark and Hadoop Developer Exam Questions and Answers

Question 13

Problem Scenario 77 : You have been given MySQL DB with following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.orders

table=retail_db.order_items

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Columns of order table : (orderid , order_date , order_customer_id, order_status)

Columns of ordeMtems table : (order_item_id , order_item_order_ld , order_item_product_id, order_item_quantity,order_item_subtotal,order_ item_product_price)

Please accomplish following activities.

1. Copy "retail_db.orders" and "retail_db.order_items" table to hdfs in respective directory p92_orders and p92 order items .

2. Join these data using orderid in Spark and Python

3. Calculate total revenue perday and per order

4. Calculate total and average revenue for each date. - combineByKey

-aggregateByKey

Options:

Answer:

See the explanation for Step by Step Solution and configuration.

Explanation:

Solution :

Step 1 : Import Single table .

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db -username=retail_dba -password=cloudera -table=orders --target-dir=p92_orders –m 1

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba -password=cloudera -table=order_items --target-dir=p92_order_items –m1

Note : Please check you dont have space between before or after '=' sign. Sqoop uses the MapReduce framework to copy data from RDBMS to hdfs

Step 2 : Read the data from one of the partition, created using above command, hadoop fs -cat p92_orders/part-m-00000 hadoop fs -cat p92_order_items/part-m-00000

Step 3 : Load these above two directory as RDD using Spark and Python (Open pyspark terminal and do following). orders = sc.textFile("p92_orders") orderltems = sc.textFile("p92_order_items")

Step 4 : Convert RDD into key value as (orderjd as a key and rest of the values as a value)

#First value is orderjd

ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))

#Second value as an Orderjd

orderltemsKeyValue = orderltems.map(lambda line: (int(line.split(",")[1]), line))

Step 5 : Join both the RDD using orderjd

joinedData = orderltemsKeyValue.join(ordersKeyValue)

#print the joined data

for line in joinedData.collect():

print(line)

Format of joinedData as below.

[Orderld, 'All columns from orderltemsKeyValue', 'All columns from orders Key Value']

Step 6 : Now fetch selected values Orderld, Order date and amount collected on this order.

//Retruned row will contain ((order_date,order_id),amout_collected)

revenuePerDayPerOrder = joinedData.map(lambda row: ((row[1][1].split(M,M)[1],row[0]}, float(row[1][0].split(",")[4])))

#print the result

for line in revenuePerDayPerOrder.collect():

print(line)

Step 7 : Now calculate total revenue perday and per order

A. Using reduceByKey

totalRevenuePerDayPerOrder = revenuePerDayPerOrder.reduceByKey(lambda runningSum, value: runningSum + value)

for line in totalRevenuePerDayPerOrder.sortByKey().collect(): print(line)

#Generate data as (date, amount_collected) (Ignore ordeMd)

dateAndRevenueTuple = totalRevenuePerDayPerOrder.map(lambda line: (line[0][0], line[1]))

for line in dateAndRevenueTuple.sortByKey().collect(): print(line)

Step 8 : Calculate total amount collected for each day. And also calculate number of days. #Generate output as (Date, Total Revenue for date, total_number_of_dates)

#Line 1 : it will generate tuple (revenue, 1)

#Line 2 : Here, we will do summation for all revenues at the same time another counter to maintain number of records.

#Line 3 : Final function to merge all the combiner

totalRevenueAndTotalCount = dateAndRevenueTuple.combineByKey( \

lambda revenue: (revenue, 1), \

lambda revenueSumTuple, amount: (revenueSumTuple[0] + amount, revenueSumTuple[1] + 1), \

lambda tuplel, tuple2: (round(tuple1[0] + tuple2[0], 2}, tuple1[1] + tuple2[1]) \

for line in totalRevenueAndTotalCount.collect(): print(line)

Step 9 : Now calculate average for each date

averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]}}

for line in averageRevenuePerDate.collect(): print(line)

Step 10 : Using aggregateByKey

#line 1 : (Initialize both the value, revenue and count)

#line 2 : runningRevenueSumTuple (Its a tuple for total revenue and total record count for each date)

#line 3 : Summing all partitions revenue and count

totalRevenueAndTotalCount = dateAndRevenueTuple.aggregateByKey( \

(0,0), \

lambda runningRevenueSumTuple, revenue: (runningRevenueSumTuple[0] + revenue, runningRevenueSumTuple[1] + 1), \

lambda tupleOneRevenueAndCount, tupleTwoRevenueAndCount: (tupleOneRevenueAndCount[0] + tupleTwoRevenueAndCount[0], tupleOneRevenueAndCount[1] + tupleTwoRevenueAndCount[1]) \

)

for line in totalRevenueAndTotalCount.collect(): print(line)

Step 11 : Calculate the average revenue per date

averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))

for line in averageRevenuePerDate.collect(): print(line)

Question 14

Problem Scenario 59 : You have been given below code snippet.

val x = sc.parallelize(1 to 20)

val y = sc.parallelize(10 to 30) operationl

z.collect

Write a correct code snippet for operationl which will produce desired output, shown below. Array[lnt] = Array(16,12, 20,13,17,14,18,10,19,15,11)

Spring Sale 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: save70

CCA Spark and Hadoop Developer Exam Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

CompTIA

Fortinet

Microsoft

Salesforce