Latest Cloudera CCA175 Dumps PDF Questions Answers 2025

CCA Spark and Hadoop Developer Exam - Performance Based Scenarios Questions and Answers

Question 1

Problem Scenario 45 : You have been given 2 files, with the content given below.

(spark12/technology.txt)

(spark12/salary.txt)

(spark12/technology.txt)

first,last,technology

Amit,Jain,java

Lokesh,kumar,unix

Mithun,kale,spark

Rajni,vekat,hadoop

Rahul,Yadav,scala

(spark12/salary.txt)

first,last,salary

Amit,Jain,100000

Lokesh,kumar,95000

Mithun,kale,150000

Rajni,vekat,154000

Rahul,Yadav,120000

Write a Spark program that joins the data on first and last name and saves the joined results in the following format: first,last,technology,salary
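The join itself can be sketched in plain Python before moving it to Spark. This is only an illustration of the keying logic, not the exam's reference answer: the helper name to_pairs is made up for this sketch, and the comma-separated output format is an assumption. In Spark the same idea would be expressed by keying both RDDs on (first, last) and joining them.

```python
# Sample rows copied from the scenario; the first line of each file is a header.
technology_lines = [
    "first,last,technology",
    "Amit,Jain,java",
    "Lokesh,kumar,unix",
    "Mithun,kale,spark",
    "Rajni,vekat,hadoop",
    "Rahul,Yadav,scala",
]
salary_lines = [
    "first,last,salary",
    "Amit,Jain,100000",
    "Lokesh,kumar,95000",
    "Mithun,kale,150000",
    "Rajni,vekat,154000",
    "Rahul,Yadav,120000",
]


def to_pairs(lines):
    """Hypothetical helper: skip the header and key each row by (first, last)."""
    pairs = {}
    for line in lines[1:]:
        first, last, value = line.split(",")
        pairs[(first, last)] = value
    return pairs


tech = to_pairs(technology_lines)
sal = to_pairs(salary_lines)

# Inner join on the (first, last) key; output format is assumed to be comma-separated.
joined = [
    ",".join([first, last, tech[(first, last)], sal[(first, last)]])
    for (first, last) in tech
    if (first, last) in sal
]

for row in joined:
    print(row)
```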

Options:


Question 2

Problem Scenario 73 : You have been given data in json format as below.

{"first_name":"Ankit", "last_name":"Jain"}

{"first_name":"Amir", "last_name":"Khan"}

{"first_name":"Rajesh", "last_name":"Khanna"}

{"first_name":"Priynka", "last_name":"Chopra"}

{"first_name":"Kareena", "last_name":"Kapoor"}

{"first_name":"Lokesh", "last_name":"Yadav"}

Do the following activity

1. Create the employee.json file locally.

2. Load this file onto hdfs.

3. Register this data as a temp table in Spark using Python.

4. Write a select query and print this data.

5. Now save this selected data back in json format.
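Step 1 of the activity can be sketched in plain Python. The sketch below writes one JSON object per line, which is the layout Spark's JSON reader expects by default, and reads the file back to check it. Writing into a temporary directory is an assumption made so the example is self-contained; the hdfs upload and the Spark steps are not covered here.

```python
import json
import os
import tempfile

# Records copied from the scenario.
records = [
    {"first_name": "Ankit", "last_name": "Jain"},
    {"first_name": "Amir", "last_name": "Khan"},
    {"first_name": "Rajesh", "last_name": "Khanna"},
    {"first_name": "Priynka", "last_name": "Chopra"},
    {"first_name": "Kareena", "last_name": "Kapoor"},
    {"first_name": "Lokesh", "last_name": "Yadav"},
]

# Assumption: a temporary directory stands in for the local working directory.
path = os.path.join(tempfile.mkdtemp(), "employee.json")

# Write one JSON object per line (JSON Lines layout).
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm every line parses on its own.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), "records written to", path)
```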

Options:

Question 3

Problem Scenario 27 : You need to implement a near-real-time solution for collecting information when it is submitted in files with the information below.

Data

echo "IBM,100,20160104" >> /tmp/spooldir/bb/.bb.txt

echo "IBM,103,20160105" >> /tmp/spooldir/bb/.bb.txt

mv /tmp/spooldir/bb/.bb.txt /tmp/spooldir/bb/bb.txt

After a few minutes

echo "IBM,100.2,20160104" >> /tmp/spooldir/dr/.dr.txt

echo "IBM,103.1,20160105" >> /tmp/spooldir/dr/.dr.txt

mv /tmp/spooldir/dr/.dr.txt /tmp/spooldir/dr/dr.txt

Requirements:

You have been given the directory location /tmp/spooldir (if it is not available, then create it). You have a financial subscription for getting stock prices from Bloomberg as well as Reuters, and using ftp you download new files every hour from their respective ftp sites into the directories /tmp/spooldir/bb and /tmp/spooldir/dr respectively.

As soon as a file is committed in one of these directories, it needs to be available in hdfs in the /tmp/flume/finance location, in a single directory.

Write a flume configuration file named flume7.conf and use it to load data into hdfs with the following additional properties.

1. Spool /tmp/spooldir/bb and /tmp/spooldir/dr

2. File prefix in hdfs should be events

3. File suffix should be .log

4. If a file is not committed and is in use, then it should have _ as prefix.

5. Data should be written as text to hdfs

Options:

Question 4

Problem Scenario 46 : You have been given the below list in Scala (name, sex, cost) for each work done.

List( ("Deeapak" , "male", 4000), ("Deepak" , "male", 2000), ("Deepika" , "female", 2000),("Deepak" , "female", 2000), ("Deepak" , "male", 1000) , ("Neeta" , "female", 2000))

Now write a Spark program to load this list as an RDD and do the sum of cost for combination of name and sex (as key)
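The aggregation asked for here is a sum per composite key. A minimal plain-Python sketch of that logic follows, using the list from the scenario; in Spark the same shape would typically be a map to ((name, sex), cost) followed by a per-key reduction, but the code below does not use Spark.

```python
from collections import defaultdict

# List copied from the scenario: (name, sex, cost).
work = [
    ("Deeapak", "male", 4000),
    ("Deepak", "male", 2000),
    ("Deepika", "female", 2000),
    ("Deepak", "female", 2000),
    ("Deepak", "male", 1000),
    ("Neeta", "female", 2000),
]

# Sum the cost for every (name, sex) combination.
totals = defaultdict(int)
for name, sex, cost in work:
    totals[(name, sex)] += cost
totals = dict(totals)

for key, total in totals.items():
    print(key, total)
```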

Options:

Question 5

Problem Scenario 74 : You have been given MySQL DB with following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.orders

table=retail_db.order_items

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Columns of orders table : (order_id , order_date , order_customer_id, order_status)

Columns of order_items table : (order_item_id , order_item_order_id , order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)

Please accomplish following activities.

1. Copy "retail_db.orders" and "retail_db.order_items" table to hdfs in respective directory p89_orders and p89_order_items .

2. Join these data using order_id in Spark and Python

3. Now fetch selected columns from joined data: order_id, order date and amount collected on this order.

4. Calculate total orders placed for each date, and produce the output sorted by date.

Options:

Answer:

See the explanation for Step by Step Solution and configuration.

Explanation:

Solution:

Step 1 : Import each table.

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p89_orders -m 1

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p89_order_items -m 1

Note : Please check that you don't have a space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from RDBMS to hdfs.

Step 2 : Read the data from one of the partitions created by the above commands.

hadoop fs -cat p89_orders/part-m-00000
hadoop fs -cat p89_order_items/part-m-00000

Step 3 : Load the above two directories as RDDs using Spark and Python (open a pyspark terminal and do the following).

orders = sc.textFile("p89_orders")
orderItems = sc.textFile("p89_order_items")

Step 4 : Convert each RDD into key-value pairs (order_id as the key and the whole line as the value)

#First field of orders is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))

#Second field of order_items is order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))

Step 5 : Join both the RDDs using order_id

joinedData = orderItemsKeyValue.join(ordersKeyValue)

#print the joined data
for line in joinedData.collect():
    print(line)

Format of joinedData as below.

(order_id, ('All columns from orderItemsKeyValue', 'All columns from ordersKeyValue'))

Step 6 : Now fetch selected values: order_id, order date and amount collected on this order.

revenuePerOrderPerDay = joinedData.map(lambda row: (row[0], row[1][1].split(",")[1], float(row[1][0].split(",")[4])))

#print the result
for line in revenuePerOrderPerDay.collect():
    print(line)

Step 7 : Select distinct order ids for each date.

#distinct(date,order_id)
distinctOrdersDate = joinedData.map(lambda row: row[1][1].split(",")[1] + "," + str(row[0])).distinct()

for line in distinctOrdersDate.collect():
    print(line)

Step 8 : Similar to word count, generate a (date, 1) record for each row.

newLineTuple = distinctOrdersDate.map(lambda line: (line.split(",")[0], 1))

Step 9 : Do the count for each key (date), to get total orders per date.

totalOrdersPerDate = newLineTuple.reduceByKey(lambda a, b: a + b)

#print results
for line in totalOrdersPerDate.collect():
    print(line)

Step 10 : Sort the results by date

sortedData = totalOrdersPerDate.sortByKey().collect()

#print results
for line in sortedData:
    print(line)
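The distinct-then-count logic of the last four steps can be checked without a cluster. The sketch below replays it in plain Python on a few made-up rows shaped like the joined data, that is (order id, (order item line, order line)); the sample values are hypothetical and are not taken from the real retail_db tables.

```python
# Hypothetical joined rows: (order_id, (order_item_line, order_line)).
joined_rows = [
    (1, ("1,1,957,1,299.98,299.98", "1,2013-07-25 00:00:00.0,11599,CLOSED")),
    (2, ("2,2,1073,1,199.99,199.99", "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT")),
    (2, ("3,2,502,5,250.0,50.0", "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT")),
    (4, ("5,4,897,2,49.98,24.99", "4,2013-07-26 00:00:00.0,8827,CLOSED")),
]

# Distinct (date, order_id) pairs, so an order with several items counts once.
distinct_orders = {
    (order_line.split(",")[1], order_id)
    for order_id, (_, order_line) in joined_rows
}

# Count one per distinct order for each date, as in a word count.
orders_per_date = {}
for date, _ in distinct_orders:
    orders_per_date[date] = orders_per_date.get(date, 0) + 1

# Sort by date.
sorted_counts = sorted(orders_per_date.items())
for line in sorted_counts:
    print(line)
```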

Question 6

Problem Scenario 16 : You have been given following mysql database details as well as other info.

user=retail_dba

password=cloudera

database=retail_db

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish below assignment.

1. Create a table in hive as below.

create table departments_hive(department_id int, department_name string);

2. Now import data from the mysql table departments into this hive table. Please make sure that the data is visible using the below hive command.

select * from departments_hive

Options:

Question 7

Problem Scenario 12 : You have been given following mysql database details as well as other info.

user=retail_dba

password=cloudera

database=retail_db

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish following.

1. Create a table in retail_db with following definition.

CREATE table departments_new (department_id int(11), department_name varchar(45), created_date TIMESTAMP DEFAULT NOW());

2. Now insert records from departments table to departments_new

3. Now import data from departments_new table to hdfs.

4. Insert following 5 records in departments_new table.

Insert into departments_new values(110, "Civil" , null);
Insert into departments_new values(111, "Mechanical" , null);
Insert into departments_new values(112, "Automobile" , null);
Insert into departments_new values(113, "Pharma" , null);
Insert into departments_new values(114, "Social Engineering" , null);

5. Now do the incremental import based on created_date column.

Options:

Question 8

Problem Scenario 55 : You have been given below code snippet.

val pairRDD1 = sc.parallelize(List( ("cat",2), ("cat", 5), ("book", 4),("cat", 12)))
val pairRDD2 = sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("cat", 12)))

operation1

Write a correct code snippet for operation1 which will produce desired output, shown below.

Array[(String, (Option[Int], Option[Int]))] = Array((book,(Some(4),None)), (mouse,(None,Some(4))), (cup,(None,Some(5))), (cat,(Some(2),Some(2))), (cat,(Some(2),Some(12))), (cat,(Some(5),Some(2))), (cat,(Some(5),Some(12))), (cat,(Some(12),Some(2))), (cat,(Some(12),Some(12))))
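The desired output has the shape of a full outer join: every key from either side appears, a missing side shows up as None, and keys present on both sides produce every left/right combination. A plain-Python sketch of those semantics follows, with None standing in for Scala's None and bare values for Some(...). The function name full_outer_join is made up for this sketch; on Spark pair RDDs the corresponding operation would presumably be fullOuterJoin.

```python
pairs1 = [("cat", 2), ("cat", 5), ("book", 4), ("cat", 12)]
pairs2 = [("cat", 2), ("cup", 5), ("mouse", 4), ("cat", 12)]


def full_outer_join(left, right):
    """Hypothetical helper mirroring full-outer-join semantics on key-value lists."""
    keys = {k for k, _ in left} | {k for k, _ in right}
    result = []
    for key in sorted(keys):
        # A side with no values for this key contributes a single None.
        left_vals = [v for k, v in left if k == key] or [None]
        right_vals = [v for k, v in right if k == key] or [None]
        for lv in left_vals:
            for rv in right_vals:
                result.append((key, (lv, rv)))
    return result


joined = full_outer_join(pairs1, pairs2)
for row in joined:
    print(row)
```

The row order here is sorted by key for readability; Spark makes no such ordering promise for join output.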

Options:

Question 9

Problem Scenario 86 : In Continuation of previous question, please accomplish following activities.

1. Select Maximum, minimum, average , Standard Deviation, and total quantity.

2. Select minimum and maximum price for each product code.

3. Select Maximum, minimum, average , Standard Deviation, and total quantity for each product code; however, make sure Average and Standard Deviation have at most two decimal places.

4. Select all the product code and average price only where product count is more than or equal to 3.

5. Select maximum, minimum , average and total of all the products for each code. Also produce the same across all the products.

Options:

Question 10

Problem Scenario 54 : You have been given below code snippet.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"))

val b = a.map(x => (x.length, x))

operation1

Write a correct code snippet for operation1 which will produce desired output, shown below.

Array[(Int, String)] = Array((4,lion), (7,panther), (3,dogcat), (5,tigereagle))
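The desired output concatenates all words that share the same length, which is what a per-key fold with an empty-string zero value and string concatenation would produce (on a Spark pair RDD, something like foldByKey is one candidate; that is an inference from the output, not a confirmed answer). The plain-Python sketch below reproduces the grouping on the same list.

```python
animals = ["dog", "tiger", "lion", "cat", "panther", "eagle"]

# Key each word by its length and concatenate words that share a key,
# starting from an empty string as the zero value.
by_length = {}
for word in animals:
    by_length[len(word)] = by_length.get(len(word), "") + word

print(sorted(by_length.items()))
```

In Spark the concatenation order within a key can depend on partitioning, so "dogcat" versus "catdog" is not guaranteed in general.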

Options:

Question 11

Problem Scenario 10 : You have been given following mysql database details as well as other info.

user=retail_dba

password=cloudera

database=retail_db

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish following.

1. Create a database named hadoopexam and then create a table named departments in it, with following fields.

department_id int,
department_name string

e.g. location should be hdfs://quickstart.cloudera:8020/user/hive/warehouse/hadoopexam.db/departments

2. Please import data in existing table created above from retail_db.departments into hive table hadoopexam.departments.

3. Please import data in a non-existing table, i.e. while importing, create a hive table named hadoopexam.departments_new

Options:

Question 12

Problem Scenario 80 : You have been given MySQL DB with following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.products

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish following activities.

1. Copy "retail_db.products" table to hdfs in a directory p93_products

2. Now sort the products data by product price per category; use the product_category_id column to group by category

Options:

Answer:

See the explanation for Step by Step Solution and configuration.

Explanation:

Solution :

Step 1 : Import Single table .

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=products --target-dir=p93

Note : Please check that you don't have a space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from RDBMS to hdfs.

Step 2 : Read the data from one of the partitions created by the above command.

hadoop fs -cat p93_products/part-m-00000

Step 3 : Load this directory as an RDD using Spark and Python (open a pyspark terminal and do the following).

productsRDD = sc.textFile("p93_products")

Step 4 : Filter empty prices, if any exist

#filter out lines with empty prices
nonempty_lines = productsRDD.filter(lambda x: len(x.split(",")[4]) > 0)

Step 5 : Create a data set like (categoryId, (id, name, price))

mappedRDD = nonempty_lines.map(lambda line: (line.split(",")[1], (line.split(",")[0], line.split(",")[2], float(line.split(",")[4]))))

for line in mappedRDD.collect():
    print(line)

Step 6 : Now group all records by categoryId, which is the key of mappedRDD. It will produce output like (categoryId, iterable of all values for that categoryId)

groupByCategoryId = mappedRDD.groupByKey()

for line in groupByCategoryId.collect():
    print(line)

Step 7 : Now sort the data in each category based on price in ascending order.

# sorted is a function to sort an iterable; we can also specify the key on which we want to sort, in this case the price.
groupByCategoryId.map(lambda tuple: sorted(tuple[1], key=lambda tupleValue: tupleValue[2])).take(5)

Step 8 : Now sort the data in each category based on price in descending order.

# sorted is a function to sort an iterable; we can also specify the key on which we want to sort, in this case the price.
groupByCategoryId.map(lambda tuple: sorted(tuple[1], key=lambda tupleValue: tupleValue[2], reverse=True)).take(5)
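The filter, group and sort steps can be tried in plain Python first. The sketch below runs the same logic on a handful of made-up product lines; the field layout (id, category id, name, description, price, image) is assumed from the indexes used in the solution, and the sample values are hypothetical.

```python
# Hypothetical product lines: id,category_id,name,description,price,image
product_lines = [
    "1,2,Product A,,59.98,img1",
    "2,2,Product B,,129.99,img2",
    "3,3,Product C,,,img3",  # empty price, filtered out below
    "4,3,Product D,,89.99,img4",
    "5,2,Product E,,21.99,img5",
]

# Drop lines whose price field is empty.
nonempty = [line for line in product_lines if len(line.split(",")[4]) > 0]

# Group (id, name, price) tuples by category id.
grouped = {}
for line in nonempty:
    fields = line.split(",")
    grouped.setdefault(fields[1], []).append((fields[0], fields[2], float(fields[4])))

# Sort each category by price, ascending and descending.
ascending = {cat: sorted(items, key=lambda item: item[2]) for cat, items in grouped.items()}
descending = {
    cat: sorted(items, key=lambda item: item[2], reverse=True)
    for cat, items in grouped.items()
}

print(ascending)
print(descending)
```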

Question 13

Problem Scenario 77 : You have been given MySQL DB with following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.orders

table=retail_db.order_items

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Columns of orders table : (order_id , order_date , order_customer_id, order_status)

Columns of order_items table : (order_item_id , order_item_order_id , order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)

Please accomplish following activities.

1. Copy "retail_db.orders" and "retail_db.order_items" table to hdfs in respective directory p92_orders and p92_order_items .

2. Join these data using order_id in Spark and Python

3. Calculate total revenue per day and per order

4. Calculate total and average revenue for each date, using:

- combineByKey

- aggregateByKey

Options:

Answer:

See the explanation for Step by Step Solution and configuration.

Explanation:

Solution :

Step 1 : Import Single table .

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p92_orders -m 1

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p92_order_items -m 1

Note : Please check that you don't have a space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from RDBMS to hdfs.

Step 2 : Read the data from one of the partitions created by the above commands.

hadoop fs -cat p92_orders/part-m-00000
hadoop fs -cat p92_order_items/part-m-00000

Step 3 : Load the above two directories as RDDs using Spark and Python (open a pyspark terminal and do the following).

orders = sc.textFile("p92_orders")
orderItems = sc.textFile("p92_order_items")

Step 4 : Convert each RDD into key-value pairs (order_id as the key and the whole line as the value)

#First field of orders is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))

#Second field of order_items is order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))

Step 5 : Join both the RDDs using order_id

joinedData = orderItemsKeyValue.join(ordersKeyValue)

#print the joined data
for line in joinedData.collect():
    print(line)

Format of joinedData as below.

(order_id, ('All columns from orderItemsKeyValue', 'All columns from ordersKeyValue'))

Step 6 : Now fetch selected values: order_id, order date and amount collected on this order.

#Returned row will contain ((order_date, order_id), amount_collected)
revenuePerDayPerOrder = joinedData.map(lambda row: ((row[1][1].split(",")[1], row[0]), float(row[1][0].split(",")[4])))

#print the result
for line in revenuePerDayPerOrder.collect():
    print(line)

Step 7 : Now calculate total revenue per day and per order

A. Using reduceByKey

totalRevenuePerDayPerOrder = revenuePerDayPerOrder.reduceByKey(lambda runningSum, value: runningSum + value)

for line in totalRevenuePerDayPerOrder.sortByKey().collect():
    print(line)

#Generate data as (date, amount_collected) (ignore order_id)
dateAndRevenueTuple = totalRevenuePerDayPerOrder.map(lambda line: (line[0][0], line[1]))

for line in dateAndRevenueTuple.sortByKey().collect():
    print(line)

Step 8 : Calculate the total amount collected for each day, and also count the number of orders for each day.

#Generate output as (date, (total revenue for date, number of orders for date))

#Line 1 : it will generate tuple (revenue, 1)
#Line 2 : Here, we sum all revenues and at the same time keep another counter for the number of records.
#Line 3 : Final function to merge all the combiners

totalRevenueAndTotalCount = dateAndRevenueTuple.combineByKey( \
    lambda revenue: (revenue, 1), \
    lambda revenueSumTuple, amount: (revenueSumTuple[0] + amount, revenueSumTuple[1] + 1), \
    lambda tuple1, tuple2: (round(tuple1[0] + tuple2[0], 2), tuple1[1] + tuple2[1]) \
)

for line in totalRevenueAndTotalCount.collect():
    print(line)

Step 9 : Now calculate average for each date

averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))

for line in averageRevenuePerDate.collect(): print(line)

Step 10 : Using aggregateByKey

#line 1 : (Initialize both values, revenue and count)
#line 2 : runningRevenueSumTuple (it is a tuple of total revenue and total record count for each date)
#line 3 : Summing revenue and count across all partitions

totalRevenueAndTotalCount = dateAndRevenueTuple.aggregateByKey( \
    (0,0), \
    lambda runningRevenueSumTuple, revenue: (runningRevenueSumTuple[0] + revenue, runningRevenueSumTuple[1] + 1), \
    lambda tupleOneRevenueAndCount, tupleTwoRevenueAndCount: (tupleOneRevenueAndCount[0] + tupleTwoRevenueAndCount[0], tupleOneRevenueAndCount[1] + tupleTwoRevenueAndCount[1]) \
)

for line in totalRevenueAndTotalCount.collect():
    print(line)

Step 11 : Calculate the average revenue per date

averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda threeElements: (threeElements[0], threeElements[1][0]/threeElements[1][1]))

for line in averageRevenuePerDate.collect():
    print(line)
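To see why combineByKey takes three functions, it can help to replay them by hand. The sketch below is a plain-Python simulation, not Spark code: combine_by_key and the sample (date, revenue) pairs are made up for illustration, and the two-way split into partitions is an assumption that stands in for how Spark distributes records.

```python
# Hypothetical (date, revenue) pairs standing in for the per-order revenue data.
date_and_revenue = [
    ("2013-07-25", 100.0),
    ("2013-07-25", 50.0),
    ("2013-07-26", 30.0),
    ("2013-07-25", 150.0),
]


def create_combiner(revenue):
    # First value seen for a key inside a partition: start a (sum, count) pair.
    return (revenue, 1)


def merge_value(acc, revenue):
    # Another value for the same key inside the same partition.
    return (acc[0] + revenue, acc[1] + 1)


def merge_combiners(a, b):
    # Merge the partial (sum, count) pairs produced by different partitions.
    return (a[0] + b[0], a[1] + b[1])


def combine_by_key(partitions):
    """Hypothetical helper: apply the three functions the way a per-key combine would."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = merge_value(acc[key], value) if key in acc else create_combiner(value)
        partials.append(acc)
    merged = {}
    for acc in partials:
        for key, combiner in acc.items():
            merged[key] = merge_combiners(merged[key], combiner) if key in merged else combiner
    return merged


# Assumption: two partitions of two records each.
partitions = [date_and_revenue[:2], date_and_revenue[2:]]
totals = combine_by_key(partitions)
averages = {date: total / count for date, (total, count) in totals.items()}

print(totals)
print(averages)
```

aggregateByKey follows the same pattern, except that the first function is replaced by a fixed zero value such as (0, 0).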

Question 14

Problem Scenario 59 : You have been given below code snippet.

val x = sc.parallelize(1 to 20)

val y = sc.parallelize(10 to 30)

operation1

z.collect

Write a correct code snippet for operation1 which will produce desired output, shown below.

Array[Int] = Array(16,12, 20,13,17,14,18,10,19,15,11)
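The desired output contains exactly the values that occur in both ranges, which is consistent with an intersection of the two RDDs (this is an inference from the output, not a confirmed answer). A plain-Python check of that reading:

```python
# Scala's `1 to 20` and `10 to 30` are inclusive ranges.
x = range(1, 21)
y = range(10, 31)

# Values present in both ranges; sorted only to make the result easy to read.
z = sorted(set(x) & set(y))
print(z)
```

The unordered listing in the expected output suggests the Spark result comes back in no particular order, so only the set of values should be compared.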

Options:

Exam Detail

Vendor: Cloudera

Certification: Cloudera Certified Associate CCA

Exam Code: CCA175

Exam Name: CCA Spark and Hadoop Developer Exam - Performance Based Scenarios

Last Update: Jul 5, 2025
