Spark RDD: A Small Case Study

DKC Career
May 28, 2020

After working with the DataFrame and Dataset APIs, it is difficult to come back to RDDs

Since almost everybody now works with the Spark DataFrame or Dataset API, developers tend to build their solutions on those two APIs.

So when a problem has to be solved with RDDs, it can turn into a challenging job.

I got the use case below from a friend who was struggling to solve it. When I tried it, it took less than 30 minutes to finish, so I thought I would put it on the blog for others and for our followers as well.

1. CyberCrunch, a tech news forum, allows users to indicate that they like (upvote) interesting articles posted by other users. Information about upvotes is stored as a text file where every line is associated with a single upvote and contains the following components separated by commas:
- A URL linking to the article.
- The user ID of the user who posted the article.
- The user ID of the user who upvoted the article.
- The date and time of the upvote.
The following is an example of several lines from the text file: 

https://technews.com/omg.html,tholloway,trex,21-01-2020 16:45:53
https://s3lab.isg.rhul.ac.uk/,trex,throwaway1,21-01-2020 16:54:30
https://anchor.fm/compsoc,sparkWiz,avidReader,22-01-2020 16:54:30
https://technews.com/omg.html,sparkWiz,sparkWiz,22-01-2020 16:58:01
https://anchor.fm/compsoc,nosglGuru,csRulez,23-01-2020 00:50:59
https://anchor.fm/compsoc,pythonista,hadooper,23-01-2020 00:50:59
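Before touching Spark, it can help to pin down what one line of the file means. The snippet below is a minimal plain-Scala sketch of that, not part of the original exercise: the Upvote case class, its field names, and the parse helper are made up for illustration, and the timestamp pattern dd-MM-yyyy HH:mm:ss is only inferred from the sample lines above.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical record type for one upvote line; the field names are illustrative.
final case class Upvote(url: String, poster: String, upvoter: String, at: LocalDateTime)

object Upvote {
  // Assumption: timestamps are day-month-year followed by a 24-hour time,
  // as the sample lines suggest.
  private val fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss")

  // Returns None when the line does not have exactly four comma-separated
  // fields or when the timestamp does not match the assumed pattern.
  def parse(line: String): Option[Upvote] =
    line.split(",", -1).map(_.trim) match {
      case Array(url, poster, upvoter, ts) =>
        Try(LocalDateTime.parse(ts, fmt)).toOption
          .map(at => Upvote(url, poster, upvoter, at))
      case _ => None
    }
}
```

For the first sample line, Upvote.parse gives a record whose poster is tholloway and whose upvoter is trex. The exercise itself only needs the raw split fields, so this typed version is optional.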

(a) The items below describe a series of data manipulation steps. For each step, write a method call applied to a suitable RDD (Resilient Distributed Dataset) object to obtain the result required by the step description. Assume that a base RDD consisting of a sequence of lines in the above format has already been created and bound to the variable rddUpvotes. 

i. Convert each line in rddUpvotes into a list of strings holding the values of the original components stored in that line. For example, the line 'https://s3lab.isg.rhul.ac.uk/,trex,throwaway1,21-01-2020 16:54:30' should become ['https://s3lab.isg.rhul.ac.uk/', 'trex', 'throwaway1', '21-01-2020 16:54:30']. Store the result in rdd1.

ii. Filter out all upvotes where the user who posted the article is the same as the user who upvoted it. Store the result in rdd2.

iii. Convert each remaining element of rdd2 into a tuple (url, 1) where url is the URL of the article being upvoted. Store the result in rdd3.

iv. Compute from rdd3 the total number of upvotes for each article, and store the result in rdd4.


(b) Write a Spark program that starts from rdd2 and computes, for each pair of users (p, u), the total number of upvotes given by user u for all articles posted by user p.


Solutions:

1.(a)
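// Base RDD: one element per line of the upvote file (sample path)
val rddUpvotes = sc.textFile("/tmp/data1.txt")

// i. Split each line on commas into its four components
val rdd1 = rddUpvotes.map(line => line.split(","))

// ii. Keep only upvotes where the poster (index 1) differs from the upvoter (index 2)
val rdd2 = rdd1.filter(fields => fields(1) != fields(2))

// iii. Pair the URL of each remaining upvote with a count of 1
val rdd3 = rdd2.map(fields => (fields(0), 1))

// iv. Group the pairs by URL and add up the 1s
val rdd4 = rdd3.groupBy(_._1).map { case (url, pairs) => (url, pairs.map(_._2).sum) }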
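To check steps i to iv end to end, the sketch below runs them on the six sample lines in local mode. Treat it as one possible version under stated assumptions: the object name UpvoteCounts, the helper countUpvotes, and the local[*] master are made up for illustration. For the last step it uses reduceByKey, which adds up the counts inside each partition before the shuffle, so it usually moves less data than grouping first and summing afterwards.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UpvoteCounts {
  // Steps i to iv as one function:
  // split, drop self-upvotes, pair each URL with 1, sum per URL.
  def countUpvotes(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .map(_.split(","))
      .filter(fields => fields(1) != fields(2))
      .map(fields => (fields(0), 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for trying this out.
    val conf = new SparkConf().setAppName("UpvoteCounts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // The six sample lines from the exercise.
      val sample = sc.parallelize(Seq(
        "https://technews.com/omg.html,tholloway,trex,21-01-2020 16:45:53",
        "https://s3lab.isg.rhul.ac.uk/,trex,throwaway1,21-01-2020 16:54:30",
        "https://anchor.fm/compsoc,sparkWiz,avidReader,22-01-2020 16:54:30",
        "https://technews.com/omg.html,sparkWiz,sparkWiz,22-01-2020 16:58:01",
        "https://anchor.fm/compsoc,nosglGuru,csRulez,23-01-2020 00:50:59",
        "https://anchor.fm/compsoc,pythonista,hadooper,23-01-2020 00:50:59"
      ))
      countUpvotes(sample).collect().sortBy(_._1).foreach(println)
      // (https://anchor.fm/compsoc,3)
      // (https://s3lab.isg.rhul.ac.uk/,1)
      // (https://technews.com/omg.html,1)
    } finally sc.stop()
  }
}
```

The self-upvote by sparkWiz is dropped in the filter step, which is why technews.com/omg.html ends up with 1 upvote and not 2.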

1.(b)
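// Key each upvote by the (poster, upvoter) pair; a tuple key cannot clash the way a joined string can
val rdd5 = rdd2.map(fields => ((fields(1), fields(2)), 1))

// Add up the 1s for each pair of users
val rdd6 = rdd5.groupBy(_._1).map { case (pair, hits) => (pair, hits.map(_._2).sum) }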
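Part (b) can be tried out the same way. The sketch below is an illustration under assumptions, not the only answer: the object name PairUpvotes, the helper countByPair, and the three input rows are made up (the rows are not from the original file; they exist only so that one pair occurs twice). It keys each upvote by a (poster, upvoter) tuple. A key built by joining the two IDs with "_" would give the same string for the users a_b and c as for the users a and b_c, so their counts would be merged by mistake.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairUpvotes {
  // Input mirrors rdd2: each element is Array(url, poster, upvoter, timestamp)
  // with self-upvotes already removed.
  def countByPair(rdd2: RDD[Array[String]]): RDD[((String, String), Int)] =
    rdd2
      .map(fields => ((fields(1), fields(2)), 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for trying this out.
    val conf = new SparkConf().setAppName("PairUpvotes").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up rows, not taken from the original upvote file.
      val rdd2 = sc.parallelize(Seq(
        "https://example.org/a,trex,throwaway1,21-01-2020 16:54:30",
        "https://example.org/b,trex,throwaway1,22-01-2020 09:00:00",
        "https://example.org/a,sparkWiz,avidReader,22-01-2020 16:54:30"
      )).map(_.split(","))
      countByPair(rdd2).collect().sortBy(_._1).foreach(println)
      // ((sparkWiz,avidReader),1)
      // ((trex,throwaway1),2)
    } finally sc.stop()
  }
}
```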

Thanks for visiting us. Please subscribe to stay in touch, and let us know if we can help solve your use case too.

AnyDataFlow
