Assignment2

Q1: Create an application using Google Colab and PySpark in Python. The application should meet the following requirements:

Note: The first four steps are the common environment setup for using PySpark in Google Colab.

1. Download and install openjdk-8.
2. Download and set up Apache Spark from the URL: http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
3. Set the environment variables for Java and Spark.
4. Create a Spark session to perform the different operations in PySpark.
5. Upload the dataset file “assignment2_dataset.csv” to your Google Drive and load it in the application. Display the top 10 rows of the dataset.
6. Display the total count of records in the dataset.
7. Filter the records based on the Area (sq km) column. First, display the count of records where the area is greater than 2381750; then, below the count, display all records where the area is greater than that value.
8. Rename the existing column “Area(sq km)” in the dataset to “Area”, then display the data to confirm that the column is renamed.
9. Fetch the top 200 rows from the dataset and sort them by “Country” in descending order, i.e. countries starting with the letter “z” should be on top, and so on.
10. Select only the three columns “Country”, “Area” and “Population”, keeping all rows where Population is greater than 29928987. Displaying the count of the rows matching this criterion earns a plus point.
11. Sort the records fetched in point no. 10 (with the same filter applied and only the three specified columns displayed) by Population in descending order, i.e. the record with the highest population is displayed on top.
12. Fetch all the columns for the rows where Country is either “Saudi Arabia” or “Pakistan”.
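The environment setup in the first four steps is often done in a single Colab cell. The sketch below shows one way to set the Java and Spark environment variables. The two install paths and the `openjdk-8-jdk-headless` package name are assumptions about a default Colab machine, and the helper name `set_spark_environment` is hypothetical; adjust them to wherever Java and the Spark archive actually end up.

```python
import os

# In a Colab cell, the downloads are usually run as shell commands. They are
# shown as comments here because they only work inside a notebook cell
# (the package name and the extraction directory are assumptions):
#   !apt-get install -y openjdk-8-jdk-headless
#   !wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
#   !tar xf spark-3.1.1-bin-hadoop3.2.tgz

# Assumed install locations on a default Colab machine.
JAVA_HOME = "/usr/lib/jvm/java-8-openjdk-amd64"
SPARK_HOME = "/content/spark-3.1.1-bin-hadoop3.2"


def set_spark_environment(java_home=JAVA_HOME, spark_home=SPARK_HOME):
    """Point the current process at the Java and Spark installations."""
    os.environ["JAVA_HOME"] = java_home
    os.environ["SPARK_HOME"] = spark_home
    # Return the values that were set so the caller can inspect them.
    return {
        "JAVA_HOME": os.environ["JAVA_HOME"],
        "SPARK_HOME": os.environ["SPARK_HOME"],
    }


env = set_spark_environment()
print(env)
```

With the variables in place, a common next step is to install the third-party `findspark` package and call `findspark.init()` before creating the Spark session, so that Python can locate the extracted Spark distribution.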

Q2: In the same application created in Q1 using Google Colab and PySpark, add the following functionality:

Create a Spark context and specify a cluster with the following details:
1. Name the cluster “Assignment-2”
2. Set the cluster to run on the local computer and not on any cloud (e.g. AWS)
3. Specify that the cluster uses 2 cores in each worker node
Load the following “students” data in the application:

['Muhammad', 'Abdullah', 'Farooq', 'Ahmad']

Move the data into Spark using its parallelize method
Now display the number of cores used for that data
Search for a student name, e.g. Farooq; the search should run across 4 different partitions of the dataset simultaneously.
In step 2 use the “assignment2_dataset.csv” dataset