Filter the records based on the “Area (sq km)” column. First, display the count of records where the area is greater than 2381750; then, below the count, display all records that satisfy the same condition.
Rename the existing column “Area(sq km)” in the dataset to “Area”, then display the data to confirm that the column has been renamed
Fetch the top 200 rows from the dataset and sort them by “Country” in descending order, i.e. countries starting with the letter “z” should appear at the top, and so on
Select only the three columns “Country”, “Area” and “Population”, keeping all rows where Population is greater than 29928987. Displaying the count of the rows matching this criterion is a plus.
Sort the records fetched in point no. 10 (with the mentioned filter applied and only the three specified columns displayed) by Population in descending order, i.e. the record with the highest population should appear at the top.
Fetch all the columns for the data where Country is either “Saudi Arabia” or “Pakistan”
Q2: In the same application created in Q1 using Google Colab and PySpark, add the following functionality. Create a Spark context and specify a cluster with the following details:
Name the cluster as “Assignment-2”
Set the cluster to run on the local computer and not on any cloud service (e.g. AWS)