AIT580 Lab Journal Report

QwikLabs AWS Lab: AWS Access and Introduction to EC2, S3, and EMR
Report from laboratory exercises for Spring 2016

as part of AIT580 Analytics: Big Data to Information

Data Analytics Engineering Program

Department of Information Sciences and Technology (Applied IT)

Volgenau School of Engineering

George Mason University

Hyunseok Choi


29 Mar, 2016

Table of Contents

1. Lab Scripts

1.1. Amazon EC2

1.2. Amazon S3

1.3. Amazon EMR

2. Lab Questions

1.4. Comment on the benefits and drawbacks of having all your input and output stored on S3.

1.5. How much should organization software depend on AWS? Is it too all-encompassing?

1.6. Who would you give access to your S3 buckets for analytic work - remember read-only access is possible.

3. Learning Achieved

4. Conclusions

5. References
1. Lab Scripts

This lab covers the basics and usage of the cloud services needed for the analytic process: Amazon EC2, Amazon S3, and Amazon EMR. The lab comes with free training videos on YouTube, so you can easily understand each concept and how to carry out each hands-on exercise. Most of all, the lab provides an Amazon Web Services student account and a prebuilt environment, which saves the time otherwise spent on account registration and security key setup.

1.1. Amazon EC2

Amazon EC2, which stands for Amazon Elastic Compute Cloud, is a well-known IaaS (Infrastructure as a Service) offering from AWS (Amazon Web Services). With Amazon EC2 you can configure and operate virtual machines. The lab teaches how to create and configure an Amazon Linux instance and how to connect to it.

Once you log in to AWS with the account information provided by this lab, you move to the AWS Management Console for EC2. If you click “Launch Instance”, you will see the following screen, where you select an Amazon Machine Image (AMI). (Picture 1)

Picture 1 To select an AMI
Here, you select Amazon Linux by clicking the “Select” button next to Amazon Linux AMI. You can select another OS offered on AWS, such as Red Hat Enterprise Linux or Windows Server, but I will go with Amazon Linux for this lab. After you click “Select”, you will see the second step, where you choose the instance type. (Picture 2) Here, you choose computing capacity such as CPU and memory; the price differs depending on the capacity you select. Select the default t2.micro instance, which is eligible for the AWS free tier and limited in computing power, and click “Next: Configure Instance Details.”

Picture 2 To select an instance type

You will see further details for the configuration of your instance. (Picture 3) You can change the configuration values, but here you keep the defaults. Click “Next: Add Storage”.

Picture 3 To select configuration details

The fourth and fifth steps set your instance’s storage and tags. Again, you can change the configuration values, but you keep the default options. Click “Next: Tag Instance” to continue to the fifth step and “Next: Configure Security Group” to continue to the sixth step. This step is important because it is where you set your security options. (Picture 4) Leaving the source at its default value is insecure in practice, but leave it as it is for this lab.

Picture 4 To select security group
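The risk of an unrestricted source can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module that flags a source range admitting every address. It assumes the wizard's default source is the fully open range `0.0.0.0/0`, which is not stated in the lab text, and the helper name `is_open_to_world` is hypothetical.

```python
import ipaddress


def is_open_to_world(cidr: str) -> bool:
    """Return True if the CIDR block matches every possible address."""
    network = ipaddress.ip_network(cidr, strict=False)
    # A prefix length of 0 fixes no bits, so any source address matches.
    return network.prefixlen == 0


if __name__ == "__main__":
    # 0.0.0.0/0 is an assumed default; 203.0.113.25 is a documentation address.
    for source in ["0.0.0.0/0", "203.0.113.25/32"]:
        label = "open to the world" if is_open_to_world(source) else "restricted"
        print(f"{source}: {label}")
```

In a real deployment you would replace the open range with the single address or small range you actually connect from.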

Lastly, you will see the review screen before launch. (Picture 5) If you click “Launch”, a dialog box prompts you to choose a key pair. This is important for your security. For this lab, however, you select the existing key pair provided by QwikLabs and click “Launch Instances”. Your instance will now be created automatically as you configured it.

Picture 5 To review instance launch

If you click “View Instance”, you move to the AWS Management Console for your EC2 instance. (Picture 6) This screen is where you monitor your instances and take action on them. When the Instance State turns to “running”, your instance has been created and is ready. You can connect to the instance with the public DNS address shown on the screen.

Picture 6 AWS Management Console

Since I’m running a Mac, I used the Terminal application to connect to my EC2 instance. You move to the folder where you downloaded the private key, a file ending in .pem. The download is available in QwikLabs under Connections. Once you are in the folder, you change the permissions of the private key file with the following command.

chmod 600 <your private key file.pem>

Next, you connect to your instance using the Public DNS address shown in the AWS Management Console. The following command connects to the instance via SSH with your downloaded private key. Put the public DNS address of your own instance after the @ sign.

ssh -i <file path/your private key file.pem> ec2-user@<public DNS address>

The user name ec2-user is the default user name on my Amazon Linux instance. The following screen shows what I did to connect to the instance.

Picture 7 Commands to connect to the instance
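The two connection steps can also be scripted. The sketch below is one possible way to do it in Python with the standard library only: it restricts the key file to its owner (the effect of `chmod 600`) and assembles the `ssh` argument list. The function names, the key file name, and the host name are hypothetical placeholders, not values from the lab.

```python
import os
import shlex
import stat


def restrict_key_permissions(key_path: str) -> None:
    """Equivalent of `chmod 600`: the owner may read and write, nobody else."""
    os.chmod(key_path, stat.S_IRUSR | stat.S_IWUSR)


def build_ssh_command(key_path: str, public_dns: str, user: str = "ec2-user") -> list[str]:
    """Return the argument list for `ssh -i <key> <user>@<host>`."""
    return ["ssh", "-i", key_path, f"{user}@{public_dns}"]


if __name__ == "__main__":
    # Placeholder values; substitute your own key file and public DNS address.
    command = build_ssh_command("lab-key.pem", "ec2-host.example.com")
    print(shlex.join(command))
```

Passing the returned list to `subprocess.run` would open the session; it is only printed here so the sketch runs without a live instance.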

Once you are connected to the instance, you can work just as you would on an actual Linux machine. The following screen shows the result of running the list command (ls) on the instance.

Picture 8 Example command in an EC2 instance

1.2. Amazon S3

Amazon S3, which stands for Amazon Simple Storage Service, is web-based storage for your data and applications. In this lab, you go through how to create a bucket and how to add and remove an object. Before you jump into the hands-on lab, you need to be aware of a few Amazon S3 concepts. Your files are stored as “objects” in a “bucket.” An object consists of a file and, optionally, its metadata. A bucket is a container for objects. You can access S3 through a standard web service interface, the command line interface, or the AWS Management Console.

As you did in the Amazon EC2 lab, you log in to AWS with the awsstudent account provided by QwikLabs. Then, you move to Amazon S3. To create a bucket, you click “Create Bucket” at the top left. (Picture 9)

Picture 9 AWS Management Console for Amazon S3

Here, you set your bucket name, which has to be unique across all of Amazon S3, not just within your account or region. You can also select the region for the bucket. Here, you keep the default region and click “Create.” (Picture 10)

Picture 10 To create a bucket

Now your S3 bucket is created. You move into your bucket by clicking the name of the bucket you created. To upload your object (data) to the bucket, click “Upload.” You will see a dialog box for uploading your objects. You can upload files by clicking “Add Files”, or drag and drop files or folders. (Picture 11)

Picture 11 Upload file/folder to your bucket

If you click “Start Upload”, your upload begins. You can watch the progress in the Transfer panel, and you can see what you have uploaded in the console. (Picture 12) You can also check the path of your object in S3 at the top. In my case it is “All Buckets/choih-mybucket/Python”.

Picture 12 To check what you have uploaded via AWS Management Console

If you select a file and right-click it, you can perform additional actions on the file. (Picture 13) You can open, download, or delete the file, much as you do in the Finder application on your computer. It is as simple as that. To move a file, you select “Cut” in the menu, move to another bucket, and paste the file.

Picture 13 Actions that can be taken for the file

1.3. Amazon EMR

Amazon EMR, which stands for Amazon Elastic MapReduce, is a PaaS (Platform as a Service) in that it provides a pre-built environment for analytics. It processes and analyzes data stored in Amazon S3.

You log in to the AWS Management Console with the awsstudent account provided by QwikLabs. Since Amazon EMR uses Amazon S3 for its data input, output, and logs, you create the S3 bucket first. You create a bucket named emr-bucket-- to make it easily understood and unique, referring to Picture 11. When you select the region, it is better to create the bucket in the same region as your Amazon EMR cluster to avoid cross-region bandwidth charges. The lab instructions didn’t say to create nested folders in your bucket, but it is better to create the nested folders “output” and “logs”. That way, you can easily locate your output and log files later.
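To keep the folder layout consistent, the locations can be derived from the bucket name in one place. The following is a small sketch in plain Python; the helper names `s3_uri` and `emr_locations` and the bucket name `my-emr-bucket` are hypothetical, and only the “output” and “logs” folder names come from the text.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI."""
    key = "/".join(part.strip("/") for part in parts if part)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}/"


def emr_locations(bucket: str) -> dict[str, str]:
    """Return the folder URIs used for the EMR output and log settings."""
    return {
        "output": s3_uri(bucket, "output") + "/",
        "logs": s3_uri(bucket, "logs") + "/",
    }


if __name__ == "__main__":
    # Hypothetical bucket name, used for illustration only.
    for name, uri in emr_locations("my-emr-bucket").items():
        print(f"{name}: {uri}")
```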

Moving on to the Management Console for Amazon EMR, you start creating an Amazon EMR cluster by clicking “Create Cluster”. (Picture 14)

Picture 14 AWS Management Console for Amazon EMR

Next, you go to the advanced options to configure the details of the EMR cluster. On the Advanced Options page, you select “Streaming program” for the “Step type” and click “Configure.” (Picture 15)

Picture 15 Advanced options for Amazon EMR

On the following screen, you set the name, mapper, reducer, input S3 location, and output S3 location as shown in Picture 16. The mapper field points to the Python code used for word count. The code comes from the word count example, so you can take a look at the source code in

Picture 16 Add step for word count process
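To make the roles of the mapper and reducer concrete, here is a minimal word count sketch in the Hadoop streaming style, where each stage reads lines from standard input and writes tab-separated records to standard output. It is not the lab's own sample code; the names `map_words` and `reduce_counts` and the word pattern are assumptions made for illustration.

```python
import re
import sys
from itertools import groupby

# Assumed definition of a word: a letter followed by letters or digits.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_words(lines):
    """Mapper: emit one 'word<TAB>1' record for every word found."""
    for line in lines:
        for word in WORD.findall(line):
            yield f"{word.lower()}\t1"


def reduce_counts(sorted_records):
    """Reducer: sum the counts of consecutive records that share a key.

    Hadoop streaming sorts the mapper output by key before the reducer
    reads it, so equal words arrive next to each other.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Run as "script.py map" or "script.py reduce" with input on stdin.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    stage = map_words if sys.argv[1] == "map" else reduce_counts
    for record in stage(sys.stdin):
        print(record)
```

You can try the same flow locally by piping a text file through the map stage, `sort`, and then the reduce stage, which mimics what the cluster does between the two steps.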

If you click “Add” to add the current settings, you go back to the previous screen. There, you check “Auto-terminate cluster after the last step is completed.” You click “Next” to move on to the hardware options, and click “Next” again, leaving the current settings as they are.

On the next screen, you set the logging folder to the folder you created earlier for the logs, as shown in Picture 17. Then, you click “Next.”

Picture 17 General options for Amazon EMR

On the next screen, you set the EC2 key pair option to “Proceed without an EC2 key pair.” This is similar to the key pair step when you created an EC2 instance. Since you didn’t choose an EC2 key pair, you cannot connect to the cluster over SSH. Leaving all other settings at their defaults, you click “Create cluster.”

You have now launched an EMR cluster. It takes about 10 minutes to launch the cluster instances, load the data files from Amazon S3, process the data, and store the results in Amazon S3. If you click “Cluster List”, you can check from the Status column whether the work is completed. (Picture 18) When the work is completed, the cluster is terminated, as shown in Picture 19.

Picture 18 Amazon EMR cluster list

Picture 19 Amazon EMR cluster list (terminated)

Because the output is stored in Amazon S3 as you configured, you go to the bucket you created, using the AWS Management Console. In the output folder, you will see the output files. (Picture 20)

Picture 20 Amazon EMR output in Amazon S3

As you did with an object in the previous lab, you download an output file and open it with a text editor. (Picture 21)

Picture 21 Word count output
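Instead of reading the downloaded file by eye, you can summarize it with a few lines of Python. The sketch below assumes each output line has the form word, tab, count, which is the usual shape of streaming word count output but is not confirmed by the lab text; the function names are hypothetical. Because a job may write several part files, the parser adds counts together when the same word appears more than once.

```python
def parse_counts(lines):
    """Build a word -> count mapping from 'word<TAB>count' lines."""
    counts = {}
    for line in lines:
        word, separator, count = line.strip().rpartition("\t")
        if not separator:
            continue  # skip blank or malformed lines
        counts[word] = counts.get(word, 0) + int(count)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    # Made-up sample lines standing in for a downloaded output file.
    sample = ["the\t3\n", "cat\t1\n", "\n", "and\t2\n"]
    for word, count in top_words(parse_counts(sample)):
        print(word, count)
```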

2. Lab Questions

1.4. Comment on the benefits and drawbacks of having all your input and output stored on S3.

Storing all of my input and output on Amazon S3 has both benefits and drawbacks. The first benefit is cost-efficiency. Since you store data in the cloud, you are charged based on what you use, and there is no up-front cost. To be specific, you are billed for the instances you use, as you can see on the AWS pricing page. (“AWS | Amazon Elastic MapReduce (EMR) | Pricing,” n.d.) The second benefit is excellent usability. Because Amazon S3 is integrated with Amazon EMR, you can process, analyze, and output the data seamlessly in the cloud. The third benefit is accessibility. Since the input and output data are in the cloud, it is easy to aggregate input data from online sources and to share results with other users on the Internet. Lastly and most importantly, you can focus on core business activities such as analytics. Having fewer things to manage in an organization brings not only cost-effectiveness but also the business agility to cope with today’s rapidly changing business environment.

There are some drawbacks too. The first drawback is the additional work of uploading your data to the cloud and downloading it again, which takes time depending on your network capacity. What is more, storing data in the cloud is restricted for some types of data and by local regulation, so complying with the regulation can bring hidden costs. It is worse if your organization is a multinational corporation, because the relevant laws differ from country to country. As far as security is concerned, the cloud is said to be safe, but people’s doubts about security remain. Another major cloud service provider, Dropbox, was reported to have had user passwords compromised in 2014. (Reilly, n.d.) As the example shows, you still need to pay attention to security.

1.5. How much should organization software depend on AWS? Is it too all-encompassing?

I would answer that it depends on the size and industry of the organization and the sensitivity of its data. As you can see from the AWS Marketplace web page, AWS offers a wide range of software, from infrastructure software to business software. (“AWS Marketplace: Find and Buy Server Software and Services that Run on the AWS Cloud,” n.d.) So if your organization is a small or medium-sized business, it would be cost-effective to adopt AWS throughout the organization, especially if your network traffic is hard to anticipate, as in the game and e-commerce businesses. However, if your organization belongs to a conservative sector such as finance or government, you will be required to handle data in a more proven way. It will take more time to prove that the technology is safe for the business.

Sensitive data should be treated with care in an organization. Business software such as Customer Relationship Management (CRM) and Human Resource Information Systems (HRIS) involves sensitive data and at the same time plays a vital role in an organization. It handles customer or employee data, which is sensitive and private information, so it should not be in the cloud where rules and regulations prohibit it.

Even if all conditions permit moving to AWS, there is one last thing to consider: vendor lock-in. One of AWS’s constant marketing messages is that it keeps lowering its prices; for example, the price reduction this past January was its 51st. (“Happy New Year – EC2 Price Reduction (C4, M4, and R3 Instances) | AWS Blog,” n.d.) However, if your technology depends on AWS, you lose your buying power. (“Vendor Lock-in and the Big Data Ecosystem — What Does it Really Mean?,” n.d.) Thus, it is risky to rely too much on AWS for the reasons above.

1.6. Who would you give access to your S3 buckets for analytic work - remember read-only access is possible.

To answer this question, I assume that I work in product marketing and that there is a separate product sales team. The first thing to do before giving anyone access is to properly mask the sensitive data you have collected. Customers’ personal data, such as personally identifiable information and network connection information, is collected through web logs and marketing events, so I recommend handling the data yourself even if you work with a marketing agency. Before the data is distributed, you should mask the sensitive parts with a suitable technology such as Data Redaction. (“Amazon RDS now supports Oracle Database 12c,” n.d.)

After that, I recommend giving read-only access to the S3 buckets to closely related functional teams such as the product sales team. They can still analyze the data as they intend, and there will be no data loss from human error. In addition, if there are other analytic staff within your team, you could give them full data access. To analyze with the proper factors for a better result, your team members might need to add other data.
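As an illustration of read-only access, the sketch below builds an S3 bucket policy document that lets one principal list a bucket and read its objects but not write or delete. The actions `s3:ListBucket` and `s3:GetObject` are real S3 permissions; the function name, the bucket name, and the role ARN are hypothetical, and this is only one of several ways to grant such access.

```python
import json


def read_only_bucket_policy(bucket: str, principal_arn: str) -> dict:
    """Return a bucket policy allowing list and read access only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                # Listing applies to the bucket itself.
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                # Reading applies to the objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and role, for illustration only.
    policy = read_only_bucket_policy(
        "marketing-analytics-data",
        "arn:aws:iam::123456789012:role/sales-analyst",
    )
    print(json.dumps(policy, indent=2))
```

Team members who need to add data would get a separate statement with write permissions, which keeps the read-only grant for the sales team unchanged.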

3. Learning Achieved

Through these labs, I learned the basic skills to operate Amazon EC2, S3, and EMR. I already knew about Amazon EC2 and S3, but EMR was new to me. While working on the Amazon EMR lab, I was confused about the concepts of cluster and node. I knew the general meaning of the terms, but I was not sure how they apply to EMR. I assumed I could find the answer in the official AWS documentation, so I read the “Amazon Elastic MapReduce Management Guide”. I found that a cluster consists of a group of EC2 instances, which are called nodes and take on different roles. (“Amazon Elastic MapReduce Management Guide,” n.d.) As a result, I understood how the other two products, Amazon EC2 and S3, fit together within EMR.

To sum up, I completed the self-study labs and solved a problem on my own. The labs provided a good foundation for further study of analytics on AWS, and I believe I can use QwikLabs to learn other AWS skills. I also found a good source for solving problems on AWS quickly.

4. Conclusions

I focused on how AWS technology enables a business to concentrate on its core activities without much effort spent on technology, because I believe that is the biggest value of cloud services for a business. This turned out to be true while I conducted the labs. I could launch a virtual instance easily and quickly on AWS; otherwise I would have had to spend days preparing the infrastructure. I could also set up and run EMR quickly, without caring much about the technical infrastructure.

What I need to focus on more as a data scientist from now on is practicing to improve analytic algorithms, interpret results, and draw meaningful insights. I tried the word count example this time, but I can try other examples or other skill sets. Following what the guest speakers Charlie Greenbacker and Jeremy Glesner said during class on 21 March, I’d like to expand my skills to Apache Spark on EMR. As cloud technology spreads further, I expect that getting quickly and directly to the core business value will become essential for a business.

5. References

Amazon Elastic MapReduce Management Guide. (n.d.). Retrieved from

Amazon RDS now supports Oracle Database 12c. (n.d.). Retrieved March 31, 2016, from

AWS | Amazon Elastic MapReduce (EMR) | Pricing. (n.d.). Retrieved March 30, 2016, from

AWS Marketplace: Find and Buy Server Software and Services that Run on the AWS Cloud. (n.d.). Retrieved March 30, 2016, from

Happy New Year – EC2 Price Reduction (C4, M4, and R3 Instances) | AWS Blog. (n.d.). Retrieved March 30, 2016, from

Reilly, C. (n.d.). Hackers hold 7 million Dropbox passwords ransom. Retrieved March 30, 2016, from

Vendor Lock-in and the Big Data Ecosystem — What Does it Really Mean? | SmartData Collective. (n.d.). Retrieved March 30, 2016, from
