Although many organizations choose to run Hadoop in-house, it is also popular to run Hadoop in the cloud on rented hardware or as a service. For instance, Cloudera offers tools for running Hadoop in a public or private cloud, and Amazon has a Hadoop cloud service called Elastic MapReduce.
Hadoop on Amazon EC2
Amazon Elastic Compute Cloud (EC2) is a computing service that allows customers to rent computers (instances) on which they can run their own applications. A customer can launch and terminate instances on demand, paying by the hour for active instances.
The Apache Whirr project (http://incubator.apache.org/whirr) provides a set of scripts that make it easy to run Hadoop on EC2 and other cloud providers. The scripts allow you to perform such operations as launching or terminating a cluster, or adding instances to an existing cluster.
Running Hadoop on EC2 is especially appropriate for certain workflows. For example, if you store data on Amazon S3, then you can run a cluster on EC2 and run MapReduce jobs that read the S3 data and write output back to S3, before shutting down the cluster. If you’re working with longer-lived clusters, you might copy S3 data onto HDFS running on EC2 for more efficient processing, as HDFS can take advantage of data locality, but S3 cannot (since S3 storage is not collocated with EC2 nodes).
Before you can run Hadoop on EC2, you need to work through Amazon’s Getting Started Guide (linked from the EC2 website http://aws.amazon.com/ec2/), which goes through setting up an account, installing the EC2 command-line tools, and launching an instance.
Next, install Whirr, then configure the scripts with your Amazon Web Services credentials, security key details, and the type and size of server instances to use. Detailed instructions for doing this may be found in Whirr’s README file.
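The exact configuration mechanism depends on the version of the scripts you have, so the following is only a sketch: many EC2 tools read the standard AWS credential environment variables, while the key pair name and instance type shown here are illustrative placeholders. Consult the README for the variable names your version actually expects.

```shell
# Sketch only -- variable names vary between script versions; check
# Whirr's README for the ones your version reads.
export AWS_ACCESS_KEY_ID=your-access-key-id          # standard AWS credential
export AWS_SECRET_ACCESS_KEY=your-secret-access-key  # standard AWS credential
export KEY_NAME=my-ec2-keypair    # illustrative: name of your EC2 key pair
export INSTANCE_TYPE=m1.large     # illustrative: EC2 instance size to launch
```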
Launching a cluster
We are now ready to launch a cluster. To launch a cluster named test-hadoop-cluster with one master node (running the namenode and jobtracker) and five worker nodes (running the datanodes and tasktrackers), type:
% hadoop-ec2 launch-cluster test-hadoop-cluster 5
This will create EC2 security groups for the cluster, if they don’t already exist, and give the master and worker nodes unfettered access to one another. It will also enable SSH access from anywhere. Once the security groups have been set up, the master instance will be launched; then, once it has started, the five worker instances will be launched. The worker nodes are launched separately so that the master’s hostname can be passed to the worker instances, allowing the datanodes and tasktrackers to connect to the master when they start up.
To use the cluster, network traffic from the client needs to be proxied through the master node of the cluster using an SSH tunnel, which we can set up using the following command:
% eval `hadoop-ec2 proxy test-hadoop-cluster`
Proxy pid 27134
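The proxy runs in the background, so a job launched later can fail confusingly if the SSH tunnel has died in the meantime. A small hypothetical helper (proxy_alive is our own sketch, not part of the Hadoop scripts) can check the tunnel before submitting work:

```shell
# Hypothetical helper (not part of the hadoop-ec2 scripts): succeeds
# if the process with the given pid is still running. kill -0 sends
# no signal; it only tests for the existence of the process.
proxy_alive() {
  kill -0 "$1" 2>/dev/null
}

# Example use, relying on the HADOOP_CLOUD_PROXY_PID variable set by
# the proxy command:
# proxy_alive "$HADOOP_CLOUD_PROXY_PID" || echo "proxy down; re-run hadoop-ec2 proxy"
```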
Running a MapReduce job
You can run MapReduce jobs either from within the cluster or from an external machine. Here we show how to run a job from the machine we launched the cluster on. Note that this requires that the same version of Hadoop has been installed locally as is running on the cluster.
When we launched the cluster, a hadoop-site.xml file was created in the directory ~/.hadoop-cloud/test-hadoop-cluster. We can use this to connect to the cluster by setting the HADOOP_CONF_DIR environment variable as follows:
% export HADOOP_CONF_DIR=~/.hadoop-cloud/test-hadoop-cluster
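If the launch failed partway, this directory may be missing or incomplete, so it is worth confirming that the generated configuration is actually there before going further. A hypothetical check (check_conf_dir is our own helper, not part of the scripts):

```shell
# Hypothetical helper: verify that a client configuration directory
# holds the hadoop-site.xml generated at cluster-launch time.
check_conf_dir() {
  if [ -f "$1/hadoop-site.xml" ]; then
    echo "ok: $1"
  else
    echo "missing hadoop-site.xml in $1" >&2
    return 1
  fi
}

# Example:
# check_conf_dir ~/.hadoop-cloud/test-hadoop-cluster && \
#   export HADOOP_CONF_DIR=~/.hadoop-cloud/test-hadoop-cluster
```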
The cluster’s filesystem is empty, so before we run a job, we need to populate it with data. Doing a parallel copy from S3 using Hadoop’s distcp tool is an efficient way to transfer data into HDFS:
% hadoop distcp s3n://hadoopbook/ncdc/all input/ncdc/all
After the data has been copied, we can run a job in the usual way:
% hadoop jar job.jar MaxTemperatureWithCombiner input/ncdc/all output
Alternatively, we could have specified the input to be S3, which would have the same effect. When running multiple jobs over the same input data, it’s best to copy the data to HDFS first to save bandwidth:
% hadoop jar job.jar MaxTemperatureWithCombiner s3n://hadoopbook/ncdc/all output
You can track the progress of the job using the jobtracker’s web UI, found at
http://master_host:50030/. To access web pages running on worker nodes, you need to set up a proxy auto-config (PAC) file in your browser. See the Whirr documentation for details on how to do this.
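A minimal PAC file might look like the sketch below. The SOCKS port (6666 here) and the hostname patterns are assumptions: verify them against the output of the proxy command and your instances’ actual internal hostnames, and treat the Whirr documentation as the authoritative version.

```shell
# Sketch: write a minimal PAC file that routes EC2-internal hostnames
# through the SOCKS tunnel set up by "hadoop-ec2 proxy". The port
# (6666) and the hostname patterns are assumptions -- check them
# against your own setup.
write_pac() {
  cat > "$1" <<'EOF'
function FindProxyForURL(url, host) {
  // EC2-internal names go through the tunnel; everything else direct.
  if (shExpMatch(host, "*.ec2.internal") ||
      shExpMatch(host, "*.compute-1.amazonaws.com")) {
    return "SOCKS localhost:6666";
  }
  return "DIRECT";
}
EOF
}

# Example: write_pac ~/hadoop-cluster.pac, then point your browser's
# proxy auto-config URL at that file.
```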
Terminating a cluster
To shut down the cluster, issue the terminate-cluster command:
% hadoop-ec2 terminate-cluster test-hadoop-cluster
You will be asked to confirm that you want to terminate all the instances in the cluster.
Finally, stop the proxy process (the HADOOP_CLOUD_PROXY_PID environment variable was set when we started the proxy):
% kill $HADOOP_CLOUD_PROXY_PID
Learn more about this topic from Hadoop: The Definitive Guide, 2nd Edition.