Introduction
New applications such as web search, recommendation engines, machine learning, and social networking generate vast amounts of data in the form of logs, blogs, email, and other unstructured information streams. This data needs to be processed and correlated to gain insight into today’s business processes, and the need to keep both structured and unstructured data demands the storage, processing, and analysis of data at large scale.
There are many ways to process and analyze large volumes of data at massively parallel scale, and Hadoop is an often-cited example of a massively parallel processing system. This blog briefly introduces Hadoop and walks through running a few simple programs on it.
What is Hadoop?
Hadoop is an open-source implementation of a large-scale batch processing system. It uses the MapReduce framework introduced by Google, which leverages the map and reduce functions well known from functional programming. Although the Hadoop framework is written in Java, it allows developers to deploy custom-written programs, in Java or other languages, to process data in parallel across hundreds or thousands of commodity servers. It is optimized for contiguous read requests (streaming reads), where processing consists of scanning all the data. Depending on the complexity of the process and the volume of data, response time can vary from minutes to hours. While Hadoop can process data quickly, its key advantage is its massive scalability.
Hadoop leverages a cluster of nodes to run MapReduce programs massively in parallel. A MapReduce program consists of two steps: the Map step processes input data and the Reduce step assembles intermediate results into a final result. Each cluster node has a local file system and local CPU on which to run the MapReduce programs. Data is broken into blocks, stored across the local file systems of different nodes, and replicated for reliability; together these local files constitute the Hadoop Distributed File System (HDFS). The number of nodes in a cluster can range from hundreds to thousands of machines, and Hadoop tolerates node failures by re-running the affected tasks on other nodes.
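To make the two steps concrete, here is a minimal in-memory sketch of the word-count MapReduce logic in plain Java. No Hadoop is involved, and the class and method names are purely illustrative; the "shuffle" that groups intermediate pairs by key is what the framework does for you between the Map and Reduce steps.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {

    // Map step: emit a (word, 1) pair for every word in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce step: sum all the counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello world", "hello hadoop");
        // Shuffle: group the intermediate pairs by key, as the framework would.
        Map<String, List<Integer>> grouped = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        grouped.forEach((w, cs) -> System.out.println(w + "\t" + reduce(w, cs)));
    }
}
```

In the real framework the Map and Reduce steps run on different nodes and the grouped intermediate data is moved over the network, but the logic is the same.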
Hadoop is currently used for indexing web searches, email spam detection, recommendation engines, prediction in financial services, genome analysis in life sciences, and analysis of unstructured data such as logs, text, and clickstreams. But before we get there, let's learn how to install it and run a few simple programs.
Hadoop on your Laptop!
Requirement
Make sure you have a reasonably powerful laptop with at least 6 GB of RAM. You will need the Hortonworks Sandbox with HDP 2.3 running in a virtual machine (Oracle VirtualBox in this example). It can be downloaded from http://hortonworks.com/products/hortonworks-sandbox/#install
Accessing Hadoop
Once the virtual machine is up, Hadoop is accessed in your browser at 127.0.0.1:8888. After this, you need to log in to the SSH client, whose address is listed on that page under the advanced options. Below is a screenshot of the same.
Link: The Secure Shell client can be accessed at 127.0.0.1:4200
Login : root
Password: hadoop
Once logged in, we will create a directory named WCclasses in the home directory with the command "mkdir WCclasses". We will then create three programs, namely WordMapper.java, SumReducer.java, and WordCount.java, using the vi editor. These source files live in your home directory and are compiled with the javac command; the resulting class files go into the WCclasses directory created earlier.
Program 1: WordCount.java
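The listing itself did not survive here. As a reference, the driver follows the standard Apache Hadoop WordCount example, adapted to the WordMapper and SumReducer class names used in this post; treat this as a sketch rather than the exact code from the original screenshot.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input directory, args[1] = HDFS output directory
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Compiling and running this requires the Hadoop jars on the classpath, as shown in the javac commands below.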


Compiling WordCount.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordCount.java
Program 2: WordMapper.java
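Again the listing is missing; the standard Apache tutorial mapper, renamed to WordMapper as in this post, looks like the following sketch. It tokenizes each input line and emits a (word, 1) pair per token.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }
}
```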

Compiling WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
Program 3: SumReducer.java
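The missing reducer, per the standard Apache tutorial with the SumReducer name used here, sums the counts the framework has grouped under each word; a sketch:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers for this word.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total)
    }
}
```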

Compiling SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
Creating JAR file
The command below packages the compiled class files in the WCclasses directory into WordCount.jar:
jar -cvf WordCount.jar -C WCclasses/ .

Below is the screenshot for the same

Once the JAR file is created, you need to create the input directory in the HDFS file system using the commands below.
hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp
Loading files into HUE
Inputting the .txt files to be read into HUE
Now you need to access HUE at 127.0.0.1:8000 to upload the .txt files to be read. Thankfully, this is drag and drop and does not involve writing any commands.
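If you prefer the command line to HUE's drag and drop, the same upload can be sketched with hdfs dfs -put (the local file name input.txt is purely illustrative):

```shell
# Copy a local text file (here input.txt, as an example) into the HDFS input directory
hdfs dfs -put input.txt /user/ru1/wc-inp
# Verify the upload
hdfs dfs -ls /user/ru1/wc-inp
```

Either way, the files end up in /user/ru1/wc-inp, which is the input path passed to the job below.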

Once the files are uploaded, use the command below to run the final Hadoop job.
Final Execution
hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out41
Notice that we have not created wc-out41 ourselves. Hadoop creates the output directory itself when the command is run; in fact, the job will fail if the output directory already exists.

You can track your job at 127.0.0.1:8088, which lists the log and status of all jobs.
Once the job shows as Finished and Succeeded, we are on our way.

Directory for Checking final Output
We go to the directory that was created during execution of the program, /user/ru1/wc-out41.


The file to look at is part-r-00000, which contains the output.
OUTPUT
And that’s it; you can have a look at the output below. The file can be accessed from the wc-out41 directory in HUE.
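As an alternative to browsing it in HUE, the output file can also be printed straight from HDFS at the shell:

```shell
# Print the job output (word <TAB> count lines) directly from HDFS
hdfs dfs -cat /user/ru1/wc-out41/part-r-00000
```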



You can try the same if you're new to Hadoop. Hope this blog helps.
Thanks for reading! :)