Introduction
New applications such as web search, recommendation engines, machine learning, and social networking generate vast amounts of data in the form of logs, blogs, email, and other unstructured information streams. This data needs to be processed and correlated to gain insight into today’s business processes, and the need to keep both structured and unstructured data demands the storage, processing, and analysis of data at large scale.
There are many ways to process and analyze large volumes of data at massively parallel scale, and Hadoop is an often-cited example of a massively parallel processing system. This blog briefly introduces Hadoop and walks through running a few simple programs on it.
What is Hadoop?
Hadoop is an open-source implementation of a large-scale batch processing system. It uses the MapReduce framework introduced by Google, which leverages the map and reduce functions well known from functional programming. Although the Hadoop framework is written in Java, it allows developers to deploy custom-written programs, in Java or other languages, to process data in parallel across hundreds or thousands of commodity servers. It is optimized for contiguous read requests (streaming reads), where processing consists of scanning all the data. Depending on the complexity of the process and the volume of data, response time can vary from minutes to hours. While Hadoop can process data quickly, its key advantage is its massive scalability.
Hadoop leverages a cluster of nodes to run MapReduce programs massively in parallel. A MapReduce program consists of two steps: the Map step processes input data and the Reduce step assembles intermediate results into a final result. Each cluster node has a local file system and local CPU on which to run the MapReduce programs. Data is broken into blocks, stored across the local file systems of different nodes, and replicated for reliability; together these local files constitute the Hadoop Distributed File System (HDFS). The number of nodes in a cluster can range from hundreds to thousands of machines, and Hadoop tolerates node failures by re-running the affected tasks on other nodes.
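To make the two steps concrete, here is a minimal in-memory sketch of the word-count MapReduce logic in plain Java. No Hadoop is involved, and the class and method names are purely illustrative; the "shuffle" that groups intermediate pairs by key is what the framework does for you between the Map and Reduce steps.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {

    // Map step: emit a (word, 1) pair for every word in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce step: sum all the counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello world", "hello hadoop");
        // Shuffle: group the intermediate pairs by key, as the framework would.
        Map<String, List<Integer>> grouped = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        grouped.forEach((w, cs) -> System.out.println(w + "\t" + reduce(w, cs)));
    }
}
```

In the real framework the Map and Reduce steps run on different nodes and the grouped intermediate data is moved over the network, but the logic is the same.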
Hadoop is currently used for indexing web searches, email spam detection, recommendation engines, prediction in financial services, genome analysis in life sciences, and analysis of unstructured data such as logs, text, and clickstreams. But before we get there, let's learn how to install it and run a few simple programs.
Hadoop on your Laptop!
Requirement
Make sure you have a reasonably powerful laptop with at least 6 GB of RAM. You will need the Hortonworks Sandbox with HDP 2.3 running in a virtual machine (Oracle VirtualBox in this example). It can be downloaded from http://hortonworks.com/products/hortonworks-sandbox/#install
Accessing Hadoop
Once the virtual machine is up, Hadoop is accessed in your browser at 127.0.0.1:8888. After this, you need to log in to the SSH client, whose address is listed on that page under the advanced options. Below is a screenshot of the same.
Link: The Secure Shell client can be accessed at 127.0.0.1:4200
Login : root
Password: hadoop
Once logged in, we will create a directory named WCclasses in the home directory with the command "mkdir WCclasses". We will then create three programs, namely WordMapper.java, SumReducer.java, and WordCount.java, using the vi editor. These source files live in your home directory and are compiled with the javac command; the resulting class files go into the WCclasses directory created earlier.
Program 1: WordCount.java
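The listing itself did not survive here. As a reference, the driver follows the standard Apache Hadoop WordCount example, adapted to the WordMapper and SumReducer class names used in this post; treat this as a sketch rather than the exact code from the original screenshot.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input directory, args[1] = HDFS output directory
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Compiling and running this requires the Hadoop jars on the classpath, as shown in the javac commands below.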


Compiling WordCount.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordCount.java
Program 2: WordMapper.java
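Again the listing is missing; the standard Apache tutorial mapper, renamed to WordMapper as in this post, looks like the following sketch. It tokenizes each input line and emits a (word, 1) pair per token.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }
}
```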

Compiling WordMapper.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses WordMapper.java
Program 3: SumReducer.java
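The missing reducer, per the standard Apache tutorial with the SumReducer name used here, sums the counts the framework has grouped under each word; a sketch:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers for this word.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total)
    }
}
```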

Compiling SumReducer.java
javac -classpath /usr/hdp/2.3.0.0-2557/hadoop/hadoop-common-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/hadoop-mapreduce-client-core-2.7.1.2.3.0.0-2557.jar:/usr/hdp/2.3.0.0-2557/hadoop-mapreduce/commons-cli-1.2.jar -d WCclasses SumReducer.java
Creating JAR file
The command below packages the compiled class files in the WCclasses directory into WordCount.jar:
jar -cvf WordCount.jar -C WCclasses/ .

Below is the screenshot for the same

Once the JAR file is created, you need to create the input directory in the HDFS file system using the commands below.
hdfs dfs -mkdir /user/ru1
hdfs dfs -ls /user/ru1
hdfs dfs -mkdir /user/ru1/wc-inp
hdfs dfs -ls /user/ru1/wc-inp
Loading files into HUE
Inputting the .txt files to be read into HUE
Now you need to access HUE at 127.0.0.1:8000 to upload the .txt files to be read. Thankfully, this is drag and drop and does not involve writing any commands.
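If you prefer the command line to HUE's drag and drop, the same upload can be sketched with hdfs dfs -put (the local file name input.txt is purely illustrative):

```shell
# Copy a local text file (here input.txt, as an example) into the HDFS input directory
hdfs dfs -put input.txt /user/ru1/wc-inp
# Verify the upload
hdfs dfs -ls /user/ru1/wc-inp
```

Either way, the files end up in /user/ru1/wc-inp, which is the input path passed to the job below.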

Once the files are uploaded, use the command below to run the final Hadoop job.
Final Execution
hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out41
Notice that we have not created wc-out41 ourselves. Hadoop creates the output directory itself when the command is run; in fact, the job will fail if the output directory already exists.

You can track your job at 127.0.0.1:8088, which lists the log and status of all jobs.
Once the job shows as Finished and Succeeded, we are on our way.

Directory for Checking final Output
We go to the directory that was created during execution of the program, /user/ru1/wc-out41.


The file to look at is part-r-00000, which contains the output.
OUTPUT
And that’s it; you can have a look at the output below. The file can be accessed from the wc-out41 directory in HUE.
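As an alternative to browsing it in HUE, the output file can also be printed straight from HDFS at the shell:

```shell
# Print the job output (word <TAB> count lines) directly from HDFS
hdfs dfs -cat /user/ru1/wc-out41/part-r-00000
```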



You can try the same if you're new to Hadoop. Hope this blog helps.
Thanks for reading! :)