Friday, 18 December 2015

Hive - The de facto standard for SQL queries in Hadoop

In my previous blogs, we were introduced to Hadoop and then to Pig, a scripting language with a focus on dataflow. In this section we will learn about Apache Hive and run queries in HQL, its near-SQL query language.

Introduction

Although Pig is a powerful and fairly simple language to use, the downside is that it is something new to learn and master. Some folks at Facebook developed a runtime Hadoop support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate.

Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. Be aware that HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across the Hadoop cluster.

For anyone with a SQL or relational database background, this section will look very familiar to you. As with any database management system (DBMS), you can run your Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in a middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).

Hive looks very much like a traditional database with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. The second is that Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.


Getting Started

We will be using the same data that we used in my previous blog on Pig, namely the files Batting.csv and Master.csv.


Data Source

http://seanlahman.com/files/database/lahman591-csv.zip


Accessing Hue

You can access Hue by entering the address 127.0.0.1:8000 in your browser.
Login ID: hue
Password: 1111


Uploading Data

Data is uploaded into the /user/hue directory of the HDFS file system. The steps to upload the files into this directory are available in my previous blogs.

Once the files are uploaded, they should look like this:

[Screenshot: the uploaded files in the file browser]



Starting the Hive View

Click on the top left corner of your page, then click on the second icon, which says Beeswax (Hive UI).


Writing Queries

Click on the Query Editor to go to the query page.


Queries

The first step is to create a temp_batting table, using the CREATE TABLE query shown below.

[Screenshot: the CREATE TABLE query for temp_batting]
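
A sketch of that statement in text form: it creates a one-column table, and the column name col_value is the one referenced by the extraction query further down this post.

-- one raw CSV line per row; col_value is read by regexp_extract later
CREATE TABLE temp_batting (col_value STRING);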


To see whether the statement executed correctly, click on Tables and you will see your temp_batting table. Tick the checkbox next to it and click on View, as per the screenshot below.

[Screenshot: the temp_batting table in the Tables list]



Loading Batting.csv into temp_batting

Next we load the baseball batting data into our temp table with the query below.

[Screenshot: the LOAD DATA query]
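
As a sketch, assuming the file was uploaded to /user/hue as described earlier, the load statement looks like this:

-- the path /user/hue/Batting.csv is an assumption based on the upload step above
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;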

Once the data is loaded, we check that sample records made it into the temp_batting table with the command below.

[Screenshot: the query displaying 100 records from temp_batting]
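
A minimal version of that check simply selects the first 100 raw rows:

-- sample the loaded lines
SELECT * FROM temp_batting LIMIT 100;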

Once you execute the query, you will see the results as shown in the screenshot below.

[Screenshot: temp_batting data display]

Now we create a new table called batting, which will contain three columns: player_id, year, and the number of runs.

[Screenshot: the CREATE TABLE query for batting]
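
A sketch of that statement; the column types here are my assumption (a string id plus integer year and runs):

-- names follow the description above; types are assumed
CREATE TABLE batting (player_id STRING, year INT, runs INT);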



[Screenshot: the columns of the batting table]

Now we extract the data we want from temp_batting and copy it into batting. We do this with a regexp pattern, building a multi-line query.

The first line overwrites the (currently empty) batting table with the result of the SELECT. The next three lines extract the player_id, year, and runs fields from the temp_batting table; each regexp_extract call pulls out one comma-separated field of the raw line (the 1st, 2nd, and 9th, respectively).

INSERT OVERWRITE TABLE batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
FROM temp_batting;

Once the query is executed, we can check the job status by entering the address 127.0.0.1:8088 in the web browser, which shows a status page like the one below.

[Screenshot: the application status page at 127.0.0.1:8088]

Now we group the data by year, finding the maximum number of runs in each year, with the query below.

[Screenshot: the GROUP BY query]
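
A sketch of that query; it matches the subquery reused in the join further down:

-- highest run total recorded in each year
SELECT year, max(runs) runs FROM batting GROUP BY year;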

Output

[Screenshot: the maximum runs for each year]


Now we need to figure out which player scored the maximum number of runs in each year. We can do this with the command below: the subquery finds the highest run total per year, and the join back against batting picks out the player who achieved it.

SELECT a.year, a.player_id, a.runs
FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);


[Screenshot: the join query in the editor]



FINAL OUTPUT

[Screenshot: the final output, the top scorer for each year]




Hope you liked the blog!

Thanks for reading :)












