In my previous blogs, we were introduced to Hadoop and then to Pig, a scripting language with a focus on dataflow. In this section we will learn about Apache Hive, and we will run queries in HQL, Hive's almost-SQL query language.
Introduction
Although Pig can be quite a powerful and simple language to use, the downside is that it’s something new to learn and master. Some folks at Facebook developed a runtime Hadoop support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate.
Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; now you should be aware that HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
For anyone with a SQL or relational database background, this section will look very familiar to you. As with any database management system (DBMS), you can run your Hive queries in many ways. You can run them from a command line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application leveraging the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user’s client machine (or in a middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix).
Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
Getting Started
We will be using the same data that we used in my previous blog on Pig. Namely, files batting.csv and master.csv.
Data Source
http://seanlahman.com/files/database/lahman591-csv.zip
Accessing Hue
You can access Hue by entering the address 127.0.0.1:8000 in your web browser.
Login Id : Hue
Pwd : 1111
Uploading Data
Data is uploaded into the user/hue directory in the HDFS file system. The steps to upload the files into this directory are available in my previous blogs.
Once the files are uploaded they should look like this

Starting the HIVE view
Click on the menu at the top left corner of the page, then click on the 2nd icon, which says Beeswax (Hive UI).
Writing Queries
Click on the query editor to go to the query page.
Queries
The 1st step is to create a temp_batting table.
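The create-table statement itself didn't survive in this post; a minimal version, consistent with the col_value column referenced by the extraction query later, would be:

```sql
-- Staging table: each row holds one raw line of Batting.csv as a single string
CREATE TABLE temp_batting (col_value STRING);
```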
To check that the query executed correctly, click on Tables; you should see your temp_batting table. Tick the checkbox next to it and click View, as per the screenshot below.

Loading Batting.csv into temp_batting
Next we load the baseball scores into our temp table with the below code.
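The load statement isn't shown; assuming Batting.csv was uploaded to user/hue as described above, it would look something like this:

```sql
-- Move the uploaded CSV out of its HDFS location and into the staging table
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;
```

Note that LOAD DATA INPATH moves (rather than copies) the file from its original HDFS location.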

Once the data is loaded we check if the samples are recorded in temp_batting table with the below command.
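The exact command isn't shown; a simple way to sample the loaded rows is:

```sql
-- Peek at the first 100 raw lines of the staging table
SELECT * FROM temp_batting LIMIT 100;
```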

Once you execute the query you will see the results as shown in the screenshot below.

Now we create a new table called batting, which will contain 3 columns: player_id, year, and runs.
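The statement itself isn't shown; a sketch consistent with the three columns just listed:

```sql
-- Final table: one row per (player, year) with the runs scored
CREATE TABLE batting (player_id STRING, year INT, runs INT);
```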

Table Batting columns

Now we extract the data we want from temp_batting and copy it into batting. We do this with a regexp pattern, building a multi-line query.
The 1st line overwrites any existing data in the batting table. The next 3 lines extract the player_id, year, and runs fields from each raw line in temp_batting.
insert overwrite table batting
SELECT
  regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,
  regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,
  regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run
from temp_batting;
Once the query is executed, we can see the job status by entering 127.0.0.1:8088 in the web browser, which shows a status page like the one below.

Now we group the data by year with the below code
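The grouping query isn't shown in the post; one that finds the maximum runs per year (and matches the subquery used in the final join) would be:

```sql
-- Highest run total recorded in each year
SELECT year, max(runs) FROM batting GROUP BY year;
```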

Output

Now we need to figure out which player scored the maximum runs in each year. We can do this with the below command.
SELECT a.year, a.player_id, a.runs FROM batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
ON (a.year = b.year AND a.runs = b.runs);

FINAL OUTPUT

Hope you liked the blog!
Thanks for reading :)