All About Hive in Hadoop by RCP Technologies Pvt Ltd

HIVE

Hive is a component of Hadoop built on top of HDFS; it is a data warehouse system for Hadoop.

Using Hive, we can perform querying in Hadoop with HQL (HiveQL), a SQL-like query language.

 

MetaStore:–

*********

It is the internal data store of Hive, which consists of

1. Table schemas

2. Column information

3. Partition information, etc.

 

Note:– If the MetaStore is not running, we cannot do anything in Hive, because no metadata can be fetched from it.

 

By default, Hive comes with the embedded "Derby" database as its metastore.

We can also configure other databases as our metastore.

Ex:- MySQL, or a remote metastore accessed via the Thrift server.

The default warehouse directory of Hive is /user/hive/warehouse (in HDFS).

 

 

 

Managed Tables:–

**************

Managed tables are managed by the Hive system itself.

The data we store in Hive managed tables is kept under the default Hive warehouse directory (/user/hive/warehouse); dropping a managed table deletes its data as well.

 

The modes of loading data into Hive tables are

1. Local mode: load data local inpath … (copies the file from the local file system)

2. HDFS mode: load data inpath … (moves the file from its HDFS location)

 

 

#hive

hive> create table Emp(eId int, eName string, eSal double)

row format delimited

fields terminated by '|'

stored as textfile;

 

InputFile:–

 

1000|Praveen|20000

1001|Raju|30000

1002|Venu|40000

1003|Srinu|50000

 

 

hive> load data local inpath '/home/praveen/Praveen/HIVE/empData.txt'

into table Emp;

 

hive>select * from Emp;

 

hive>quit;

 

#hadoop fs -mkdir /HIVE

#hadoop fs -put 'empData.txt' /HIVE

#hive

 

 

hive> load data inpath '/HIVE/empData.txt' into table Emp;

#hadoop fs -put 'empData.txt' /HIVE

hive> load data inpath '/HIVE/empData.txt' overwrite into table Emp;

#/Praveen/HIVE$ hadoop fs -ls /user/hive/warehouse/emp

hive> drop table Emp;

#/Praveen/HIVE$ hadoop fs -ls /user/hive/warehouse

 

 

External Tables:–

***************

An external table's data is external to the Hive warehouse; Hive does not manage the storage location (dropping the table leaves the data in place).

The table data is not stored under the default Hive warehouse directory; it is stored at the external HDFS location mentioned in the external table definition.

 

How to configure the Metastore:–

——————————

 

The metastore connection is configured in hive-site.xml.
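As a sketch, a MySQL-backed metastore can be configured in hive-site.xml with the JDBC connection properties below; the host, database name, and credentials are placeholders, not values from this course:

```xml
<configuration>
  <!-- JDBC URL of the metastore database (host and db name are placeholders) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <!-- JDBC driver class for MySQL -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- Credentials for the metastore database (placeholders) -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```

The MySQL JDBC driver jar must also be on Hive's classpath for this configuration to work.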

 

hive> create external table extEmp(eId int, eName string, eSal double)

row format delimited

fields terminated by '|'

stored as textfile

location '/Praveen/HT/extEmp1';

 

#hadoop fs -put 'empData.txt' /HIVE

hive> load data inpath '/HIVE/empData.txt' into table extEmp;

 

 

Altering the Table:–

******************

 

hive>alter table Emp RENAME to newEmp;

 

Partition:–

*********

 

Partitioning groups similar rows based on the values of designated partition columns; instead of scanning the whole table's data set, a query only has to scan the relevant partition.

 

 

hive> create table partEmp(eId int, eName string, eSal double)

PARTITIONED BY(eLoc string)

row format delimited

fields terminated by '|'

stored as textfile;

 

hive> load data inpath '/HIVE/empNewData.txt' into table partEmp

PARTITION (eLoc='Hyd');

 

hive> load data local inpath './empNewData.txt'

> into table partEmp PARTITION (eLoc='Ban');

 

hive> select * from partEmp;

 

hive> select * from partEmp where eLoc='Hyd';

 

Create Table as select:-

***********************

 

hive> create table asEmp AS select eName,eLoc from partEmp;

hive> select * from asEmp;

hive> create table hydEmp AS select eName,eSal from partEmp where eLoc='Hyd';

hive> select * from hydEmp;

 

Complex Data Types:–

*****************

There are three complex data types in Hive:

 

arrays: An ordered collection of elements. The elements in the array must all be of the same type.

——

map: An unordered collection of key-value pairs. Keys must be of primitive types; values can be of any type.

—-

struct: A collection of elements of different types.

——

 

Input file

———-

 

1     Hyderabad,AP      Sagar,Musi  schools=154,collages=71

2     Banglore,KA Tunga collages=197,schools=31

 

hive> CREATE EXTERNAL TABLE cities(id int, locale struct<city:string,state:string>, lakes array<string>, attributes map<string,string>)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t'

COLLECTION ITEMS TERMINATED BY ','

MAP KEYS TERMINATED BY '='

STORED AS TEXTFILE

LOCATION '/Praveen/HIVE/cities';

 

hive> load data local inpath 'citiesData.txt' overwrite into table cities;

 

hive> select * from cities;

 

hive> SELECT id, locale.city, lakes[0], attributes['schools'] FROM cities;

 

Joins:–

******

 

Input File

 

10001|Georgi Facello|M|d005

10002|Bezalel Simmel|F|d007

10003|Parto Bamford|M|d004

10004|Chirstian Koblick|M|d004

10005|Kyoichi Maliniak|M|d003

10006|Anneke Preusig|F|d005

10009|Sumant Peac|F|d006

 

 

hive> create table joinEmp (eId int, eName string, eGen string, dId string)

row format delimited

fields terminated by '|'

stored as textfile;

 

hive> load data local inpath './empData.txt' into table joinEmp;

 

 

hive>create table joinDept(dId string, dName string)

row format delimited

fields terminated by '|'

stored as textfile;

 

hive> load data local inpath 'deptData.txt' into table joinDept;

 

hive> create table joinEmpDept as select e.eId, e.eName, d.dName from joinemp e

JOIN joindept d on (e.dId=d.dId);

 

hive> select * from newemp sort by eSal; (sorts within each reducer; multiple reducers allowed)

hive> select * from newemp order by eId; (total ordering; forces a single reducer)

hive> select eLoc, count(*) from partEmp group by eLoc;

hive> select * from newemp distribute by eName;

hive> select * from newemp distribute by eName sort by eName;

 

hive> select * from newemp cluster by eName; (shorthand for distribute by and sort by on the same column)

 

 

UDF’s:–

******

 

If we cannot achieve the required functionality using Hive's built-in functions, we can write Hive User Defined Functions (UDFs).

 

Steps for Creating a UDF:–

 

1. Our class should extend the UDF base class.

2. The business logic should be part of the evaluate() method.

3. Build a jar and copy the jar file to the cluster environment where Hive is running.

4. Explicitly add the exported jar file to the Hive classpath.

Syntax:- add jar <jar file name>

5. Create a temporary function out of the class in the jar file.

hive> create temporary function <fnName> as '<ClassName>';

 

 

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
    private Text result = new Text();

    // Strips whitespace from both ends of the input.
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }

    // Strips any of the supplied characters from both ends of the input.
    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }
}

 

hive> add jar HiveUdf.jar;

hive> create temporary function myStrip as 'Strip';

 

hive> select eId,dId from joinemp;

hive> select eId,myStrip(dId,'d') from joinemp;
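To see what myStrip(dId,'d') produces on the join data, the strip behavior the UDF delegates to (Commons Lang's StringUtils.strip) can be sketched in plain Java, with no Hive or Commons Lang dependency; the class and method names here are illustrative only:

```java
// Plain-Java sketch of the strip logic the Strip UDF relies on:
// remove any of the given characters from both ends of a string.
public class StripSketch {

    // Mimics StringUtils.strip(str, stripChars): trims characters found in
    // stripChars from the start and end; a null stripChars means whitespace.
    public static String strip(String str, String stripChars) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        int start = 0;
        int end = str.length();
        while (start < end && isStripChar(str.charAt(start), stripChars)) {
            start++;
        }
        while (end > start && isStripChar(str.charAt(end - 1), stripChars)) {
            end--;
        }
        return str.substring(start, end);
    }

    private static boolean isStripChar(char c, String stripChars) {
        return stripChars == null ? Character.isWhitespace(c)
                                  : stripChars.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        // On the join data, myStrip(dId, 'd') turns "d005" into "005".
        System.out.println(strip("d005", "d"));    // prints 005
        System.out.println(strip("  hi  ", null)); // prints hi
    }
}
```

So the query above returns each employee's department id with the leading 'd' removed.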

 

 

UDAFs:– (User Defined Aggregate Functions)

******

 

 

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class Maximum extends UDAF {

    public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {

        private IntWritable result;

        // Called once per aggregation to reset the evaluator's state.
        public void init() {
            result = null;
        }

        // Called for each input row; keeps the running maximum.
        public boolean iterate(IntWritable value) {
            if (value == null) {
                return true;
            }
            if (result == null) {
                result = new IntWritable(value.get());
            } else {
                result.set(Math.max(result.get(), value.get()));
            }
            return true;
        }

        // Returns the partial aggregate for this task.
        public IntWritable terminatePartial() {
            return result;
        }

        // Combines a partial aggregate produced by another task.
        public boolean merge(IntWritable other) {
            return iterate(other);
        }

        // Returns the final result of the aggregation.
        public IntWritable terminate() {
            return result;
        }
    }
}

 

 

 

hive> create temporary function myMax as 'Maximum';

 

hive> select myMax(eId) from joinemp;
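The evaluator lifecycle Hive drives for this query (init, iterate per row, terminatePartial per task, merge across tasks, terminate) can be simulated in plain Java; this sketch uses Integer in place of Hadoop's IntWritable, and the class name and maxOf helper are illustrative, not part of Hive:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the UDAF evaluator lifecycle: Hive normally calls
// these methods itself across map and reduce tasks.
public class MaximumSketch {
    private Integer result;

    public void init() { result = null; }

    // Per-row update; NULL inputs are skipped, like in the UDAF above.
    public boolean iterate(Integer value) {
        if (value == null) return true;
        result = (result == null) ? value : Math.max(result, value);
        return true;
    }

    public Integer terminatePartial() { return result; }

    // Folding in another task's partial result is just another iterate.
    public boolean merge(Integer other) { return iterate(other); }

    public Integer terminate() { return result; }

    // Splits the input across two "mappers", then merges on the "reducer".
    public static Integer maxOf(List<Integer> values) {
        MaximumSketch left = new MaximumSketch();
        MaximumSketch right = new MaximumSketch();
        left.init();
        right.init();
        int mid = values.size() / 2;
        for (Integer v : values.subList(0, mid)) left.iterate(v);
        for (Integer v : values.subList(mid, values.size())) right.iterate(v);
        MaximumSketch finalEval = new MaximumSketch();
        finalEval.init();
        finalEval.merge(left.terminatePartial());
        finalEval.merge(right.terminatePartial());
        return finalEval.terminate();
    }

    public static void main(String[] args) {
        // With the eId values from the join data, the maximum is 10009.
        System.out.println(maxOf(Arrays.asList(10001, 10006, null, 10009, 10002)));
    }
}
```

This mirrors why merge simply calls iterate in the Maximum UDAF: combining two partial maxima is the same operation as folding in one more value.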

 

 

 

add jar ${HIVE_HOME}/lib/hive-contrib-0.9.0.jar;
CREATE TEMPORARY FUNCTION row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';

SELECT row_sequence(), col
FROM (
SELECT col1 AS col
FROM sample_table
) t2;

 

Annotations exist for UDFs that can make functions easier to use and can even improve the performance of some Hive queries: Deterministic, Stateful, and DistinctLike.

Deterministic: The function always returns the same output for the same input, so Hive may evaluate it once and reuse the result.

Stateful: The function keeps state across rows; it can be used only in the select clause, and a function annotated as stateful is internally treated as non-deterministic.

DistinctLike: Used for cases where the function behaves like DISTINCT even when applied to a non-distinct column of values. Examples include the min and max functions, which return a distinct value even though the underlying data can have repeating values.