Monday Technology Series

What is Apache Hive?

Apache Hive is a data warehouse infrastructure built on top of Hadoop that allows querying and managing large datasets residing in distributed storage. It provides an SQL-like language called HiveQL with schema on read and transparently converts queries to MapReduce, Tez, or Spark jobs. All of these execution engines can run on Hadoop YARN. HiveQL also gives traditional MapReduce programmers simple extensions for plugging in custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
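
To make "schema on read" and the query-to-job conversion more concrete, here is a minimal Python sketch. It is not Hive code and not how Hive is implemented. It only mimics the idea: raw text is stored as-is, a schema is applied when the data is read, and a query roughly like SELECT country, COUNT(*) ... GROUP BY country is split into a map step and a reduce step. The schema, column names, sample rows and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical schema, applied only when the data is read (schema on read).
SCHEMA = [("user", str), ("country", str), ("amount", float)]


def read_with_schema(line, schema=SCHEMA, delimiter=","):
    """Parse one raw text line; fields that do not fit the schema become None."""
    parts = line.rstrip("\n").split(delimiter)
    row = {}
    for i, (name, cast) in enumerate(schema):
        try:
            row[name] = cast(parts[i])
        except (IndexError, ValueError):
            # Comparable to a NULL for a missing or unparseable field.
            row[name] = None
    return row


def map_phase(rows):
    """Emit (key, 1) pairs, as a mapper for GROUP BY country might."""
    for row in rows:
        yield row["country"], 1


def reduce_phase(pairs):
    """Sum the values per key, as a reducer for COUNT(*) might."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Made-up sample data; the second line has an amount that is not a number.
raw_lines = ["alice,US,10.5", "bob,DE,oops", "carol,US,3"]
rows = [read_with_schema(line) for line in raw_lines]
counts = reduce_phase(map_phase(rows))
```

Note that the bad value in the second line does not stop the load. It only shows up as a missing value when the row is read, which is the practical difference from a schema-on-write database.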

Some of the features of Hive are:

  • It provides indexes to accelerate queries, including bitmap indexing (note that index support was removed in Hive 3.0).
  • It supports different storage types: plain text, RCFile, HBase, ORC, etc.
  • It operates on compressed data stored on HDFS.
  • It provides built-in functions for date, string, and other data operations.
  • It allows users to write their own functionality using User Defined Functions (UDFs).
  • Like traditional databases, Hive has databases and tables.
  • Tables in Hive are similar to tables in a relational database, and data units are organized from larger to more granular units: databases contain tables, which are made up of partitions.
  • Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
  • Hive supports the common primitive data types, such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps, and arrays.
  • Hive supports overwriting or appending data.
  • Hive is scalable and extensible.
  • It works with traditional data integration and data analytics tools.
  • Queries scale to large data sets, although Hive is designed for batch-style analytics rather than low-latency lookups.
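
The table, partition and bucket layout described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the warehouse root, the database, table and column names are hypothetical, and the bucket function uses CRC32 as a stand-in for a deterministic hash, which is not Hive's own hash function. The point is the shape of the idea: each partition maps to a column=value sub-directory of the table directory, and a row's bucket is a hash of the bucketing column modulo the number of buckets.

```python
import zlib
from posixpath import join

# Assumed warehouse root; the real location depends on the cluster configuration.
WAREHOUSE = "/user/hive/warehouse"


def partition_dir(table, partition_spec, database=None):
    """Build the directory for one partition of a table.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one 'column=value' sub-directory under the table directory.
    """
    parts = [WAREHOUSE]
    if database:
        parts.append(f"{database}.db")
    parts.append(table)
    parts.extend(f"{column}={value}" for column, value in partition_spec)
    return join(*parts)


def bucket_for(value, num_buckets):
    """Pick a bucket index for a value: deterministic hash modulo bucket count.

    CRC32 is a stand-in here, NOT the hash function Hive itself uses.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical 'sales' table partitioned by year and month.
path = partition_dir("sales", [("year", 2015), ("month", 6)])
db_path = partition_dir("sales", [("year", 2015)], database="shop")
bucket = bucket_for("alice", 4)
```

Because the partition values are encoded in the directory names, a query that filters on year or month only has to read the matching sub-directories, and rows with the same bucketing value always land in the same bucket.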
