Apache Hive is a data warehouse infrastructure built on top of Hadoop that allows querying and managing large datasets residing in distributed storage. It provides an SQL-like language called HiveQL with schema-on-read, and it transparently converts queries into MapReduce, Tez, or Spark jobs, all of which run on Hadoop YARN. HiveQL also offers traditional map/reduce programmers simple extensions for plugging in custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL.
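As a sketch of that custom mapper/reducer extension, HiveQL's `TRANSFORM` clause streams rows through an external script. The table, columns, and script path below are hypothetical:

```sql
-- Ship a custom script to the cluster nodes (path is hypothetical).
ADD FILE /tmp/my_mapper.py;

-- Stream each row of a hypothetical web_logs table through the script,
-- which acts as a custom mapper emitting tab-separated output columns.
SELECT TRANSFORM (ip, url)
       USING 'python my_mapper.py'
       AS (domain STRING, hits INT)
FROM web_logs;
```

Hive serializes the selected columns to the script's standard input and parses its standard output back into the declared columns, so any executable that reads and writes tab-separated lines can serve as a mapper or reducer.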
Some of the features of Hive are:
- It provides indexes, including bitmap indexes, to accelerate queries.
- It supports different storage types: plain text, RCFile, HBase, ORC, etc.
- It operates on compressed data stored on HDFS.
- It offers built-in functions for date, string, and other data operations.
- It allows users to extend this functionality with User-Defined Functions (UDFs).
- Like traditional databases, Hive has databases and tables.
- Tables in Hive are similar to tables in a relational database, and data units are organized from larger to more granular units: databases contain tables, which are made up of partitions.
- Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
- Hive supports all the common primitive data types such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps, and arrays.
- Hive supports overwriting or appending data.
- Hive is scalable and extensible.
- It works with traditional data integration and data analytics tools.
- Queries remain fast even over large data sets, since execution is distributed across the cluster.
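A UDF like the one mentioned above is typically packaged in a jar and registered before use; the jar path, class name, function name, and table below are all hypothetical:

```sql
-- Register a custom UDF implemented in Java (path and class are hypothetical).
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrlUDF';

-- Use it like any built-in function.
SELECT normalize_url(url) FROM pages;
```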
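The storage layout described above (databases containing tables, sub-divided into partitions and then buckets) and the complex types can be sketched in a single DDL statement. All names, the partition column, and the bucket count are illustrative assumptions:

```sql
CREATE DATABASE IF NOT EXISTS analytics;

-- Data is partitioned by date (one HDFS sub-directory per day under the
-- table directory) and bucketed by user_id into 32 files per partition.
CREATE TABLE IF NOT EXISTS analytics.page_views (
    user_id  BIGINT,
    url      STRING,
    referrer STRING,
    -- Complex types built from primitives:
    device   STRUCT<os: STRING, browser: STRING>,
    params   MAP<STRING, STRING>,
    tags     ARRAY<STRING>
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;   -- one of the supported storage formats
```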
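The overwrite-or-append behavior amounts to two variants of the same statement; the table and column names here are assumptions for illustration:

```sql
-- Append rows to a hypothetical events table.
INSERT INTO TABLE events
SELECT id, payload FROM staging_events;

-- Replace the table's existing contents instead of appending.
INSERT OVERWRITE TABLE events
SELECT id, payload FROM staging_events;
```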