Tuesday Big Data Series

Hive or Pig?

Which one of them is your favorite – Hive or Pig? What do you prefer to work with?

People are often unsure when to use Hive and when to use Pig. In most cases either one can be used, which raises the question of why both exist in the first place. Hive feels more familiar because of its similarity to SQL, and it is the more popular choice, especially among people who have used SQL. That should not be the only reason to choose Hive over Pig, though. Since the Hadoop ecosystem offers both tools, the choice may come down to what you prefer. But to understand the difference better, let us look at some use cases where each of them fits best.

Use Case #1: ETL Scenarios

As the name suggests, ETL (Extract, Transform, Load) means that the data is extracted or gathered from some source, cleaned and conformed to a model, transformed or joined with other data sources, and then loaded into its destination. Pig fits such use cases best.
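To make the shape of such a job concrete, here is a minimal sketch in plain Python. It is not Pig Latin; it only mimics the step-by-step dataflow style that Pig scripts follow. The sample data and every function name are hypothetical, chosen for illustration.

```python
import csv
import io

# Hypothetical raw input; a real job would read from HDFS or another source.
RAW_ORDERS = """order_id,customer_id,amount
1,c1, 20.50
2,c2,
3,c1,5
"""

# Hypothetical second data source to join against.
CUSTOMERS = {"c1": "Alice", "c2": "Bob"}


def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop rows with no amount and convert the amount to float."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if amount:
            cleaned.append({**row, "amount": float(amount)})
    return cleaned


def join_customers(rows, customers):
    """Transform: attach the customer name from the second source."""
    return [{**row, "customer": customers[row["customer_id"]]} for row in rows]


def load(rows, target):
    """Load: append the finished rows to the target store (a list here)."""
    target.extend(rows)
    return target


warehouse = []
load(join_customers(clean(extract(RAW_ORDERS)), CUSTOMERS), warehouse)
```

Each function takes the output of the previous one, which is roughly how a Pig script names one relation per step.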

Use Case #2: Pipelines/Logs

A common example of a pipeline is a server log, or any kind of log where data is constantly appended. These logs may contain user information, and often only a part of each entry is needed to extract something meaningful. For such cases, Pig is the perfect tool.
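As a rough illustration of pulling one useful piece out of each log entry, the following Python sketch parses lines and counts hits per user. The log format, the regular expression, and the function names are assumptions made for this example, and Python stands in for what would be a Pig script on a real cluster.

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> user=<name> path=<url path>"
LOG_LINE = re.compile(r"^(?P<ts>\S+) user=(?P<user>\S+) path=(?P<path>\S+)$")


def parse(lines):
    """Yield (user, path) for every line that matches the assumed format."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            yield match.group("user"), match.group("path")


def hits_per_user(lines):
    """Count how many well-formed log entries belong to each user."""
    return Counter(user for user, _ in parse(lines))


sample = [
    "2015-03-10T10:00:00 user=alice path=/home",
    "2015-03-10T10:00:05 user=bob path=/cart",
    "malformed line",
    "2015-03-10T10:00:09 user=alice path=/checkout",
]
counts = hits_per_user(sample)
```

Malformed lines are skipped rather than raising an error, which is a common choice for logs that keep growing.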

Use Case #3: Iterative Processing

Assume that you have a large data set in place that has already been transformed into the desired format. Now small sets of data keep coming in that can change the state of the large data set. For example, you might want to join each small data set with the large one, and this needs to be done every time a new small data set flows in.

This calls for incremental processing and for storing intermediate results. Re-running the process from the start every time a new data set is introduced will only increase the processing time. Such incremental operations can be easily implemented with Pig Latin, which also supports storing intermediate data as required.
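The idea of keeping an intermediate result and folding each new batch into it can be sketched in a few lines of Python. This is only a stand-in: a Pig script would express the same thing with LOAD, JOIN and STORE on HDFS. The file name, the JSON format, and the function names are hypothetical.

```python
import json
import os
import tempfile


def load_state(path):
    """Read the stored intermediate result, or start empty on the first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def apply_batch(state, batch):
    """Fold a small batch of (key, value) pairs into the running totals."""
    for key, value in batch:
        state[key] = state.get(key, 0) + value
    return state


def save_state(state, path):
    """Store the updated result so the next run does not start from scratch."""
    with open(path, "w") as f:
        json.dump(state, f)


def process_increment(path, batch):
    """Load the previous result, apply only the new batch, and store it again."""
    state = apply_batch(load_state(path), batch)
    save_state(state, path)
    return state


state_path = os.path.join(tempfile.mkdtemp(), "totals.json")
first = process_increment(state_path, [("alice", 3), ("bob", 1)])
second = process_increment(state_path, [("alice", 2)])
```

The second call touches only the new batch and the stored totals; the first batch is never reprocessed.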

Use Case #4: Research on large and very large data

Generally, engineers and scientists doing research work with very large data sets. They might just want to run some jobs on the data to understand it better and gain insights from it. Also, since Pig supports user-defined functions in languages such as Java and Python, and can stream data through external scripts written in languages such as Perl, it is easier to reuse existing scripts from within a Pig script and run them against the large data set.

Use Case #5: Data Warehousing Scenarios

A data warehouse stores data that is ready to be used. It generally does not need any transformations applied and can be used directly. For such cases, where the data is available and can readily be used for analysis, Hive is the better option.

Use Case #6: Ad hoc queries

For reasons similar to those in use case #5, it is easier to fire off a query using Hive. For example, if you need to calculate a count or get an average, it is easier to do with Hive. And its similarity to SQL makes it all the easier to use for running ad hoc queries.
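To show what such a count-and-average query looks like, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Hive. The table, columns, and data are hypothetical, and SQLite is not Hive. The point is only that the query uses plain SQL aggregate syntax, which is the style HiveQL resembles.

```python
import sqlite3

# In-memory SQLite database standing in for a table that is ready to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_name TEXT, duration_sec REAL)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("alice", 10.0), ("bob", 20.0), ("alice", 30.0)],
)

# The ad hoc question: how many views are there, and how long do they last on average?
query = "SELECT COUNT(*), AVG(duration_sec) FROM page_views"
total, average = conn.execute(query).fetchone()
```

A single declarative statement answers the question, with no intermediate steps to write out.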

I hope this article answers your questions and helps you decide which tool best fits your use case.
