Pig is an open source scripting platform from Apache used for analyzing and processing large data sets. It allows users to express complex MapReduce jobs in a simple scripting language called Pig Latin. Pig translates a Pig Latin script into a series of MapReduce jobs that execute on YARN against data stored in the Hadoop Distributed File System (HDFS). One of the important properties of Pig scripts is that they promote parallelization, which enables Pig to handle very large data sets easily.
Pig consists of a compiler that produces a sequence of MapReduce programs, for which large-scale parallel implementations already exist. Pig Latin provides built-in functions and operators that can be used to express different operations concisely on top of those underlying MapReduce implementations.
For example, suppose you want to perform a JOIN or a COUNT operation. A MapReduce implementation of JOIN and COUNT that supports parallel processing already exists, and Pig exposes it through the built-in JOIN operator and COUNT function. So, in a Pig script, you can perform the same operation in a single line of code, without worrying about the internal workings or the parallelization.
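As a brief sketch, the JOIN and COUNT operations described above might look like this in Pig Latin (the file names and schemas here are hypothetical, chosen purely for illustration):

```pig
-- Load two hypothetical comma-separated data sets from HDFS
users  = LOAD 'users.csv'  USING PigStorage(',') AS (user_id:int, name:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (order_id:int, user_id:int, amount:double);

-- JOIN in a single line; Pig runs the parallel MapReduce join for us
joined = JOIN users BY user_id, orders BY user_id;

-- COUNT the orders per user: group first, then count each bag
grouped = GROUP orders BY user_id;
counts  = FOREACH grouped GENERATE group AS user_id, COUNT(orders) AS num_orders;

DUMP counts;
```

The script reads as a sequence of named data-flow steps; Pig decides how to compile them into MapReduce jobs and how to parallelize each stage.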
At present, the latest available version of Pig is 0.15.0. Some of the key properties that make Pig so useful are as follows:
- It supports structured, semi-structured, and unstructured data. The raw data on which a Pig script runs need not adhere to any pre-defined schema; Pig provides ways to handle such unstructured data.
- It provides ease of programming. Parallelism is easy to achieve without handling it programmatically. Complex tasks comprising multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- It supports various optimization techniques that are applied automatically when a Pig script runs, so the user can concentrate on semantics rather than efficiency.
- Pig’s multi-query approach combines certain types of operations together in a single pipeline, reducing the number of times data is scanned.
- The latest release of Pig (version 0.15.0) supports calling Hive user-defined functions from within a Pig script, which allows reuse of existing code.
- It is used extensively for data pipelines and iterative data processing.
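The data-flow style and the multi-query optimization mentioned above can be sketched as follows (again with hypothetical input and output paths). Because both branches read the same relation, Pig's multi-query execution can share a single scan of the input:

```pig
-- One input, two output pipelines in a single script
logs   = LOAD 'access_log' USING PigStorage('\t')
         AS (ip:chararray, url:chararray, status:int);

-- Two filters over the same relation; Pig can evaluate both
-- in one pass over 'logs' instead of scanning the data twice
ok     = FILTER logs BY status == 200;
errors = FILTER logs BY status >= 500;

STORE ok     INTO 'ok_requests';
STORE errors INTO 'error_requests';
```

Submitted as one script, both STORE statements become part of a single execution plan, which is where the reduction in data scans comes from.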
Pig is a very powerful scripting language that makes processing large data sets easy without actually writing complex MapReduce programs. That way, developers can concentrate on analyzing the data rather than spending time writing complex programs to perform common operations.
To learn more about Pig, please go through the official page of Apache Pig.