Tuesday Big Data Series

Understanding Pig Data Model

Pig has a simple yet rich data model, which consists of the following four types:

Atom

An atom consists of a single atomic value which can be a string or a number.

Examples – ‘tom’ or 2

Tuple

A tuple is a sequence of fields each of which can be of any datatype.

Examples – (‘tom’, ‘california’) or (‘tom’, 2) or (1000, 21)

Bag

A bag is a collection of tuples. It can contain duplicates. The tuples in a bag need not share a schema: they need not have the same number of fields, and their fields need not be of the same datatype.

Example – {(‘tom’,‘california’)(1000,21)(‘tom’,(‘lake forest’,‘california’))}
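
To make the first three types concrete, here is a rough sketch in plain Python. It is only an analogy and not Pig code: Python strings and numbers stand in for atoms, Python tuples for Pig tuples, and a Python list for a bag (a list is assumed here because it keeps duplicates). The names atom_name, atom_count, bag, field_types and duplicates are made up for this illustration, and the last tuple is an extra copy added to show a duplicate.

```python
# Plain-Python stand-ins for Pig values (an analogy, not Pig itself).
atom_name = "tom"    # atom: a single string
atom_count = 2       # atom: a single number

# Bag: a collection of tuples. A list is used so that duplicates are kept.
bag = [
    ("tom", "california"),
    (1000, 21),
    ("tom", ("lake forest", "california")),  # a field can itself be a tuple
    ("tom", "california"),                   # extra copy: duplicates are allowed
]

# The tuples do not share a schema: field types differ from tuple to tuple.
field_types = [tuple(type(field).__name__ for field in t) for t in bag]

# Count how many entries are repeats of an earlier tuple.
duplicates = len(bag) - len(set(bag))

print(field_types)
print(duplicates)
```

The list holds string pairs, a number pair, and a tuple with a nested tuple side by side, which mirrors the flexible schema described above.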

Map

A map is a collection of data items, where each item has an associated key through which it can be looked up. Each item is a key-value pair, written [K->V]. Like a bag, a map has a flexible schema, i.e. the values in a map need not all be of the same datatype, and a map can hold any number of items. However, for efficient lookups, the keys of a map are required to be atoms.

Examples – [‘age’->10] or [‘likes’->{(‘chocolates’)(‘movies’)}]

where ‘age’ and ‘likes’ are the keys.
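
A map behaves much like a dictionary in a general-purpose language. The sketch below uses a Python dict as a stand-in, again as an analogy and not Pig code. The name profile is hypothetical, and the two example maps from above are merged into one dict to show values of different types living under different keys.

```python
# A Python dict as a stand-in for a Pig map (an analogy, not Pig itself).
# Keys are atoms (strings here); values can be of any type, including a bag.
profile = {
    "age": 10,                                # value is an atom
    "likes": [("chocolates",), ("movies",)],  # value is a bag of one-field tuples
}

age = profile["age"]      # look up the value stored under the key 'age'
likes = profile["likes"]  # look up the bag stored under the key 'likes'

print(age)
print(likes)
```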

Let us look at an example to see how expressions are evaluated.

Consider a tuple t = (‘tom’, {(‘california’,2)(‘michigan’,10)},[‘age’->10])

Let fields of tuple t be f1,f2,f3.

For the above tuple t, let us see what some expressions evaluate to:

Field by position: $0 => ‘tom’

Field by name: f3 => [‘age’->10]

Projection: f2.$0 => {(‘california’)(‘michigan’)}

Map lookup: f3#‘age’ => 10

Function evaluation: SUM(f2.$1) => 2 + 10 = 12

Conditional expression: f3#‘age’ > 18 ? ‘adult’ : ‘minor’ => ‘minor’

Flattening: FLATTEN(f2) => ‘california’, 2 and ‘michigan’, 10 (one output row per tuple in the bag)
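
The same evaluations can be mimicked in plain Python as a sanity check. This is a sketch by analogy and not Pig code: the tuple t is modelled with a Python tuple, a list for the bag and a dict for the map, and all variable names other than t, f1, f2 and f3 are made up for this illustration. The flattening line assumes FLATTEN(f2) is generated alongside f1, so f1 is repeated on every output row.

```python
# Tuple t from the text, with fields f1, f2, f3 (a Python analogy, not Pig).
t = ("tom", [("california", 2), ("michigan", 10)], {"age": 10})
f1, f2, f3 = t

by_position = t[0]                              # $0
by_name = f3                                    # f3
projection = [(row[0],) for row in f2]          # f2.$0
map_lookup = f3["age"]                          # f3#'age'
total = sum(row[1] for row in f2)               # SUM(f2.$1)
label = "adult" if f3["age"] > 18 else "minor"  # f3#'age' > 18 ? 'adult' : 'minor'

# FLATTEN(f2), assumed to be generated together with f1:
# one output row per tuple in the bag, with f1 repeated on each row.
flattened = [(f1,) + row for row in f2]

print(by_position, map_lookup, total, label)
print(projection)
print(flattened)
```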
