Pig has a simple yet rich data model, which consists of the following four types:
Atom
An atom consists of a single atomic value which can be a string or a number.
Examples – ‘tom’ or 2
Tuple
A tuple is a sequence of fields each of which can be of any datatype.
Examples – (‘tom’, ‘california’) or (‘tom’, 2) or (1000, 21)
Bag
A bag is a collection of tuples, and it can contain duplicates. The tuples in a bag need not share a schema: they may have different numbers of fields, and corresponding fields may be of different datatypes.
Examples – {(‘tom’, ‘california’)(1000, 21)(‘tom’, (‘lake forest’, ‘california’))}
Map
A map is a collection of data items, where each item has an associated key through which it can be looked up. Each item is a key-value pair, written [K->V]. Like a bag, a map has a flexible schema, i.e. the items in a map need not share a datatype or structure. However, to keep lookups efficient, the keys of a map are required to be atoms.
Example – [‘age’->10] or [‘likes’->{(‘chocolates’)(‘movies’)}]
where ‘age’ and ‘likes’ are the keys.
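As a rough analogy only (this is Python, not Pig Latin), the four types can be mimicked with built-in Python values. All variable names below are hypothetical, and a Python list stands in for a bag because, like a bag, it permits duplicates and tuples of different shapes.

```python
# Atoms: single atomic values (a string or a number).
atom_name = "tom"
atom_count = 2

# Tuple: a sequence of fields, each of which can be of any datatype.
person = ("tom", "california")

# Bag: a collection of tuples; duplicates and mixed schemas are allowed.
# A Python list is only a stand-in for a Pig bag here.
bag = [
    ("tom", "california"),
    (1000, 21),
    ("tom", ("lake forest", "california")),  # a field may itself be a tuple
    ("tom", "california"),                   # a duplicate tuple is permitted
]

# Map: atom keys mapped to values of any type, including bags.
profile = {
    "age": 10,
    "likes": [("chocolates",), ("movies",)],
}

print(len(bag), profile["age"], profile["likes"])
```

Looking up `profile["age"]` plays the role of a map lookup by key, and the nested list under `"likes"` shows that a map value can itself be a bag.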
Let us look at an example to see how expressions are evaluated.
Consider a tuple t = (‘tom’, {(‘california’,2)(‘michigan’,10)},[‘age’->10])
Let the fields of tuple t be named f1, f2 and f3.
For the tuple t above, here is what each kind of expression evaluates to:
Field by position: $0 => ‘tom’
Field by name: f3 => [‘age’->10]
Projection: f2.$0 => {(‘california’)(‘michigan’)}
Map lookup: f3#‘age’ => 10
Function evaluation: SUM(f2.$1) => 2 + 10 = 12
Conditional expression: f3#‘age’ > 18 ? ‘adult’ : ‘minor’ => ‘minor’
Flattening: FLATTEN(f2) => ‘california’, 2 and ‘michigan’, 10 (the bag nesting is removed)
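The evaluations above can be mirrored in a short Python sketch. This is an analogy, not how Pig itself executes expressions: the tuple t is modelled as a Python tuple, the bag as a list and the map as a dict. The helper names `project` and `flatten` are hypothetical, and `flatten` is simplified to collect every field into one flat list.

```python
# Model t = ('tom', {('california', 2)('michigan', 10)}, ['age' -> 10]).
t = ("tom", [("california", 2), ("michigan", 10)], {"age": 10})
f1, f2, f3 = t  # the field names assumed in the text


def project(bag, index):
    """Keep only the field at `index` from every tuple in the bag."""
    return [(row[index],) for row in bag]


def flatten(bag):
    """Simplified un-nesting: collect the fields of every tuple in order."""
    return [field for row in bag for field in row]


by_position = t[0]                                 # $0
by_name = f3                                       # f3
projection = project(f2, 0)                        # f2.$0
lookup = f3["age"]                                 # f3#'age'
total = sum(value for (value,) in project(f2, 1))  # SUM(f2.$1)
label = "adult" if f3["age"] > 18 else "minor"     # f3#'age' > 18 ? ...
flat = flatten(f2)                                 # FLATTEN(f2)

print(by_position, by_name, projection, lookup, total, label, flat)
```

Note that projection returns a bag of one-field tuples rather than bare values, which is why the sum unpacks each single-element tuple before adding.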