Why do we need Oozie?
The Hadoop stack consists of a variety of tools such as Pig, MapReduce, Hive, HBase, and Sqoop. When dealing with large data sets, we often have to combine several of these technologies with plain old Java, Python, Perl, or shell scripts, depending on the requirement at hand. For example, we might use a Pig script to transform some data and then a shell script to act on the result. Since we are dealing with Big Data, we need an automated system that controls the flow of jobs: a mechanism that determines when a job should start, what data it should use as input, where its output should be stored, in what order the jobs should run, and how frequently they should run.
What is Oozie?
Oozie is an open source workflow scheduling system for managing Hadoop jobs. It is implemented as a Java web application that runs in a Java servlet container, and it is distributed under the Apache License 2.0. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie can also schedule system-specific jobs, such as Java programs or shell scripts. Oozie is similar to cron in Unix, but with a wider range of possibilities.
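To make "one logical unit of work" more concrete: Oozie workflow definitions are written in XML, where each action names the node to run next on success and on error. The following is a minimal sketch, not a definitive implementation, that uses Python's standard library to generate such a definition for a chain of Pig scripts. The function name `build_sequential_workflow`, the action names `pig-1`, `pig-2`, the script names, and the `${jobTracker}` / `${nameNode}` parameters are all assumptions made for illustration; the schema version shown may differ from the one your Oozie installation expects.

```python
import xml.etree.ElementTree as ET


def build_sequential_workflow(name, scripts):
    """Build a workflow definition that runs one Pig action per script, in order.

    Hypothetical helper for illustration: each action moves to the next one
    on success and to a shared kill node on error.
    """
    if not scripts:
        raise ValueError("at least one script is required")

    # The xmlns value is an assumed schema version; adjust it to your installation.
    root = ET.Element("workflow-app",
                      {"xmlns": "uri:oozie:workflow:0.5", "name": name})
    action_names = [f"pig-{i}" for i in range(1, len(scripts) + 1)]

    # The start node points at the first action in the chain.
    ET.SubElement(root, "start", {"to": action_names[0]})

    for index, (action_name, script) in enumerate(zip(action_names, scripts)):
        action = ET.SubElement(root, "action", {"name": action_name})
        pig = ET.SubElement(action, "pig")
        # ${...} values are workflow parameters resolved when the job is submitted.
        ET.SubElement(pig, "job-tracker").text = "${jobTracker}"
        ET.SubElement(pig, "name-node").text = "${nameNode}"
        ET.SubElement(pig, "script").text = script

        is_last = index == len(action_names) - 1
        next_node = "end" if is_last else action_names[index + 1]
        ET.SubElement(action, "ok", {"to": next_node})
        ET.SubElement(action, "error", {"to": "fail"})

    # Any failing action ends the whole workflow here.
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = (
        "Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}"
    )
    ET.SubElement(root, "end", {"name": "end"})
    return root


if __name__ == "__main__":
    # Hypothetical script names, used only to show the generated XML.
    workflow = build_sequential_workflow("example-wf",
                                         ["clean.pig", "aggregate.pig"])
    ET.indent(workflow)  # pretty-printing; requires Python 3.9+
    print(ET.tostring(workflow, encoding="unicode"))
```

Running the script prints a workflow definition in which `pig-1` hands over to `pig-2` on success, `pig-2` hands over to `end`, and either action falls through to the `fail` node on error. In practice such a file is usually written by hand; generating it here simply keeps the example short and checkable.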
What can Oozie do?
- Workflows in Oozie are defined as a collection of control flow and action nodes arranged in a directed acyclic graph. Control flow nodes define the beginning and the end of a workflow (start, end and kill (failure) nodes) and provide a mechanism to control the workflow execution path (decision, fork and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation or processing task.
- It can schedule jobs. The user can set the time and date at which a job should start and end, as well as how frequently the job should run.
- It can run jobs either sequentially or in parallel.
- It supports decision nodes (similar to if (condition) then <do this> else <do that>).
- Oozie provides support for different types of actions, including Hadoop MapReduce, Hadoop Distributed File System (HDFS) operations, Pig, SSH, and email. Oozie can also be extended to support additional, custom types of actions.
- Oozie workflows can be parameterized to define input paths and output paths on HDFS.
- It supports an email action. A user can be notified by email when a job has completed, or alerted when it fails. This is useful for monitoring and can help the ops team.
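The control flow described in the list above can be sketched as a small graph walker. This is a conceptual model only, assuming a dictionary-based node format invented for this example; it is not Oozie's API, and where Oozie runs the branches of a fork concurrently, the sketch runs them one after another. The names `run_workflow`, `example_nodes`, and `output_rows` are hypothetical.

```python
def run_workflow(nodes, params, current="start"):
    """Walk the workflow graph and return (name of final node, visited node names).

    A walk stops at an end node, a kill node, or a join node (the latter so
    that a fork can collect its branches before continuing).
    """
    visited = []
    while True:
        node = nodes[current]
        kind = node["type"]
        visited.append(current)

        if kind in ("end", "kill", "join"):
            return current, visited
        if kind == "start":
            current = node["to"]
        elif kind == "action":
            # An action reports success or failure and picks the transition.
            current = node["ok"] if node["run"](params) else node["error"]
        elif kind == "decision":
            # First matching case wins; otherwise take the default transition.
            current = next(
                (target for test, target in node["cases"] if test(params)),
                node["default"],
            )
        elif kind == "fork":
            join_name = None
            for branch_start in node["paths"]:
                stopped_at, branch_visited = run_workflow(nodes, params, branch_start)
                if nodes[stopped_at]["type"] != "join":
                    # A branch reached kill (or end): the whole workflow stops.
                    visited.extend(branch_visited)
                    return stopped_at, visited
                visited.extend(branch_visited[:-1])  # record the join only once
                join_name = stopped_at
            visited.append(join_name)
            current = nodes[join_name]["to"]
        else:
            raise ValueError(f"unknown node type: {kind}")


# Hypothetical workflow: two transformations in parallel, then notify by
# email only if some output was produced.
example_nodes = {
    "start": {"type": "start", "to": "split"},
    "split": {"type": "fork", "paths": ["transform-a", "transform-b"]},
    "transform-a": {"type": "action", "run": lambda p: True,
                    "ok": "merge", "error": "fail"},
    "transform-b": {"type": "action", "run": lambda p: True,
                    "ok": "merge", "error": "fail"},
    "merge": {"type": "join", "to": "has-output"},
    "has-output": {"type": "decision",
                   "cases": [(lambda p: p["output_rows"] > 0, "notify")],
                   "default": "end"},
    "notify": {"type": "action", "run": lambda p: True,
               "ok": "end", "error": "fail"},
    "end": {"type": "end"},
    "fail": {"type": "kill"},
}

if __name__ == "__main__":
    final, path = run_workflow(example_nodes, {"output_rows": 10})
    print(final, path)
```

With `output_rows` set to 10 the walk passes through `notify` before reaching `end`; with 0 the decision node skips straight to `end`. If either transformation reports failure, the walk ends at the `fail` node without reaching the join.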
More details about the features supported by the latest version of Oozie can be found in the official Apache Oozie documentation.