The Apache Tez project is an extensible framework built on top of Apache Hadoop YARN for building high-performance batch and interactive data processing applications. Work that previously required multiple MapReduce (MR) jobs can run as a single Tez job, which models the processing as a directed acyclic graph (DAG). This substantially improves processing speed while retaining MapReduce's ability to scale to petabytes of data. Apache Hive and Apache Pig both use Apache Tez.
The two main design themes for Tez are:
- Empowering end users by:
  - Expressive dataflow definition APIs
  - Flexible Input-Processor-Output runtime model
  - Data type agnostic
  - Simplifying deployment
- Execution performance:
  - Performance gains over MapReduce
  - Optimal resource management
  - Plan reconfiguration at runtime
  - Dynamic physical data flow decisions
Tez has a simple Java API with three components:
- DAG – The user creates a DAG object for each data processing job.
- Vertex – This defines the user logic and the resources and environment needed to execute it. The user creates a Vertex object for each step in the job and adds it to the DAG.
- Edge – This defines the connection between a producer vertex and a consumer vertex. The user creates an Edge object and uses it to connect the two vertices.
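The relationship between these three components can be illustrated with a small self-contained model. This is only a sketch: the classes `ToyDag`, `ToyVertex` and `ToyEdge`, the `buildWordCountDag` helper and the processor class names are hypothetical stand-ins invented for this example and are not the real `org.apache.tez` API, which takes additional descriptors and configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every class here is hypothetical and is NOT the org.apache.tez API.
public class ToyDagExample {

    /** One step of the job: a name, the class holding the user logic, and a task count. */
    static final class ToyVertex {
        final String name;
        final String processorClass;
        final int parallelism;

        ToyVertex(String name, String processorClass, int parallelism) {
            this.name = name;
            this.processorClass = processorClass;
            this.parallelism = parallelism;
        }
    }

    /** A connection that moves data from a producer vertex to a consumer vertex. */
    static final class ToyEdge {
        final ToyVertex producer;
        final ToyVertex consumer;

        ToyEdge(ToyVertex producer, ToyVertex consumer) {
            this.producer = producer;
            this.consumer = consumer;
        }
    }

    /** One data processing job: a named set of vertices plus the edges between them. */
    static final class ToyDag {
        final String name;
        final Map<String, ToyVertex> vertices = new LinkedHashMap<>();
        final List<ToyEdge> edges = new ArrayList<>();

        ToyDag(String name) {
            this.name = name;
        }

        ToyDag addVertex(ToyVertex v) {
            if (vertices.putIfAbsent(v.name, v) != null) {
                throw new IllegalArgumentException("Duplicate vertex: " + v.name);
            }
            return this;
        }

        ToyDag addEdge(ToyEdge e) {
            // An edge may only connect vertices that already belong to this DAG.
            if (!vertices.containsKey(e.producer.name) || !vertices.containsKey(e.consumer.name)) {
                throw new IllegalArgumentException("Add both vertices before adding the edge");
            }
            edges.add(e);
            return this;
        }
    }

    /** Builds a two-step job: a tokenizer step that feeds a summation step. */
    static ToyDag buildWordCountDag() {
        ToyVertex tokenizer = new ToyVertex("Tokenizer", "example.TokenProcessor", 4);
        ToyVertex summation = new ToyVertex("Summation", "example.SumProcessor", 2);
        return new ToyDag("WordCount")
                .addVertex(tokenizer)
                .addVertex(summation)
                .addEdge(new ToyEdge(tokenizer, summation));
    }

    public static void main(String[] args) {
        ToyDag dag = buildWordCountDag();
        for (ToyEdge e : dag.edges) {
            System.out.println(e.producer.name + " -> " + e.consumer.name); // prints "Tokenizer -> Summation"
        }
    }
}
```

The sketch follows the order described above: one DAG object per job, one vertex per step added to the DAG, and an edge created last to connect a producer to a consumer.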
How does Tez work?
- Tez models data processing as a data flow graph, where the graph vertices represent the application logic and the edges represent the movement of data.
- It has a rich data flow definition API that allows users to express complex query logic.
- The user logic running in each vertex of the data flow graph is a composition of Input, Processor and Output modules. The processor holds the data transformation logic while the input and output determine the data format and how and where the data is read or written. Tez does not impose any data format and only requires that Input, Processor and Output formats are compatible with each other.
- Tez includes support for pluggable vertex management modules to collect runtime information and change the dataflow graph dynamically to optimize performance and resource utilization.
- The Tez execution engine efficiently acquires resources from YARN and reuses every component in the pipeline, so that no operation is duplicated unnecessarily.
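The Input, Processor and Output composition described above can be sketched in a few lines of plain Java. The interfaces `ToyInput`, `ToyProcessor` and `ToyOutput` and the helpers `runTask` and `countWords` are hypothetical names chosen for this illustration, not Tez's own runtime interfaces. An in-memory list stands in for the edge between the two vertices, and everything runs in a single thread, whereas Tez would run the tasks in YARN containers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: these interfaces are hypothetical and are NOT Tez's runtime API.
public class ToyIpoExample {

    /** Decides where records come from and in what form. */
    interface ToyInput<T> {
        Iterable<T> read();
    }

    /** Decides where records go and in what form. */
    interface ToyOutput<T> {
        void write(T record);
    }

    /** Holds only the data transformation logic. */
    interface ToyProcessor<I, O> {
        void run(ToyInput<I> in, ToyOutput<O> out);
    }

    /** A task is the composition of one input, one processor and one output. */
    static <I, O> void runTask(ToyInput<I> in, ToyProcessor<I, O> processor, ToyOutput<O> out) {
        processor.run(in, out);
    }

    /** Runs a two-vertex word count, passing data along an in-memory "edge". */
    static Map<String, Integer> countWords(List<String> lines) {
        // Vertex 1: split each line into words.
        ToyProcessor<String, String> tokenizer = (in, out) -> {
            for (String line : in.read()) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.write(word);
                    }
                }
            }
        };
        List<String> edgeBuffer = new ArrayList<>(); // stands in for the edge
        runTask(() -> lines, tokenizer, edgeBuffer::add);

        // Vertex 2: count occurrences of each word and emit (word, count) pairs.
        ToyProcessor<String, Map.Entry<String, Integer>> summation = (in, out) -> {
            Map<String, Integer> local = new TreeMap<>();
            for (String word : in.read()) {
                local.merge(word, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> entry : local.entrySet()) {
                out.write(entry);
            }
        };
        Map<String, Integer> counts = new TreeMap<>();
        runTask(() -> edgeBuffer, summation, e -> counts.put(e.getKey(), e.getValue()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("to be or", "not to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

Neither processor knows where its records come from or where they go. The only requirement is that the record types of the input, processor and output line up, which mirrors the compatibility constraint described above.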