HDFS has a master/slave architecture built from two kinds of nodes: a NameNode (which acts as the master) and DataNodes (which act as slaves). NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is written in Java, which makes the software highly portable and allows it to be deployed on a wide range of machines.
The NameNode and its responsibilities
- As the architecture diagram above shows, HDFS has a single NameNode that acts as the master: it manages the file system namespace and regulates how clients access files and data.
- The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
- The NameNode is the arbitrator and stores all the HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
- The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
- The NameNode periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains the list of all blocks stored on that DataNode.
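The responsibilities above can be sketched as a toy in-memory model. This is not Hadoop code: all class, field, and method names below are illustrative assumptions, chosen only to show the three kinds of metadata the NameNode tracks (the namespace, the block-to-DataNode mapping, and heartbeats).

```java
import java.util.*;

// Toy sketch (not the real HDFS source) of NameNode metadata:
// namespace (file path -> block IDs), block map (block ID -> replica
// DataNodes), and the last heartbeat seen from each DataNode.
public class NameNodeSketch {
    static final int DEFAULT_REPLICATION = 3;                    // replication factor

    static Map<String, List<Long>> namespace = new HashMap<>();  // file path -> block IDs
    static Map<Long, Set<String>> blockMap = new HashMap<>();    // block ID -> DataNode IDs
    static Map<String, Long> lastHeartbeat = new HashMap<>();    // DataNode ID -> timestamp
    static long nextBlockId = 0;

    // Namespace operation: create a file entry and map its blocks to DataNodes.
    static void createFile(String path, int numBlocks, List<String> dataNodes) {
        List<Long> blocks = new ArrayList<>();
        for (int i = 0; i < numBlocks; i++) {
            long id = nextBlockId++;
            blocks.add(id);
            // Pick DEFAULT_REPLICATION distinct DataNodes for this block
            // (round-robin here; real placement is rack-aware).
            Set<String> replicas = new LinkedHashSet<>();
            for (int r = 0; r < DEFAULT_REPLICATION; r++)
                replicas.add(dataNodes.get((i + r) % dataNodes.size()));
            blockMap.put(id, replicas);
        }
        namespace.put(path, blocks);
    }

    // Heartbeat handling: record that this DataNode is alive.
    static void heartbeat(String dataNode) {
        lastHeartbeat.put(dataNode, System.currentTimeMillis());
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("dn1", "dn2", "dn3", "dn4");
        createFile("/logs/app.log", 2, nodes);
        nodes.forEach(NameNodeSketch::heartbeat);
        System.out.println("blocks of /logs/app.log = " + namespace.get("/logs/app.log"));
        System.out.println("replicas of block 0     = " + blockMap.get(0L));
    }
}
```

Note that only metadata lives here: in line with the design point above, the actual file bytes never pass through this component.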
The DataNode and its responsibilities
- There are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
- The DataNodes are responsible for serving read and write requests from the file system’s clients.
- The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
- The DataNode stores HDFS data in files in its local file system but has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system.
- Computation is moved to the data instead of moving the data to the computation: a task is processed on the DataNode where the data it needs actually resides.
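The storage behaviour described above can also be sketched. Again this is not Hadoop code: the class name, the `blk_<id>` file naming, and the on-disk layout are illustrative assumptions, meant only to show that each block is just an ordinary file in the DataNode's local file system, with no knowledge of which HDFS file it belongs to.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Toy sketch (not the real HDFS format) of a DataNode's local storage:
// each HDFS block is stored as a separate file named by its block ID.
public class DataNodeSketch {
    private final Path storageDir;   // local directory backing this DataNode

    DataNodeSketch(Path storageDir) throws IOException {
        this.storageDir = Files.createDirectories(storageDir);
    }

    // Write one block, e.g. upon instruction from the NameNode.
    void writeBlock(long blockId, byte[] data) throws IOException {
        Files.write(storageDir.resolve("blk_" + blockId), data);
    }

    // Serve a client's read request for a block.
    byte[] readBlock(long blockId) throws IOException {
        return Files.readAllBytes(storageDir.resolve("blk_" + blockId));
    }

    // Build a Blockreport: the IDs of every block stored locally.
    List<Long> blockReport() throws IOException {
        List<Long> ids = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(storageDir, "blk_*")) {
            for (Path f : files)
                ids.add(Long.parseLong(f.getFileName().toString().substring(4)));
        }
        Collections.sort(ids);
        return ids;
    }

    public static void main(String[] args) throws IOException {
        DataNodeSketch dn = new DataNodeSketch(Files.createTempDirectory("dn1"));
        dn.writeBlock(0, "first block".getBytes());
        dn.writeBlock(1, "second block".getBytes());
        System.out.println("block report: " + dn.blockReport());
        System.out.println("block 0: " + new String(dn.readBlock(0)));
    }
}
```

The sketch makes the division of labour concrete: the DataNode only knows block IDs and bytes; reassembling blocks into files is the NameNode's metadata problem.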