HDFS, the Hadoop Distributed File System, is a Java-based distributed file system for storing large volumes of data in the Hadoop framework.
When we deal with enormous amounts of data, the first questions that come to mind are where and how to store it. We know that every single bit of data is important, and analysis of historical data can yield valuable insights into a target audience and its behavior. Hence, it is essential that the big data solution or framework we use have a storage system that is fault-tolerant, scalable, and cost-efficient. And HDFS is just what we need!
- HDFS is highly scalable. It has demonstrated production scalability of up to 200 PB of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks. When the system needs to grow, it can be scaled out simply by adding commodity servers to the existing cluster.
- HDFS runs on commodity hardware, which makes it cost-efficient. Commodity servers are generally low in cost and inexpensive to procure and replace.
- Another important feature of HDFS is rack awareness. For fault tolerance, HDFS block placement uses rack awareness to store at least one replica of each block on a different rack. This keeps data highly available in the event of a network switch failure or a partition within the cluster.
- Hadoop moves computation to the data instead of moving data to the computation: the processing of a task is performed on the node where its data is stored. This reduces network I/O and provides very high aggregate network bandwidth.
- It continuously monitors the health of the file system, and data can be rebalanced across nodes when the distribution becomes uneven, for example after new nodes are added.
- It supports rolling back to the previous version after a failed upgrade, so an upgrade gone wrong does not result in data loss. This contributes to its fault tolerance.
- It supports a standby NameNode that takes over in the event of a failure of the active NameNode. This provides redundancy and high availability, making the file system fault-tolerant.
- HDFS requires minimal operator intervention, allowing a single operator to maintain a cluster of thousands of nodes.
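The standby NameNode described above is configured in `hdfs-site.xml`. The fragment below is a minimal sketch of the core HA properties, assuming a nameservice called `mycluster` with two NameNodes on hosts `nn1host` and `nn2host`; all of these names are placeholders, and a complete setup also needs shared-edits and fencing configuration.

```xml
<!-- hdfs-site.xml (sketch): logical nameservice backed by two NameNodes.
     "mycluster", "nn1host", and "nn2host" are placeholder names. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1host:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2host:8020</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

Clients address the nameservice (`mycluster`) rather than an individual NameNode host, so a failover from `nn1` to `nn2` is transparent to them.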
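The rack awareness mentioned above is typically driven by an operator-supplied topology script, which Hadoop invokes (via the `net.topology.script.file.name` property) with one or more DataNode IPs or hostnames and which must print one rack path per argument. Below is a minimal sketch of such a script; the subnet-to-rack mapping and rack names are illustrative assumptions, not part of HDFS itself.

```shell
#!/bin/bash
# Hypothetical rack topology script for HDFS rack awareness.
# Hadoop calls this with one or more DataNode IPs/hostnames and
# expects one rack path on stdout for each argument.

# Map an address to a rack path; the subnets below are assumptions
# made for this sketch.
resolve_rack() {
  case "$1" in
    10.1.1.*) echo "/dc1/rack1" ;;
    10.1.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;  # fallback rack for unknown hosts
  esac
}

# Print one rack per argument, in order.
for node in "$@"; do
  resolve_rack "$node"
done
```

With this script configured, the NameNode can tell which DataNodes share a rack and place block replicas accordingly.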
These are some of the amazing features of HDFS which make it an efficient data storage file system.