Tuesday Big Data Series

Understanding HDFS quotas

Every Hadoop system has a Hadoop administrator and Hadoop users/developers. The administrator is responsible for deploying and maintaining the entire infrastructure: cluster availability, file system management, security, installation of the latest updates, and everything else needed to keep the system up and running.

The administrator is also responsible for setting quotas for every user who is on-boarded to the cluster. Whenever a new user is on-boarded, an account is created for that user and a certain amount of quota is allotted for their data storage. There are two types of quota: the namespace quota (or name quota) and the disk space quota (or space quota). Let us understand them through an example with a few assumptions.

Suppose a user “training” is on-boarded to a cluster. Let’s say this user’s root directory is /projects/training, with a namespace quota of 20K and a disk space quota of 1 TB allotted.

Namespace Quota or Name Quota

Namespace or Name Quota is a hard limit on the number of files and directory names that can be present under the user’s root directory.

In our example, the namespace quota is the total number of files and directories that can exist under /projects/training. So our hypothetical user “training” can create up to 20K files and directories; any attempt to create more results in an error.

If the allotted namespace quota is exceeded, a “Name Space quota exceeded” error is generally thrown (in HDFS, an NSQuotaExceededException).
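To make the counting concrete, here is a minimal Python sketch of how names add up against a namespace quota. It is only a model, not how the NameNode is implemented: the function name `namespace_usage` and the sample paths are made up for illustration, and it assumes that the quota directory itself counts as one name.

```python
def namespace_usage(paths, root):
    """Count the names under root: root itself, plus every file and
    directory below it (intermediate directories included)."""
    root = root.rstrip("/")
    names = {root}  # assumption: the directory counts against its own quota
    for path in paths:
        if not path.startswith(root + "/"):
            continue  # outside the quota directory, so it does not count
        current = root
        for part in path[len(root) + 1:].split("/"):
            if not part:
                continue  # ignore empty components such as a trailing slash
            current = current + "/" + part
            names.add(current)
    return len(names)


# Hypothetical contents of the example user's directory.
sample_files = [
    "/projects/training/data/a.csv",
    "/projects/training/data/b.csv",
    "/projects/training/logs/run.log",
]

NAME_QUOTA = 20000  # the "20K" from the example, read as 20,000
used = namespace_usage(sample_files, "/projects/training")
remaining = NAME_QUOTA - used
print(used, remaining)
```

Three files in two sub-directories use six names in this model: the root, `data`, `logs`, and the three files.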

Disk space Quota or Space Quota

Disk space or Space Quota is a hard limit on the number of bytes used by the files under the user’s root directory. An important thing to remember is that it takes replication into account. So, if the replication factor is set to 3 and a file holds 1 GB of data, 3 GB is actually stored and counted against the quota.

In our example, the space quota is the total number of bytes allotted to /projects/training; writes that would exceed it result in an error. So our user “training” can store up to 1 TB including replication, which at a replication factor of 3 is roughly 341 GB of actual data.

If the disk space quota is exceeded, a “Disk Space quota exceeded” error is generally thrown (in HDFS, a DSQuotaExceededException).
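The replication arithmetic is easy to get wrong, so here is a small Python sketch of it. The helper names `raw_bytes` and `max_logical_bytes` are hypothetical, and binary units (1 GB = 1024^3 bytes) are assumed.

```python
GB = 1024 ** 3
TB = 1024 ** 4


def raw_bytes(logical_bytes, replication=3):
    """Bytes charged against the space quota for a file of the given size."""
    return logical_bytes * replication


def max_logical_bytes(space_quota, replication=3):
    """Largest amount of actual data that fits under the space quota."""
    return space_quota // replication


# 1 GB of data at replication 3 consumes 3 GB of quota.
print(raw_bytes(1 * GB) // GB)

# A 1 TB quota at replication 3 leaves room for about 341 GB of actual data.
print(round(max_logical_bytes(1 * TB) / GB, 1))
```

Lowering the replication factor of a file makes the same quota go further, since each byte of data is then charged fewer times.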

Administrator commands to set and clear quota

These commands are only available to the administrator.

Set the name quota to be N for each directory.

  • hdfs dfsadmin -setQuota <N> <directory>...<directory>

Remove any name quota for each directory.

  • hdfs dfsadmin -clrQuota <directory>...<directory>

Set the space quota to be N bytes for each directory. This is a hard limit on the total size of all the files under the directory tree. The space quota also takes replication into account, i.e. one GB of data with a replication factor of 3 consumes 3 GB of quota. For convenience, N can also be specified with a binary prefix, e.g. 50g for 50 gigabytes and 2t for 2 terabytes.

  • hdfs dfsadmin -setSpaceQuota <N> <directory>...<directory>
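To see what a value such as 50g or 2t means in bytes, here is a minimal Python sketch of the conversion. It is not the parser that hdfs itself uses: `parse_space_quota` is a hypothetical helper, only the g and t suffixes come from the text above, and the k and m suffixes are an assumption added for completeness.

```python
# Binary multipliers: each step is a factor of 1024, not 1000.
SUFFIXES = {
    "k": 1024,       # assumption, not mentioned in the text
    "m": 1024 ** 2,  # assumption, not mentioned in the text
    "g": 1024 ** 3,
    "t": 1024 ** 4,
}


def parse_space_quota(text):
    """Convert a quota such as '50g' or '2t' to a number of bytes."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)  # a plain number is already in bytes


print(parse_space_quota("50g"))
print(parse_space_quota("2t"))
```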

Remove any space quota for each directory.

  • hdfs dfsadmin -clrSpaceQuota <directory>...<directory>
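Applied to our example user, the four commands above could be assembled as follows. This is a sketch only: the helper names are hypothetical, the code just builds the argument lists and does not run anything, and the quota values (20000 names, 1t of space) are taken from the example.

```python
def set_quota_commands(directory, name_quota=None, space_quota=None):
    """Build the dfsadmin commands that set quotas on one directory."""
    commands = []
    if name_quota is not None:
        commands.append(
            ["hdfs", "dfsadmin", "-setQuota", str(name_quota), directory])
    if space_quota is not None:
        commands.append(
            ["hdfs", "dfsadmin", "-setSpaceQuota", str(space_quota), directory])
    return commands


def clear_quota_commands(directory):
    """Build the dfsadmin commands that remove both quotas from one directory."""
    return [
        ["hdfs", "dfsadmin", "-clrQuota", directory],
        ["hdfs", "dfsadmin", "-clrSpaceQuota", directory],
    ]


for cmd in set_quota_commands("/projects/training", 20000, "1t"):
    print(" ".join(cmd))
```

On a machine with the hdfs client and administrator privileges, each list could be passed to `subprocess.run(cmd, check=True)`.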

Command to report and view the quota status

  • hadoop fs -count -q [-h] [-v] <directory>...<directory>

With the -q option, also report the name quota value set for each directory, the available name quota remaining, the space quota value set, and the available space quota remaining. If the directory does not have a quota set, the reported values are none and inf. The -h option shows sizes in human-readable format. The -v option displays a header line.
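If you want to process the report in a script, one line of output can be split into named fields. The sketch below is an illustration under assumptions: the column order (quota, remaining quota, space quota, remaining space quota, directory count, file count, content size, path) is what I expect from the command, the sample line is made up rather than captured from a real cluster, and the parser only handles output produced without -h and -v.

```python
FIELDS = [
    "quota", "remaining_quota",
    "space_quota", "remaining_space_quota",
    "dir_count", "file_count", "content_size",
]


def parse_count_q(line):
    """Parse one line of 'hadoop fs -count -q' output (no -h, no -v)."""
    parts = line.split(None, 7)  # the path is the eighth and last column
    if len(parts) != 8:
        raise ValueError("expected 8 columns, got %d" % len(parts))

    def to_number(value):
        # 'none' and 'inf' mean that no quota is set on the directory.
        return None if value in ("none", "inf") else int(value)

    record = {name: to_number(value) for name, value in zip(FIELDS, parts)}
    record["path"] = parts[7]
    return record


# Made-up sample: 20000 names and 1 TB allotted, 1 GB stored at replication 3.
sample = ("20000 19994 1099511627776 1096290402304 "
          "3 3 1073741824 /projects/training")
report = parse_count_q(sample)
print(report["remaining_quota"], report["remaining_space_quota"])
```

In the sample, the remaining space quota drops by 3 GB for 1 GB of content, which is the replication effect described earlier.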

