Frequently Used Hadoop Distributed File System (HDFS) FS Commands
Hi folks! I am excited to cover some of the most frequently used HDFS commands.
If you want to try these commands, please refer to my Hadoop installation guide for Ubuntu here
Overview
The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, WebHDFS, S3 FS, and others. The FS shell is invoked by:
bin/hadoop fs <args>
All FS shell commands take path URIs as arguments. The URI format is
scheme://authority/path.
For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used.
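For example, assuming a NameNode reachable at nn.example.com (an illustrative hostname) and hdfs as the default scheme, the following two commands are equivalent:
$ hadoop fs -ls hdfs://nn.example.com/user/hadoop
$ hadoop fs -ls /user/hadoop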
Most of the commands in FS shell behave like corresponding Unix commands.
Here is a categorized, top-level view of the commands we will be covering in this article.
HDFS Top-Level Command Categories
1. dfs : General distributed file system commands
2. getconf : Gets config values from the configuration
3. namenode : Distributed file system namenode commands
4. dfsadmin : DFS administration commands
5. haadmin : DFS High Availability admin commands
6. fsck : File system check commands to verify the health of the file system
7. balancer : Block balancer commands
So let’s begin learning about each.
HDFS DFS Commands
1. Most of the Linux commands like -cp, -chown, -chmod, -df, -du, -find, -getfacl, -getfattr, -ls, -mkdir, -mv, -rm, -rmdir, -setfacl, -tail, -test, -truncate, -stat, etc. work with HDFS too.
2. -appendToFile : Appends to an existing HDFS file
3. -copyFromLocal / -copyToLocal : Copy From / To Local File System
4. -moveFromLocal / -moveToLocal : Move From / To Local File System
5. -put / -get : Copy From / To Local File System.
6. -createSnapshot / -deleteSnapshot : Create / Delete snapshots (point-in-time copies or backups of the HDFS file system)
7. -setrep : Set the replication factor for a file or directory
Let us see some of these in detail:
appendToFile
$ hadoop fs -appendToFile <localsrc> … <dst>
Appends a single source, or multiple sources, from the local file system to the destination file system. Also reads input from stdin and appends it to the destination file system.
- hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
- hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
- hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
- hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile : Reads the input from stdin.
Exit Code: Returns 0 on success and 1 on error.
copyFromLocal
$ hadoop fs -copyFromLocal <localsrc> URI
Similar to the fs -put command, except that the source is restricted to a local file reference.
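For example, to copy a local file into HDFS (paths are illustrative):
$ hadoop fs -copyFromLocal localfile /user/hadoop/hadoopfile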
copyToLocal
$ hadoop fs -copyToLocal URI <localdst>
Similar to the -get command, except that the destination is restricted to a local file reference.
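For example, to copy an HDFS file down to the local file system (paths are illustrative):
$ hadoop fs -copyToLocal /user/hadoop/hadoopfile localfile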
cp
$ hadoop fs -cp [-f] URI [URI …] <dest>
Copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
Options:
- The -f option will overwrite the destination if it already exists.
Example:
- hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
- hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
mkdir
$ hadoop fs -mkdir [-p] <paths>
Takes path URIs as arguments and creates directories.
Options:
- The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
- hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
- hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
moveFromLocal
$ hadoop fs -moveFromLocal <localsrc> <dst>
Similar to the -put command, except that the source localsrc is deleted after it is copied.
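Example (with illustrative paths):
$ hadoop fs -moveFromLocal localfile /user/hadoop/hadoopfile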
mv
$ hadoop fs -mv URI [URI …] <dest>
Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory. Moving files across file systems is not permitted.
Example:
- hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
- hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
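Two commands from the list above that we did not cover in detail are -setrep and the snapshot commands. A couple of illustrative examples (the directory and snapshot names are examples, and a directory must first be made snapshottable with hdfs dfsadmin -allowSnapshot before -createSnapshot will succeed):
$ hadoop fs -setrep -w 3 /user/hadoop/dir1
$ hdfs dfs -createSnapshot /user/hadoop/dir1 snap1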
If you are looking for a detailed explanation of each of the dfs commands and their options, please refer to this link.
HDFS getconf command
1. -namenodes : gets the list of namenodes in the cluster.
2. -secondaryNameNodes : gets the list of secondary namenodes in the cluster.
3. -backupNodes : gets the list of backup nodes in the cluster.
4. -includeFile : gets the include file path that defines the datanodes that can join the cluster.
5. -excludeFile : gets the exclude file path that defines the datanodes that need to be decommissioned.
6. -nnRpcAddresses : gets the namenode RPC addresses.
7. -confKey [key] : gets a specific key from the configuration.
Usage:
$ hdfs getconf -namenodes
$ hdfs getconf -secondaryNameNodes
$ hdfs getconf -backupNodes
$ hdfs getconf -journalNodes
$ hdfs getconf -includeFile
$ hdfs getconf -excludeFile
$ hdfs getconf -nnRpcAddresses
$ hdfs getconf -confKey [key]
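For example, to look up a single configuration value (dfs.replication is a standard HDFS key):
$ hdfs getconf -confKey dfs.replication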
HDFS DFS Admin Commands
1. -report [-live] [-dead] [-decommissioning] : Report basic file system statistics
2. -safemode <enter | leave | get | wait> : Safemode operations
3. -refreshNodes : Re-read the hosts and exclude files
4. -rollEdits : Rolls the edit log on the namenode
5. -setQuota / -clrQuota <quota> <dirname>…<dirname> : Set / Clear quota
6. -rollingUpgrade [<query | prepare | finalize>] : Rolling upgrade commands
7. -printTopology : Print the cluster topology
8. -setBalancerBandwidth <bandwidth in bytes per second> : Sets the balancer bandwidth
9. -shutdownDatanode <datanode_host:ipc_port> [upgrade] : Shut down a datanode
10. -getDatanodeInfo <datanode_host:ipc_port> : Get info about the given datanode (default ipc_port is 50020)
11. -allowSnapshot / -disallowSnapshot <snapshotDir> : Allow / Disallow snapshots on a directory
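A few illustrative invocations (the quota value and directory are examples):
$ hdfs dfsadmin -report -live
$ hdfs dfsadmin -safemode get
$ hdfs dfsadmin -setQuota 100 /user/hadoop/dir1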
If you are looking for a detailed explanation of each of the dfsadmin commands and their options, please refer to this link.
HDFS DFS HAAdmin Commands
1. -transitionToActive [--forceactive] <serviceId> : transition the state of the given NameNode to Active (Warning: no fencing is done)
2. -transitionToStandby <serviceId> : transition the state of the given NameNode to Standby (Warning: no fencing is done)
3. -failover <serviceId> <serviceId> : initiate a failover from the first given NameNode to the second
4. -getServiceState <serviceId> : get the state of the given service
5. -checkHealth <serviceId> : check the health of the given NameNode
Note : serviceId refers to a NameNode ID as defined by the dfs.ha.namenodes.<nameserviceId> property in hdfs-site.xml (e.g., dfs.ha.namenodes.mycluster).
Usage:
$ hdfs haadmin -checkHealth <serviceId>
$ hdfs haadmin -failover [--forcefence] [--forceactive] <serviceId> <serviceId>
$ hdfs haadmin -getServiceState <serviceId>
$ hdfs haadmin -help <command>
$ hdfs haadmin -transitionToActive <serviceId> [--forceactive]
$ hdfs haadmin -transitionToStandby <serviceId>
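For example, assuming two NameNodes with IDs nn1 and nn2 (illustrative IDs taken from dfs.ha.namenodes.<nameserviceId>), a manual failover from nn1 to nn2 would look like:
$ hdfs haadmin -failover nn1 nn2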
HDFS fsck command
Runs the HDFS filesystem checking utility. See fsck for more info.
1. -move : move corrupted files to /lost+found
2. -delete : delete corrupted files
3. -files : print out files being checked
4. -openforwrite : print out files opened for write
5. -includeSnapshots : include snapshot data if the given path indicates a snapshottable directory
6. -list-corruptfileblocks : print out list of missing blocks and files they belong to
7. -blocks : print out block report
8. -locations : print out locations for every block
9. -racks : print out network topology for data-node locations
10. -storagepolicies : print out storage policy summary for the blocks
11. -blockId : print out which file this blockId belongs to
Usage:
$ hdfs fsck <path>
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]]
[-includeSnapshots]
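For example, to check the entire namespace and print the files checked along with their blocks and block locations:
$ hdfs fsck / -files -blocks -locations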
HDFS balancer commands
1. -policy <policy> : the balancing policy: datanode or blockpool
a. datanode (default) : the cluster is balanced if each datanode is balanced
b. blockpool : the cluster is balanced if each block pool in each datanode is balanced
2. -threshold <threshold> : percentage of disk capacity; a datanode is considered balanced when its utilization is within this percentage of the overall cluster utilization
3. -include / -exclude : includes / excludes the specified datanodes
4. -idleiterations <idleiterations> : number of consecutive idle iterations before exiting
Usage:
$ hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-idleiterations <idleiterations>]
Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process. See Balancer for more details.
Note that the blockpool policy is more strict than the datanode policy.
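For example, to run the balancer with a tighter 5% threshold (the default threshold is 10%):
$ hdfs balancer -threshold 5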
So, that's it for this article. We gave you an overview of some frequently used HDFS FS commands. I recommend you have a look at this page on apache.org for all the other HDFS commands.
I hope this serves as a quick reference guide for revisiting the HDFS commands.
If you like the article, do leave a clap; it really motivates me to write more.
See you in another article! :) #Follow