Saturday, February 6, 2010

Hadoop AvatarNode High Availability



Our Use-Case

The Hadoop Distributed File System's (HDFS) NameNode is a single point of failure. This has been a major stumbling block in using HDFS for a 24x7 type of deployment, and it has been a topic of discussion among a wide circle of engineers.

I am part of a team that is operating a cluster of 1200 nodes with a total size of 12 PB. This cluster is currently running hadoop 0.20. The NameNode is configured to write its transaction logs to a shared NFS filer. The biggest cause of service downtime is when we need to deploy hadoop patches to our cluster. A fix to the DataNode software is easy to deploy without cluster downtime: we deploy new code and restart a few DataNodes at a time. We can afford to do this because the DFSClient is equipped (via multiple retries to different replicas of the same block) to handle transient DataNode outages. The major problem arises when new software has to be deployed to our NameNode. An HDFS NameNode restart for a cluster of our size typically takes about an hour, and during this time the applications that use the Hadoop service cannot run. This led us to believe that some kind of High Availability (HA) for the NameNode is needed. We wanted a design that can support a failover in about a minute.
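
To make the shared-storage setup concrete, here is a minimal sketch of how the NameNode's image and edits directories could point at both a local disk and the NFS filer. The dfs.name.dir property is the standard Hadoop 0.20 key for these directories; the mount points below are invented for illustration and are not our actual paths.

    // Minimal sketch: point the NameNode's image/edits at a local disk and
    // the shared NFS filer. Paths are illustrative, not our real layout.
    import org.apache.hadoop.conf.Configuration;

    public class SharedEditsConfig {
      public static Configuration namenodeConf() {
        Configuration conf = new Configuration();
        // One local copy plus one copy on the NFS filer.
        conf.set("dfs.name.dir", "/hadoop/dfs/name,/mnt/nfs-filer/dfs/name");
        return conf;
      }

      public static void main(String[] args) {
        System.out.println(namenodeConf().get("dfs.name.dir"));
      }
    }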

Why AvatarNode?

We took a serious look at the options available to us. The HDFS BackupNode is not available in 0.20, and upgrading our cluster to hadoop 0.21 is not a feasible option for us because it would have to go through extensive, time-consuming testing before we could deploy it in production. We started off white-boarding the simplest solution: the existence of a Primary NameNode and a Standby NameNode in our HDFS cluster. And we did not have to worry about split-brain scenarios and IO-fencing methods... the administrator ensures that the Primary NameNode is really, really dead before converting the Standby NameNode into the Primary NameNode. We also did not want to change the code of the NameNode one single bit; we wanted to build HA as a software layer on top of HDFS. We wanted a daemon that can switch from being a Primary to a Standby and vice-versa... as if switching avatars... and this is how our AvatarNode was born!

The word avatar means a variant phase or incarnation of a single entity, and we thought it appropriate that a wrapper that can make the NameNode play two different roles be aptly named AvatarNode. Our design of the AvatarNode is such that it is a pure wrapper around the existing NameNode in Hadoop 0.20; thus the AvatarNode inherits the scalability, performance and robustness of the NameNode.

What is an AvatarNode?

The AvatarNode encapsulates an instance of the NameNode. You can start an AvatarNode in the Primary avatar or the Standby avatar. If you start it as a Primary avatar, it behaves exactly like the NameNode you know now. In fact, it runs exactly the same code as the NameNode. The AvatarNode in its Primary avatar writes the HDFS transaction logs to the shared NFS filer. You can then start another instance of the AvatarNode on a different machine, but this time telling it to start as a Standby avatar.
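
Here is a schematic sketch of the wrapper idea, written for illustration only; it is not the code attached to HDFS-976, and the class and method names (Avatar, startNameNode, startIngestion) are placeholders made up for this post.

    // Schematic sketch of the AvatarNode wrapper; all names are placeholders.
    public class AvatarNodeSketch {

      public enum Avatar { PRIMARY, STANDBY }

      private final Avatar avatar;

      public AvatarNodeSketch(Avatar avatar) {
        this.avatar = avatar;
      }

      public void start() {
        if (avatar == Avatar.PRIMARY) {
          // Runs the unmodified 0.20 NameNode code; edits go to the NFS filer.
          startNameNode(false /* do not force SafeMode */);
        } else {
          // Runs the same NameNode code, but pinned in SafeMode, plus a thread
          // that tails the shared edits log (see the ingestion sketch below).
          startNameNode(true /* stay in SafeMode */);
          startIngestion();
        }
      }

      private void startNameNode(boolean staySafe) { /* wraps the 0.20 NameNode */ }
      private void startIngestion()                { /* tails the shared edits log */ }

      public static void main(String[] args) {
        Avatar a = args.length > 0 && args[0].equals("-standby")
            ? Avatar.STANDBY : Avatar.PRIMARY;
        new AvatarNodeSketch(a).start();
      }
    }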

The Standby AvatarNode encapsulates a NameNode and a SecondaryNameNode within it. The Standby AvatarNode continuously reads the HDFS transaction logs from the same NFS filer and keeps feeding those transactions to the encapsulated NameNode instance. The NameNode that runs within the Standby AvatarNode is kept in SafeMode. Keeping the Standby AvatarNode in SafeMode prevents it from performing any active duties of the NameNode while keeping a copy of all NameNode metadata hot in its in-memory data structures.
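
The ingestion loop on the Standby can be pictured roughly as follows. This is a simplified sketch: applyNewTransactions is a placeholder for the real edits-log parsing, not an actual HDFS API.

    // Simplified sketch of the Standby's ingestion loop.
    import java.io.File;

    public class StandbyIngestSketch implements Runnable {
      private final File sharedEdits;          // the edits file on the NFS filer
      private volatile boolean running = true;
      private long consumedOffset = 0;         // how far into the edits log we have read

      public StandbyIngestSketch(File sharedEdits) {
        this.sharedEdits = sharedEdits;
      }

      public void run() {
        while (running) {
          // Re-read any transactions appended by the Primary since the last pass
          // and feed them to the encapsulated NameNode, which stays in SafeMode
          // so it never performs active NameNode duties itself.
          consumedOffset = applyNewTransactions(sharedEdits, consumedOffset);
          sleepQuietly(1000);
        }
      }

      private long applyNewTransactions(File edits, long fromOffset) {
        // placeholder: parse records after fromOffset and apply them to the
        // in-memory namespace of the encapsulated NameNode
        return fromOffset;
      }

      private static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
      }

      public void stopIngestion() { running = false; }
    }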

HDFS clients are configured to access the NameNode via a Virtual IP Address (VIP). At the time of failover, the administrator kills the Primary AvatarNode on machine M1 and instructs the Standby AvatarNode on machine M2 (via a command line utility) to assume the Primary avatar. The Standby AvatarNode is guaranteed to ingest all committed transactions because it reopens the edits log and consumes all transactions up to the end of the file; this guarantee depends on the fact that NFSv3 supports close-to-open cache coherency semantics. The Standby AvatarNode finishes ingesting all transactions from the shared NFS filer and then leaves SafeMode. The administrator then switches the VIP from M1 to M2, causing all HDFS clients to start accessing the NameNode instance on M2. This is practically instantaneous and the failover typically takes a few seconds!
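
Spelled out as code, the failover sequence looks roughly like this. Every helper below (ensurePrimaryIsDead, reopenAndConsumeEdits, leaveSafeMode, moveVip) stands in for an operator action or an AvatarNode step and is not a real API.

    // Sketch of the manual failover sequence; all helpers are placeholders.
    public class FailoverSketch {

      public void failover(String m1, String m2) {
        // 1. The administrator makes sure the Primary on M1 is really dead;
        //    there is no automatic fencing in this design.
        ensurePrimaryIsDead(m1);

        // 2. The Standby on M2 reopens the shared edits file and consumes every
        //    transaction up to end-of-file. NFSv3 close-to-open cache coherency
        //    guarantees it sees all committed transactions.
        reopenAndConsumeEdits(m2);

        // 3. The Standby leaves SafeMode and assumes the Primary avatar.
        leaveSafeMode(m2);

        // 4. The VIP is moved from M1 to M2, so clients follow transparently.
        moveVip(m1, m2);
      }

      private void ensurePrimaryIsDead(String host)   { }
      private void reopenAndConsumeEdits(String host) { }
      private void leaveSafeMode(String host)         { }
      private void moveVip(String from, String to)    { }
    }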

There is no need to run a separate instance of the SecondaryNameNode. The AvatarNode, when run in the Standby avatar, performs the duties of the SecondaryNameNode too. The Standby AvatarNode continually ingests transactions from the Primary, but once in a while it pauses the ingestion, invokes the SecondaryNameNode to create and upload a checkpoint to the Primary AvatarNode, and then resumes the ingestion of transaction logs.
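
A rough sketch of how checkpointing can be interleaved with ingestion is shown below; pauseIngestion, doCheckpointAndUpload and resumeIngestion are placeholders for the encapsulated SecondaryNameNode machinery, and the hourly period is just an example.

    // Sketch: the Standby alternates between ingesting edits and checkpointing.
    public class StandbyCheckpointSketch implements Runnable {
      private static final long CHECKPOINT_PERIOD_MS = 60L * 60L * 1000L; // e.g. hourly

      public void run() {
        while (true) {
          sleepQuietly(CHECKPOINT_PERIOD_MS);
          pauseIngestion();          // stop consuming the shared edits log
          doCheckpointAndUpload();   // merge image+edits, upload to the Primary
          resumeIngestion();         // continue tailing the shared edits log
        }
      }

      private void pauseIngestion()        { }
      private void doCheckpointAndUpload() { }
      private void resumeIngestion()       { }

      private static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
      }
    }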

The AvatarDataNode

The AvatarDataNode is a wrapper around the vanilla DataNode found in hadoop 0.20. It is configured to send block reports and blockReceived messages to two AvatarNodes. The AvatarDataNode does not use the VIP to talk to the AvatarNode(s); only HDFS clients use the VIP. An alternative approach would have been to make the Primary AvatarNode forward block reports to the Standby AvatarNode. We discounted this approach because it adds code complexity to the Primary AvatarNode: the Primary AvatarNode would have to do buffering and flow control as part of forwarding block reports.

The Hadoop DataNode is already equipped to send (and retry if needed) blockReceived messages to the NameNode. We extend this functionality by making the AvatarDataNode send blockReceived messages to both the Primary and Standby AvatarNodes.
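
The dual-reporting idea can be sketched as follows; the types and RPC helpers below are placeholders and not the actual DataNode protocol classes.

    // Sketch: the AvatarDataNode reports to both AvatarNodes.
    import java.util.Arrays;
    import java.util.List;

    public class AvatarDataNodeSketch {
      private final List<String> avatarNodes;   // Primary and Standby, by host:port

      public AvatarDataNodeSketch(String primary, String standby) {
        this.avatarNodes = Arrays.asList(primary, standby);
      }

      public void sendBlockReport(long[] blockIds) {
        for (String target : avatarNodes) {
          // placeholder RPC; the real DataNode retries on failure
          rpcBlockReport(target, blockIds);
        }
      }

      public void sendBlockReceived(long blockId) {
        for (String target : avatarNodes) {
          rpcBlockReceived(target, blockId);
        }
      }

      private void rpcBlockReport(String target, long[] blockIds) { }
      private void rpcBlockReceived(String target, long blockId)  { }
    }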

Does the failover affect HDFS clients?

An HDFS client that is reading a file caches the locations of the replicas of almost all blocks of that file at the time the file is opened. This means that HDFS readers are not impacted by an AvatarNode failover.
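
As a small illustration of why readers survive a failover, the standard FileSystem API below opens a file (which is when the client obtains block locations from the NameNode) and then streams data directly from the DataNodes; the path is made up.

    // Illustration: a reader keeps streaming from DataNodes during a failover.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReaderDuringFailover {
      public static long countBytes(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        // Block locations are obtained (and cached by the client) at open time.
        FSDataInputStream in = fs.open(new Path("/user/example/large-file"));
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        // Reads go directly to the DataNodes; a NameNode failover in the
        // middle of this loop does not interrupt it.
        while ((n = in.read(buf)) > 0) {
          total += n;
        }
        in.close();
        return total;
      }
    }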

What happens to files that were being written by a client when the failover occurred? The Hadoop 0.20 NameNode does not record new block allocations to a file until the file is closed. This means that a client that was writing to a file when the failover occurred will encounter an IO exception after a successful failover event. This is somewhat tolerable for map-reduce jobs because the Hadoop MapReduce framework will retry any failed tasks. This solution also works well for HBase because HBase issues a sync/hflush call to persist HDFS file contents before marking an HBase transaction as complete. In the future, it is possible that HDFS may be enhanced to record every new block allocation in the HDFS transaction log, details here. In that case, a failover event will not impact HDFS writers either.
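
For completeness, here is a hedged sketch of what a writer-side retry could look like in application code: if the write fails because a failover killed the file-in-progress, the application simply recreates the file and writes it again, much as the MapReduce framework reruns a failed task. This is illustrative code, not part of the AvatarNode patch.

    // Illustrative writer retry: recreate the file if a failover aborts the write.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RetryingWriter {
      public static void writeWithRetry(Configuration conf, Path file,
                                        byte[] data, int attempts) throws IOException {
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
          try {
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(file, true /* overwrite */);
            out.write(data);
            out.close();             // blocks become permanent only at close in 0.20
            return;
          } catch (IOException e) {
            last = e;                // likely a failover; start the file over
          }
        }
        throw last;
      }
    }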

Other NameNode Failover techniques

People have asked me to compare the AvatarNode HA implementation with a Hadoop setup that uses DRBD and Linux-HA. Both approaches require that the Primary write its transaction logs to a shared device. The difference is that the Standby AvatarNode is a hot standby whereas the DRBD-LinuxHA solution is a cold standby. The failover time for our AvatarNode is about a minute, whereas a DRBD-LinuxHA failover would take about an hour for our cluster, which has about 50 million files. The reason the AvatarNode can function as a hot standby is that the Standby AvatarNode has an instance of the NameNode encapsulated within it, and this encapsulated NameNode receives messages from the DataNodes, thus keeping its metadata state fully up-to-date.

Another implementation that I have heard about is from a team at the China Mobile Research Institute; their approach is described here.

Where is the code?

This code has been contributed to the Apache HDFS project via HDFS-976. A prerequisite for this patch is HDFS-966.