Monday, June 22, 2009

Hadoop at Netflix

Netflix is interested in using Hadoop/Hive to process click logs from the users of their website. Here is what I presented to them in a meeting that was well attended by about 50 engineers. After the meeting, a bunch of engineers asked me questions about the integration of Scribe and HDFS and how Facebook imports click logs into Hadoop.

Here is a copy of my presentation: slides.

Saturday, June 6, 2009

HDFS Scribe Integration

It is finally here: you can configure the open-source log aggregator, Scribe, to log data directly into the Hadoop Distributed File System (HDFS).

Many Web 2.0 companies have to deploy a bank of costly filers to capture the weblogs generated by their applications; until now there has been no cheaper option because the write rate of this stream is huge. The Hadoop-Scribe integration allows this write load to be distributed across a bunch of commodity machines, thus reducing the total cost of the infrastructure.

The challenge was to make HDFS near-real-time in behaviour. Scribe uses libhdfs, the C interface to the HDFS client, and various bugs in libhdfs needed to be fixed first. Then came the FileSystem API. One of the major issues was that FileSystem caches handles and always returns the same handle when called from multiple threads, with no reference counting of the handle. This caused problems for Scribe, because Scribe is highly multi-threaded. A new API, FileSystem.newInstance(), was introduced to support Scribe.
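
To make the difference concrete, here is a minimal Java sketch (not the actual Scribe code; the namenode address and file path are placeholders) contrasting the cached handle returned by FileSystem.get() with the private handle returned by FileSystem.newInstance():

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ScribeLikeWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        URI uri = URI.create("hdfs://namenode1:9000");  // placeholder namenode address

        // FileSystem.get() hands every caller the same cached handle, so a
        // close() in one thread breaks all other threads sharing that handle.
        FileSystem shared = FileSystem.get(uri, conf);

        // FileSystem.newInstance() gives each writer thread its own handle.
        FileSystem own = FileSystem.newInstance(uri, conf);
        FSDataOutputStream out = own.create(new Path("/scribedata/test/part-000"));
        out.write("one log line\n".getBytes("UTF-8"));
        out.close();
        own.close();  // safe: closes only this thread's private handle
      }
    }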

Making the HDFS write code path more real-time was painful. There are various timeouts and settings in HDFS that were hardcoded and needed to be changed to allow the application to fail fast. At the bottom of this blog post, I am attaching the settings we currently use to make HDFS writes very real-timeish. The last of the JIRAs, HADOOP-2757, is in the pipeline to be committed to Hadoop trunk very soon.

What about the Namenode being a single point of failure? That is acceptable for a warehouse type of application but cannot be tolerated by a realtime application. Scribe typically aggregates click logs from a bunch of webservers, and losing *all* click-log data of a website for 10 minutes or so (the minimum time for a namenode restart) cannot be tolerated. The solution is to configure two overlapping clusters on the same hardware: run two separate namenodes, N1 and N2, on two different machines; run one set of datanode processes on all slave machines that report to N1, and another set of datanode processes on the same slave machines that report to N2. The two datanode instances on a single slave machine share the same data directories. This configuration keeps HDFS highly available for writes!
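
As a rough sketch of what such an overlapping setup can look like (this is not our production configuration: the hostnames, ports, and directories are placeholders, and the datanode key names are the ones used in recent Hadoop releases, so they may differ slightly in 0.17), each slave carries two config directories that point at different namenodes but the same data directories, with the second instance moved to non-default ports:

    <!-- conf-n1/hadoop-site.xml: the datanode instance that reports to N1 -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode1:9000</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/data/d1,/data/d2</value>
    </property>

    <!-- conf-n2/hadoop-site.xml: the datanode instance that reports to N2 -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode2:9000</value>
    </property>
    <property>
      <!-- same data directories as the N1 instance -->
      <name>dfs.data.dir</name>
      <value>/data/d1,/data/d2</value>
    </property>
    <property>
      <!-- non-default ports so the two datanode instances do not collide -->
      <name>dfs.datanode.address</name>
      <value>0.0.0.0:51010</value>
    </property>
    <property>
      <name>dfs.datanode.http.address</name>
      <value>0.0.0.0:51075</value>
    </property>
    <property>
      <name>dfs.datanode.ipc.address</name>
      <value>0.0.0.0:51020</value>
    </property>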

The highly-available-for-writes HDFS configuration is also useful for software upgrades on the cluster: we can shut down one of the overlapping HDFS clusters, upgrade it to the new Hadoop software, and put it back online before starting the same process for the second HDFS cluster.

What were the main changes needed in Scribe? Scribe already had the ability to buffer data when it is unable to write to the configured storage, and its default behaviour is to replay that buffer to the storage when the storage comes back online. Scribe was extended so that it can be configured not to replay the buffer when the primary storage returns. Scribe-hdfs is configured to write data to a cluster N1 and, if N1 fails, to write to cluster N2; Scribe treats N1 and N2 as two equivalent primary stores. The scribe store configuration should have fs_type=hdfs. For scribe compilation, you can use ./configure --enable-hdfs LDFLAGS="-ljvm -lhdfs". A good example of a scribe-hdfs configuration is the file hdfs_example2.conf in the scribe code base.
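
For illustration, here is a rough sketch of such a store definition, loosely modeled on the hdfs_example2.conf layout; the cluster addresses, paths, sizes, and rotation settings below are placeholders rather than values taken from that file:

    <store>
    category=default
    type=buffer
    target_write_size=20480
    max_write_interval=1
    buffer_send_rate=2
    retry_interval=30
    replay_buffer=no

    <primary>
    type=file
    fs_type=hdfs
    file_path=hdfs://namenode1:9000/scribedata
    base_filename=thisisoverwritten
    use_hostname_sub_directory=yes
    create_symlink=no
    max_size=1000000000
    rotate_period=daily
    rotate_hour=0
    rotate_minute=5
    add_newlines=1
    </primary>

    <secondary>
    type=file
    fs_type=hdfs
    file_path=hdfs://namenode2:9000/scribedata
    base_filename=thisisoverwritten
    max_size=1000000000
    add_newlines=1
    </secondary>
    </store>

The intent of this sketch is that the buffer store writes to the primary (N1), fails over to the secondary (N2) when N1 is down, and with replay_buffer=no does not push the buffered data back to N1 when it returns.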

Here are the Hadoop 0.17 configuration settings (in hadoop-site.xml) needed by an application doing writes in real time:

<property>
  <name>ipc.client.idlethreshold</name>
  <value>10000</value>
  <description>Defines the threshold number of connections after which
  connections will be inspected for idleness.</description>
</property>

<property>
  <name>ipc.client.connection.maxidletime</name>
  <value>10000</value>
  <description>The maximum time in msec after which a client will bring down the
  connection to the server.</description>
</property>

<property>
  <name>ipc.client.connect.max.retries</name>
  <value>2</value>
  <description>Indicates the number of retries a client will make to establish
  a server connection.</description>
</property>

<property>
  <name>ipc.server.listen.queue.size</name>
  <value>128</value>
  <description>Indicates the length of the listen queue for servers accepting
  client connections.</description>
</property>

<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
  <description>Turn on/off Nagle's algorithm for the TCP socket connection on
  the server. Setting to true disables the algorithm and may decrease latency
  with a cost of more/smaller packets.</description>
</property>

<property>
  <name>ipc.client.tcpnodelay</name>
  <value>true</value>
  <description>Turn on/off Nagle's algorithm for the TCP socket connection on
  the client. Setting to true disables the algorithm and may decrease latency
  with a cost of more/smaller packets.</description>
</property>

<property>
  <name>ipc.ping.interval</name>
  <value>5000</value>
  <description>The client sends a ping message to the server every period. This is helpful
  to detect socket connections that were idle and have been terminated by a failed server.</description>
</property>

<property>
  <name>ipc.client.connect.maxwaittime</name>
  <value>5000</value>
  <description>The client waits for this much time for a socket connect call to be established
  with the server.</description>
</property>

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>20000</value>
  <description>The DFS client waits for this much time for a socket write call to the datanode.</description>
</property>

<property>
  <name>ipc.client.ping</name>
  <value>false</value>
  <description>See HADOOP-2757.</description>
</property>