Saturday July 24, 2010

ZooKeeper loss of events problem... fixed

In my latest project at LinkedIn, I have been using ZooKeeper to track the state of all deployed services on every machine in production. The state is then used to drive the deployment as well as monitor the system. I ran into a really tricky bug which took me several months to address (the hardest part was finding what the actual problem was). The (simplified) code was the following:

ZooKeeper zk = ...
Map state = [:]

synchronized(lock) { state = trackChildren(state) }

// handle add/delete of children
private Map trackChildren(Map oldState) {
  def newState = new HashMap()
  def children = zk.getChildren('/state', childrenWatcher as Watcher)
  children.each { child -> 
    if(oldState[child] == null) // ADD
      trackChild(newState, "/state/${child}")
    } else { 
      newState[child] = oldState[child] // no change
    }
  }
  // DELETE: newState does not contain the children that have disappeared
  return newState
}

// handle child update
private void trackChild(Map newState, String path)
{
  if(path == null) return
  def child = new File(path).name
  try { newState[child] = zk.getData(path, childWatcher as Watcher, null) }
  catch(NoNodeException e) {  /* ok node is gone */ newState.remove(child) }
}

def childrenWatcher = { WatchedEvent event -> 
  if(event.type == EventType.NodeChildrenChanged) {
    synchronized(lock) { state = trackChildren(state) }
  }
}

def childWatcher = { WatchedEvent event -> 
  def child = event.path ? new File(event.path).name : null
  synchronized(lock) {
    if(event.type == EventType.NodeDataChanged) { 
      trackChild(state, event.path)
    }
    if(event.type == NodeDeleted) { state.remove(child) }
  }
}
What this code essentially does is the following:
  • Sets a children watcher on /state to be notified of children addition/deletion
  • For each child, sets a node watcher to be notified of child modification and deletion. Note that you need this watcher because ZooKeeper will not call the parent watcher when a child gets modified!
  • As a result, the map called state always contain a 'copy' of the data of all children.
From the ZooKeeper documentation, there are some key concepts to remember about watches:
  1. Watches are one time triggers; if you get a watch event and you want to get notified of future changes, you must set another watch.
  2. Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)
  3. A watch object, or function/context pair, will only be triggered once for a given notification. For example, if the same watch object is registered for an exists and a getData call for the same file and that file is then deleted, the watch object would only be invoked once with the deletion notification for the file.
The code I wrote seems to handle #1 perfectly. On the other end, it does not handle #2 properly and this is where the bug gets triggered. I ran into a situation where a server would delete and rewrite the data very fast:
byte[] child1State = ...
zk.delete('/state/child1', -1)
zk.setData('/state/child1', child1State, -1)
In this scenario, I would receive 2 events:
  1. /state childrenWatcher would be fired with a NodeChildrenChanged event. And the bug gets triggered here. According to key concept #2, essentially events can be collapsed: in this case I am getting only 1 NodeChildrenChanged event and when I list my children (getChildren), I actually see no difference: child1 has been removed and added, but I don't see it. I am effectively loosing the 'add' child event with the code written this way.
  2. /state/child1 childWatcher would we fired with a NodeDeleted event resulting in state.remove('child1')... at this point I have indeed lost the state of child1.
With the following code on the server side:
byte[] child1State = ...
zk.delete('/state/child1', -1)
Thread.sleep(5000)
zk.setData('/state/child1', child1State, -1)
I actually get 2 events and I have enough time to process them so it works. But that is clearly not the right fix. In order to fix the problem, I did 2 things:
  1. The children watcher is only dealing with 'ADDs' and does not handle 'DELETE' anymore
  2. The child watcher handles 'DELETE' but also tries to set a watcher no matter what
ZooKeeper zk = ...
Map state = [:]

synchronized(lock) { state = trackChildren(state) }

// handle ADD only... let the children watcher handle DELETE
private Map trackChildren(Map oldState) {
  def newState = new HashMap(oldState)
  def children = zk.getChildren('/state', childrenWatcher as Watcher)
  children.each { child -> 
    if(newState[child] == null) 
      trackChild(newState, "/state/${child}")
  }
  return newState
}

// handle child update
private void trackChild(Map newState, String path)
{
  if(path == null) return
  def child = new File(path).name
  try { newState[child] = zk.getData(path, childWatcher as Watcher, null) }
  catch(NoNodeException e) {  /* ok node is gone */ newState.remove(child) }
}

def childrenWatcher = { WatchedEvent event -> 
  if(event.type == EventType.NodeChildrenChanged) {
    synchronized(lock) { state = trackChildren(state) }
  }
}

def childWatcher = { WatchedEvent event -> 
  def child = event.path ? new File(event.path).name : null
  synchronized(lock) {
    if(event.type == EventType.NodeDataChanged) { trackChild(state, event.path) }
    if(event.type == NodeDeleted) { 
      state.remove(child)
      // WARNING!!! UNINTUITIVE BUT CRITICAL!!!
      trackChild(state, event.path)
    }
  }
}
The code (in the childWatcher) most likely looks very unintuitive, but what is really happening in this case, is that the child on which the watcher is set will always receive the NodeDeleted event and according to key concept #1, if you want to be able to receive further events you need to set another watcher. It seems unintuitive to set a watcher on a node that just got deleted (and most of the time it will trigger a NoNodeException which is handled properly), but between the time the node was deleted and the time you check, it may very well have been recreated in which case you guarantee that you will have a watcher set on it and you won't loose it. If the node was deleted and not recreated right away, then the childrenWatcher will be the one setting it up again in the future if it gets recreated (pay close attention to the synchronization in the code as it is pretty critical for the system to work properly). Here is an example:
  • /state/child1 simply deleted
      1. childrenWatcher receives NodeChildrenChanged => does nothing
      2. childWatcher receives NodeDeleted => state.remove('child1')
    • or
      1. childWatcher receives NodeDeleted => state.remove('child1')
      2. childrenWatcher receives NodeChildrenChanged => does nothing
  • /state/child1 deleted / recreated (events get collapsed)
      1. childrenWatcher receives NodeChildrenChanged => does nothing
      2. childWatcher receives NodeDeleted => state.remove('child1') then state['child1']=new state + watcher
    • or
      1. childWatcher receives NodeDeleted => state.remove('child1') then state['child1']=new state + watcher
      2. childrenWatcher receives NodeChildrenChanged => does nothing
  • /state/child1 deleted / recreated (events are not collapsed)
      1. childrenWatcher receives NodeChildrenChanged => does nothing
      2. childWatcher receives NodeDeleted => state.remove('child1')
      3. childrenWatcher receives NodeChildrenChanged => state['child1']=new state + watcher
    • or
      1. childWatcher receives NodeDeleted => state.remove('child1')
      2. childrenWatcher receives NodeChildrenChanged => state['child1']=new state + watcher
    • or
      1. childrenWatcher receives NodeChildrenChanged => does nothing
      2. childWatcher receives NodeDeleted => state.remove('child1') then state['child1']=new state + watcher
      3. childrenWatcher receives NodeChildrenChanged => does nothing
As always with this kind of problem, the fix is really not that complicated. It is understanding the problem in the first place which is hard... You can check the ZooKeeper javadoc on kiwidoc :)

July 24, 2010 by Yan Pujante

Posted in java | 0 Comments »

Saturday July 10, 2010

Indexing android 'froyo' javadoc in kiwidoc

I just indexed android 'froyo' javadoc in kiwidoc and it was quite a challenge.

1. Downloading the source code

You have to go to http://source.android.com/ to find instructions on how to get it. This part is actually quite challenging because there are lots of steps to simply 'get it'. I understand why it is being done this way: you end up with a ready to develop/contribute environment. On the other end, if you simply want to look at the source code it is quite complex and potentially time consuming.

2. Downloading the 'correct' version of the source code

I know it is written in the manual but it is not very intuitive to download the correct version. Under the cover, it uses git but it is wrapped by a high level command called repo. When you use plain git in a project, you simply clone the project and then you can switch to a branch. When you use the repo command, you end up with an error message if you try to follow the git paradigm.
> repo branches
   (no branches)
> repo checkout froyo
error: no project has branch froyo
It is very important to 'clone' with the correct version originally:
repo init -u git://android.git.kernel.org/platform/manifest.git -b froyo 
I am sure there are very good reasons to do it this way, but it would certainly be easier if the command were to behave like git.

3. Compiling android

I use a Mac OS X machine (Snow Leopard) and bumped into a lot of issues:
  • Snow Leopard is not supported
  • It requires a case-sensitive file system (which is not the default on Mac OS X)
  • It requires Java 1.5 (which is eol now...)
Thankfully I had an old drive with Leopard on it and was able to eventually compile it using this version of the OS and creating a case-sensitive partition.

4. Make

For my own purpose I actually only needed the javadoc and the java source code. After running a full build (I just issued the make command) which took several hours to run (no kidding), there was no javadoc. Not being very familiar with make, it took me a while to figure out that you can run the command:
make -p > android-make.db
which "Print make's internal database". In this case it generated a 156MB file! Sifting through this file, I was able to identify the command I needed:
make offline-sdk-docs
which generated the javadoc.

5. kiwidoc generation

By slightly tweaking the javadoc command in the make file, I was able to generate the information I needed for kiwidoc: mainly all the source files to take into account as well as the classpath. Once I had this information, actually indexing it tooks 20s.

6. Android versions

I think that the android version naming, although I understand why it is being done this way, is quite confusing:
  • Platform version (2.2 is the 'current' latest one)
  • API Level (8 is the 'current' latest one)
  • Code Name (froyo is the 'current' latest one)
Code names are cool but it make it very hard to remember to which actual version they tie to: the Android API Levels page does not even mention codenames (so how do you know which version is 'eclair' for example ?). In kiwidoc, I chose to use the api level as the version of the library because it is what reflects what you program against: there could be a new platform version (ex: 2.2.1) without a change in api level. What matters from an application developer point of view is the api level.

7. Conclusion

It took me a really long time to index froyo (api level 8) but I succeeded which is very satisfying. I also 'promoted' the android platform all the way to the home page.

July 10, 2010 by Yan Pujante

Posted in pongasoft | 0 Comments »