Saturday, September 17, 2011

Race conditions are the result of lazy coders

Today I got a ticket about a race condition my boss identified in a self-written java data class(Java's  GreogorianCalendar class is extremely slow and just crap). The problem is very subtle and hard to reproduce.

Take a look at the following sample Java snippet:

class DateTime {
  MyCache datetime = null;

  private void initDateTime() {
    // Creates a fresh MyCache object
    // and initializes datetime.colonTime to null
    datetime = new MyCache();
  }

  String getColonTime() {
    // datetime is some kind of 
    // lazy caching variable declared 
    // somewhere(does not matter)
    if (datetime == null) {
      //Uses lazy to initlialize variable, takes some time
      initDateTime(); 
  }

  // Colon time stores hh:mm as string
  if (datetime.colonTime == null) {
    StringBuilder sb = new StringBuilder();
    //Now do some steps to build the hh:mm string
    //...
    //set colon time
    datetime.colonTime = sb.toString();
  }
  return datetime.colonTime;
  }

  public static class MyCache() {
    String colonTime = null;
    public MyCache() {}
  }
}





The problem is that the getColonTime( ) method sometimes returns null although it always should return a time in the format hh:mm, like 17:49. Your first reaction sure is "No, it will always return something, because datetime.colonTime gets the value of the StringBuilder."
You're - of course - wrong, but it is very normal that you thing this way. The little devil is sitting in the details, waiting with his triton to peak you really hard.

Assume that we have two threads A and B who call getColonTime() at the same time (or nearly, if you only got one CPU). Threads are managed by something called a scheduler, who determines which thread has to pause and which one has to be run next. There are various scheduling algorithms, but most of them act like threads were having a party and the scheduler has to divide the CPU time cake. He does by serving each thread a piece of the cake while the other threads got to wait for their own slice.
So each thread has a defines period of time it is allowed to run before being paused again by the scheduler (e.g. it ate his piece of cake and now enqueues again to get a new one). Oh and our little devil is the boss of the scheduler ;)

Back to the problem:
What happens if two threads A and B call getColonTime() at the same time? First, A enters the first if block, because datetime is null.  As it happens, the scheduler decides to pause A before it can execute initDateTime(). Now B can run, it enters the method, jumps into the first if block and creates a new MyCache instance. Notice that, as the comment says, this also will set datetime.colonTime to null.
It then resumes and executes the contents of the second if block. It's a really widely used lazy pattern - initialize the variable's value when needed once and afterwards only return that value on subsequent calls.
So, B reaches the line datetime.colonTime = sb.toString() and our colon time member now got a value and continues to just before the return statement. Our evil scheduler now might think that this might be a good opportunity to pause A and resume  B.

Thread B now calls initDateTime() the second time. The result is, guess what, a new MyCache instance is being assigned to datetime. As our intention was to initialize datetime.colonTime lazy, the value of the latter is null again. So where's the problem? It will run through the second if block again, gets the value of the StringBuilder and everything is fine again, isn't it?
Surely not. Little devil has one of its really evil days and pauses thread A before it can execute the datetime.colonTime = sb.toString() command. As a result, datetime.colonTime is still null.
Again, thread A is resumed. There's only the return statement left, so the method will return the value of datetime.colonTime, namely null.

I took a lot of efforts to write a unit test which triggers this problem at least sometimes, lets say for one in a hundred calls, but I failed. The probability for this to occur is so low that the result was null only in 5 or 6 in total of all test runs. Each test runs that method about 200 000 times using multiple threads and I did a lot of test runs....

So whats next?
I will have to make sure that I can write a reasonable test which provokes these errors. While browsing the internet, I stumbled upon a really young project called thread weaver which looks really promising Maybe I'll give it a shot on monday.