|
Comments
Did you read today's front page stories & breaking news?
SYS-CON.TV
|
Features Is Your JIT Telling You Lies?
Even the most sophisticated developer should avoid using micro-benchmarks as predictors of application performance
By: Paul Murray
Sep. 23, 2009 05:30 PM
Writing meaningful Java benchmarks is a tricky business. It's well known that the Java Virtual Machine's just in time (JIT) compilation process means that running an application for a few seconds won't let you predict the performance of the application over hours or days of uptime. In spite of this, developers often rely on micro-benchmarks to set performance SLAs for their applications. Micro-benchmarks test some small, discrete component of an application. They're usually written in an effort to benchmark a component considered critical to the app's overall performance. Here's a typical example, summing all the numbers from one to a specified limit: long accumulatedTotal(int limit) { How long does this method take to execute for different values of limit? On my 32-bit OpenSolaris machine using Java 6u16 the first call to this method shows a non-linear runtime for small values of limit:
The method is initially executed by an interpreter and the JIT compiler doesn't kick in until the JVM detects that the method is doing a lot of looping. It then takes some time to compile the method, during which time the loop continues to run under the interpreter. Once this process is complete, the JVM can jump into the compiled version of the method and the average runtime for the remaining iterations is greatly reduced. The upshot is that Benchmarks like SPECjvm2008 know about this quirk in the behavior of the JVM and try to account for it. SPECjvm2008 repeatedly runs its benchmark code for a fixed "warm-up" period - typically 120 seconds. The warm-up time must be long enough to allow compilation of all the performance-critical methods in the benchmark code. You can observe the compilation process in the JVM by adding the Java command-line option -XX:+PrintCompilation. The benchmark then continues to execute for a "run" period - for SPECjvm2008 the default duration is 240 seconds - and the benchmark result is the number of iterations of the benchmark that were completed per minute of the run period. But does this approach really predict application performance? As so often with the JVM, it turns out to be more complicated than you expect. A modern JVM such as Sun's Java 6 contains a JIT compiler that is capable of generating native code every bit as efficient as that produced by an ahead-of-time (AOT) C++ compiler. Among the many optimizations performed by the JIT is method inlining. Calling small methods can incur a large overhead - for example, Java classes routinely contain getter and setter methods that simply read or write the value of a variable. The cost of calling a trivial method can be high relative to the work performed by the method, so the JIT will attempt to copy the body of a small method into its caller. Consider a simple getter method like this: Class Limit { If this is called from a loop like this, where "o" is an instance of class Limit: for (int i=0; i<o.getLimit(); i++) { In an AOT compiler, this inlining might not be possible. For example, the object "o" might not be an instance of Limit but of some subclass of Limit that overrides the getLimit() method. Because it compiles with knowledge of runtime behavior and can throw away compiled code if necessary, the JIT can decide to inline Limit.getLimit(). If "o" is ever seen to be an instance of a subclass of Limit, then the compiled code can be discarded and the loop recompiled appropriately. Let's take a look at the Arrays.sort(Object[] a) method. The javadocs say that all elements in the array must implement the Comparable interface. Arrays.sort() make many calls to the compareTo() method of the elements it is sorting. The compareTo() method in class Integer is very simple: public int compareTo(Integer anotherInteger) { If you only ever call Arrays.sort() on arrays consisting entirely of Integers, the JIT compiler can detect this and inline the Integer.compareTo() method into the sort(). What happens if you later call Arrays.sort() to sort an array of instances of class Short? The compiled code performs a quick check on each array element to ensure that it is an Integer. When this check fails, the compiled code is discarded and Arrays.sort() is recompiled to account for the new set of observed array element types. The updated compiled code will now inline both Integer.compareTo() and Short.compareTo(). The decision about which inlined method to execute will be taken by explicitly checking whether the array element is of class Integer or Short. Even with two different implementors of the Comparable interface, the JIT compiler is still able to perform method inlining. Let's go even further and call Arrays.sort() on an array of instances of class Byte. If the JIT compiler now inlines Byte.compareTo() as well, we'll have three different inlined versions of this method in the Arrays.sort() code. This is all starting to get out of hand! Now the JIT compiler changes strategy. It throws away all its previous inlining of the compareTo() method and does a traditional vtable dispatch just like a static compiler would do. For very small methods such as Integer.compareTo(), this can have a significant performance impact as the time spent calling and returning from such a tiny method may be far greater than the time spent executing its code. The limitation that only two receivers of a virtual method dispatch can be inlined is known as the bimorphic inlining policy and is a poorly understood but significant limitation on JIT performance. Listing 1 contains a benchmark program that shows this effect in action. The test code calls Arrays.sort() five times. First it calls it on an array of Byte objects. This is simply a "warm up" and the results are ignored. It then calls Arrays.sort() on an array of Bytes, then on an array of Shorts, then an array of Integers, and finally Arrays.sort() is called for a second time on an array of Bytes. Here's the result of running this benchmark using 32-bit Java 6u16 on my OpenSolaris machine. For this test the arrays to be sorted have 100,000 elements: Java SortBench 100000 As you can see, the warm-up performance is - as expected - a bit lower than for the first pass of the byte sort test. However the performance of subsequent sort tests falls away quite sharply and the performance of the second byte sort test is only 64% of the performance seen first time round. For complex server-side applications where it's likely that Arrays.sort() will be called on by many different classes, the real performance when sorting 100,000 random byte instances is likely to be about 1,307 operations per minute, not 2,034 operations per minute as the "standard" benchmarking approach suggests. This limitation is also important in attempting to benchmark Scala code. Scalac - the Scala compiler - rewrites Scala code blocks as standard Java classes that implement a special Scala interface. This interface requires a method called apply() that executes the code block. As you create code blocks in Scala, you are actually creating tiny Java classes that implement an interface. This is the same as the scenario when calling Arrays.sort() with different implementors of Comparable. Take a look at the Scala benchmark in Listing 2. The benchmark repeatedly sums all the numbers from 1 to 1 million for a specified period of time and then prints out the number of iterations completed and the accumulated result. Notice that the same code block is used for all executions of the benchmark. This looks more straightforward than the Java case seen previously. But here's the output from this program: scala Test Even though the code block is identical in all four cases, each code block is handled inside its own unique class. Once again the effect of the bimorphic inlining policy in the JIT compiler is to generate more performant code for the runBench method when only one or two implementors of the interface have been seen. The bimorphic inlining policy is a great example of why micro-benchmarking isn't just hard in Java - it's all but impossible. Even if you know about this issue and work around it, the probability is that there will be something else in the JVM to trip you up and give you phony results. Generating masses of micro-benchmark numbers can be worse than useless - they give the erroneous impression of scientific accuracy that can lead to performance SLAs that can't be met in production. Bottom line - don't use micro-benchmarks as predictors of application performance. Run your app for real and profile, then profile again, and then profile some more! Reader Feedback: Page 1 of 1
Latest Cloud Developer Stories
Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
|
SYS-CON Featured Whitepapers
Most Read This Week
Breaking Cloud Computing News
|
|||||||||||||||||||||||||||||||||||||||||||||||||