Comments
Patrick Collands wrote: collands (AT) gmail com I'd be very grateful for an invitation. Thank you.
Cloud Expo on Google News

SYS-CON.TV

2009 East
PLATINUM SPONSORS:
IBM
Smarter Business Solutions Through Dynamic Infrastructure
IBM
Smarter Insights: How the CIO Becomes a Hero Again
Microsoft
Windows Azure
GOLD SPONSORS:
Appsense
Why VDI?
CA
Maximizing the Business Value of Virtualization in Enterprise and Cloud Computing Environments
ExactTarget
Messaging in the Cloud - Email, SMS and Voice
Freedom OSS
Stairway to the Cloud
Sun
Sun's Incubation Platform: Helping Startups Serve the Enterprise
POWER PANELS:
Click For 2008 West
Event Webcasts
Is Your JIT Telling You Lies?
Even the most sophisticated developer should avoid using micro-benchmarks as predictors of application performance

Writing meaningful Java benchmarks is a tricky business. It's well known that the Java Virtual Machine's just in time (JIT) compilation process means that running an application for a few seconds won't let you predict the performance of the application over hours or days of uptime. In spite of this, developers often rely on micro-benchmarks to set performance SLAs for their applications.

Micro-benchmarks test some small, discrete component of an application. They're usually written in an effort to benchmark a component considered critical to the app's overall performance. Here's a typical example, summing all the numbers from one to a specified limit:

long accumulatedTotal(int limit) {
long result=0L;
for (int i=1; i<=limit; i++) {
result += i;
}
return result;
}

How long does this method take to execute for different values of limit? On my 32-bit OpenSolaris machine using Java 6u16 the first call to this method shows a non-linear runtime for small values of limit:

The method is initially executed by an interpreter and the JIT compiler doesn't kick in until the JVM detects that the method is doing a lot of looping. It then takes some time to compile the method, during which time the loop continues to run under the interpreter. Once this process is complete, the JVM can jump into the compiled version of the method and the average runtime for the remaining iterations is greatly reduced. The upshot is that for small values of limit, the runtime is non-linear, but as limit gets much larger it becomes much more predictable.

Benchmarks like SPECjvm2008 know about this quirk in the behavior of the JVM and try to account for it. SPECjvm2008 repeatedly runs its benchmark code for a fixed "warm-up" period - typically 120 seconds. The warm-up time must be long enough to allow compilation of all the performance-critical methods in the benchmark code. You can observe the compilation process in the JVM by adding the Java command-line option -XX:+PrintCompilation. The benchmark then continues to execute for a "run" period - for SPECjvm2008 the default duration is 240 seconds - and the benchmark result is the number of iterations of the benchmark that were completed per minute of the run period.

But does this approach really predict application performance? As so often with the JVM, it turns out to be more complicated than you expect.

A modern JVM such as Sun's Java 6 contains a JIT compiler that is capable of generating native code every bit as efficient as that produced by an ahead-of-time (AOT) C++ compiler. Among the many optimizations performed by the JIT is method inlining. Calling small methods can incur a large overhead - for example, Java classes routinely contain getter and setter methods that simply read or write the value of a variable. The cost of calling a trivial method can be high relative to the work performed by the method, so the JIT will attempt to copy the body of a small method into its caller.

Consider a simple getter method like this:

Class Limit {
private int limit;
public Limit(int limit) {
this.limit = limit;
}
public int getLimit() {
return limit;
}
}

If this is called from a loop like this, where "o" is an instance of class Limit:

for (int i=0; i<o.getLimit(); i++) {
// Do something
}
then it would be nice to inline the getLimit() method so that the resulting code looks like this:
for (int i=0; i<o.limit; i++) {
// Do something
}

In an AOT compiler, this inlining might not be possible. For example, the object "o" might not be an instance of Limit but of some subclass of Limit that overrides the getLimit() method. Because it compiles with knowledge of runtime behavior and can throw away compiled code if necessary, the JIT can decide to inline Limit.getLimit(). If "o" is ever seen to be an instance of a subclass of Limit, then the compiled code can be discarded and the loop recompiled appropriately.

Let's take a look at the Arrays.sort(Object[] a) method. The javadocs say that all elements in the array must implement the Comparable interface. Arrays.sort() make many calls to the compareTo() method of the elements it is sorting. The compareTo() method in class Integer is very simple:

public int compareTo(Integer anotherInteger) {
int thisVal = this.value;
int anotherVal = anotherInteger.value;
return (thisVal<anotherVal ? -1 : (thisVal==anotherVal ? 0 : 1));
}

If you only ever call Arrays.sort() on arrays consisting entirely of Integers, the JIT compiler can detect this and inline the Integer.compareTo() method into the sort().

What happens if you later call Arrays.sort() to sort an array of instances of class Short? The compiled code performs a quick check on each array element to ensure that it is an Integer. When this check fails, the compiled code is discarded and Arrays.sort() is recompiled to account for the new set of observed array element types.

The updated compiled code will now inline both Integer.compareTo() and Short.compareTo(). The decision about which inlined method to execute will be taken by explicitly checking whether the array element is of class Integer or Short. Even with two different implementors of the Comparable interface, the JIT compiler is still able to perform method inlining.

Let's go even further and call Arrays.sort() on an array of instances of class Byte. If the JIT compiler now inlines Byte.compareTo() as well, we'll have three different inlined versions of this method in the Arrays.sort() code. This is all starting to get out of hand!

Now the JIT compiler changes strategy. It throws away all its previous inlining of the compareTo() method and does a traditional vtable dispatch just like a static compiler would do. For very small methods such as Integer.compareTo(), this can have a significant performance impact as the time spent calling and returning from such a tiny method may be far greater than the time spent executing its code.

The limitation that only two receivers of a virtual method dispatch can be inlined is known as the bimorphic inlining policy and is a poorly understood but significant limitation on JIT performance.

Listing 1 contains a benchmark program that shows this effect in action.

The test code calls Arrays.sort() five times. First it calls it on an array of Byte objects. This is simply a "warm up" and the results are ignored. It then calls Arrays.sort() on an array of Bytes, then on an array of Shorts, then an array of Integers, and finally Arrays.sort() is called for a second time on an array of Bytes.

Here's the result of running this benchmark using 32-bit Java 6u16 on my OpenSolaris machine. For this test the arrays to be sorted have 100,000 elements:

Java SortBench 100000
(WARMUP) BYTE SORT PERFORMANCE = 1778 operations per minute
(FIRST PASS) BYTE SORT PERFORMANCE = 2034 operations per minute
SHORT SORT PERFORMANCE = 1623 operations per minute
INTEGER SORT PERFORMANCE = 1261 operations per minute
(SECOND PASS) BYTE SORT PERFORMANCE = 1307 operations per minute

As you can see, the warm-up performance is - as expected - a bit lower than for the first pass of the byte sort test. However the performance of subsequent sort tests falls away quite sharply and the performance of the second byte sort test is only 64% of the performance seen first time round.

For complex server-side applications where it's likely that Arrays.sort() will be called on by many different classes, the real performance when sorting 100,000 random byte instances is likely to be about 1,307 operations per minute, not 2,034 operations per minute as the "standard" benchmarking approach suggests.

This limitation is also important in attempting to benchmark Scala code. Scalac - the Scala compiler - rewrites Scala code blocks as standard Java classes that implement a special Scala interface. This interface requires a method called apply() that executes the code block. As you create code blocks in Scala, you are actually creating tiny Java classes that implement an interface. This is the same as the scenario when calling Arrays.sort() with different implementors of Comparable.

Take a look at the Scala benchmark in Listing 2. The benchmark repeatedly sums all the numbers from 1 to 1 million for a specified period of time and then prints out the number of iterations completed and the accumulated result.

Notice that the same code block is used for all executions of the benchmark. This looks more straightforward than the Java case seen previously. But here's the output from this program:

scala Test
WARM-UP RUN : 7628 operations per minute
FIRST RUN : 7713 operations per minute
SECOND RUN : 6185 operations per minute
THIRD RUN : 6134 operations per minute

Even though the code block is identical in all four cases, each code block is handled inside its own unique class. Once again the effect of the bimorphic inlining policy in the JIT compiler is to generate more performant code for the runBench method when only one or two implementors of the interface have been seen.

The bimorphic inlining policy is a great example of why micro-benchmarking isn't just hard in Java - it's all but impossible. Even if you know about this issue and work around it, the probability is that there will be something else in the JVM to trip you up and give you phony results. Generating masses of micro-benchmark numbers can be worse than useless - they give the erroneous impression of scientific accuracy that can lead to performance SLAs that can't be met in production.

Bottom line - don't use micro-benchmarks as predictors of application performance. Run your app for real and profile, then profile again, and then profile some more!

About Paul Murray
Paul Murray is a freelance specialist in Java performance analysis. He was previously a Technology Fellow at a leading Wall Street Investment Bank where he provided Java performance and scalability consultancy to several thousand Java developers. Paul was also a Member of Technical Staff at Silicon Graphics, working on dynamic compiler implementation and porting Sun's JRE to the SGI IRIX platform.

In order to post a comment you need to be registered and logged in.

Register | Sign-in

Reader Feedback: Page 1 of 1

Latest Cloud Developer Stories
The Enterprise Cloud Requires a real time infrastructure and a management discipline that understands and can enforce service level discipline.
CloudBench Applications, Inc. announced its financial results for the three months and nine months ending September 30, 2009. All amounts are stated in Canadian dollars unless otherwise noted. Revenues from BasicGov, the Company's cloud computing solution for local government, gr...
The new contract is an industry first, with CSC being the first Microsoft partner to lead and win a cloud computing services agreement of this scale. Under terms of the contract, CSC will provide Royal Mail Group's 30,000 employees with access to new IT services using Microsoft's...
Operates in over 170 countries and is one of the world’s leading providers of communications solutions and services. Richard Tarboton talks for MeettheBoss.TV on his role as Head of Energy & Carbon for BT and what they are doing towards reducing carbon emissions.
CA is going to put its Agile Planner software on salesforce.com’s Force.com platform in the first half to accelerate development time and give users visibility over their development initiatives to reduce time-to-market. Customers are supposed to be able to accelerate the deploym...
Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021

SYS-CON Featured Whitepapers
ADS BY GOOGLE

Breaking Cloud Computing News
CloudBench Applications, Inc. announced its financial results for the three months and nine months e...