|
Comments
Did you read today's front page stories & breaking news?
SYS-CON.TV
|
General Java Performance Tuning in Java
Performance Tuning in Java
By: Dov Kruger
Aug. 1, 2002 12:00 AM
The following mantra was first stated about two decades ago in Jon Bentley's "Programming Pearls" column defer optimization and get your code working first. This wisdom has been amplified by numerous writers on object-oriented design, coding, thinking, and more. The reigning philosophy has been stated as, "get it working first, then determine which areas are the critical ones and optimize only those." Since 20% of the code is run 80% of the time, this seems like a reasonable idea. Bentley was an early advocate of the use of profiling tools that show which parts of the code are run the most, and targeting the most critical areas first. On the whole, this is good advice, but the lesson has been learned too well. Today it's common practice to ignore efficiency, scattering layers of unnecessary inefficiency everywhere without thought. This article shows that it's just as easy to write faster code without taking extra development time to do it, and teaches something about the way Java optimizes your code.
Method Calls
What is the cost of a method call and how can you reduce it? A static method costs about three units of time on my Pentium 4 PC, where a unit is defined by an empty counting loop. A final method costs roughly the same, and an ordinary nonfinal method is about three times as expensive. The conclusion is obvious: whenever possible, use the final or static qualifiers on methods (in other words, if you don't intend to override the method, say so). Let's start by saying that if you want a program to run fast, get JDK 1.4 and run it with optimization turned on: java -server MyClass The -server option scans the entire loaded program as it's being run, eliminating methods by inlining them, turning methods into native assemblers, removing constant evaluations from loops, and other optimizations. It improves performance, often by a factor of 10 in CPU-intensive bits of code. It might surprise you to think about optimizing programs at runtime, but considering that Java runs on different machines, the only way to optimize for your particular processor is at runtime. This feature is new in 1.4. There's a "bad feature" in 1.3 that tends to invoke the JIT compiler lazily, often too late. If you compile the following program and run it under 1.3, any code in main is optimized, but code in f() is optimized only after it's called once: public static void f() { ... }In this case, since f() is called only once, it's obvious that the compiler had better optimize f before executing it, or not bother. The compiler makes certain assumptions about what is worth inlining based on the fact that inlining code can take more memory if the code is big. If you have a big routine, it won't be inlined. This can be very unfortunate in some specialized cases, which is where human intelligence comes in. Suppose you're writing f(), which calls g(), which calls h(). You're doing this just to break up the code and make it easier to read. However, even though f is the only function that calls g(), if g is big enough it won't be inlined. It doesn't matter if there's a big loop involved: void f() { g(); }because the time it takes to perform the loop dwarfs the time it takes to get in and out, and so the percentage cost of the procedure call is tiny. There is, however, a case to be made for applying human intelligence to inlining code. Frequently a method is large, but the first line of the method is a test that determines whether or not to execute the rest. Consider a logging routine. If debug mode is on, it should write out. But for all those cases where debug is false, why call the routine at all? Here's an example of this in action: private static boolean debug; In the first case, f7 calls the logging routine regardless of the state of the debugging flag: public static int f7(int n) {In the second case, f8 calls the logging routine only if debug mode is on. At the cost of an extra statement every time you call the log routine, this code runs 25% faster. public static int f8(int n) {Multithreading and Synchronization Since Java is a multithreaded language, the synchronized primitive is provided to make sure that multiple threads of execution do not destroy objects. When entering a synchronized method, it acquires a lock that's associated with the object and prevents any other synchronized method from entering. Acquiring such a lock is a slow machine language instruction. The result is that calling a synchronized method is three times as slow as an ordinary method, which in turn is three times as slow as a static or final method. The computer must first check whether someone else already has the lock, and if not, acquire the lock all in one atomic operation. public synchronized void f() { ... } Multithreading is a complex topic, and the reader is well advised to read one of the many books on the topic for a full understanding. However, a very quick overview of optimization should begin with the observation that since acquiring locks and managing them is a costly business, we should avoid it unless there is some real reason for their use. Furthermore, multithreading will only result in a gain of efficiency if we can overcome the immediate loss of efficiency that results from calling synchronized methods. In a great many situations, multithreading is not called for, and the simplest and best way to handle the situation is to say that the object is not thread-safe, and that programmers should never make two simultaneous calls to the same object. On the other hand, if multithreading has a significant advantage, the best way to achieve it, where possible, is to have more than one object and give each thread its own object. As perfect examples of this, consider IO streams in a Web environment. While there may be many threads simultaneously writing Web pages, each one is writing to its own Web page. In such cases, the IO stream for servlets could be implemented as a fast, unsynchronized version of PrintStream. Sometimes, however, there are applications (like a log) where it's vital that multiple threads be able to write to the same object. In such cases, synchronization is vital for correctness. While we can add a new class to the library to support unthread-safe IO, we must always continue to support thread-safe IO for those few cases where it's important. If you're going to acquire a lock, do so only once. Planning how locks are acquired and released is not only good optimization practice, it's worth really thinking over as this is one of those tricky areas where badly thought-out designs are not only slow, but often don't work in very subtle, nonrepeatable ways. These are the hardest possible situations to debug. Because acquiring the lock means that no one else can enter, synchronized critical sections should:
Simply removing all the synchronization from java.io.PrintWriter and writing a class that is functionally equivalent but not thread safe resulted in a 50% improvement in speed. Class PrintWriter contains synchronized methods that call other synchronized methods, in some cases three deep. The long chains of method invocation before getting to any actual code is a large part of what slows down IO.
Calling Native Methods
Having looked a bit at the implementation of the JDK, I can tell you that while it may be tweaked a bit, the reason is essentially sound to call a C++ routine, you must first make a native mode call (that's one) and then set up a call to the underlying C++ routine; twice as much work, twice as much time, right? And to communicate with anything in the Java environment takes further calls as well, so the only way you'll see a significant speed advantage is by staying in the C++ world for a while. In short, native methods seem to be totally outclassed at this point by a combination of increasingly good optimization in the Java world and the somewhat inefficient code involved in the communication between the two.
Creating Objects
In the following example, the first version, which creates only a single object and repeatedly queries it, is 800ms versus 26,300ms, or more than 30 times faster than the second one, which repeatedly re-creates the object. This is an extreme example, of course, because what is being done is very simple compared to the object creation, but it gives an idea of just how costly object creation is. public static int f10(int n) {Case Study While removing synchronization and streamlining the code path of PrintWriter resulted in a factor of two improvements in performance, eliminating the temporary string created in printing an int resulted in a sixfold performance improvement.
String Manipulation
String s = "a" + "b" + "c"; and know that StringBuffer is better: StringBuffer b = new StringBuffer();This knowledge seems to break down after this point. If you're processing large strings in StringBuffers, don't then turn them back into strings to pass them to another routine unless you're worried about multithreading problems. As long as you're processing single threaded, you're better off continuing to append into the StringBuffer until you're done. The following routine: public String getAsXML() {must make an unnecessary copy in order to turn the StringBuffer into a string. Then, if the caller is going to append more text, this string must be appended into yet another StringBuffer. This is a big waste. Instead, try: public void getAsXML(StringBuffer buf) {where the caller allocates the StringBuffer and passes it to the routine, which fills it. The caller can then continue processing. This approach has another advantage, namely that the caller usually has a much better idea of the total size of the StringBuffer at the end of processing. It is vastly more efficient, if you know how many characters are involved, to preallocate them rather than allow the StringBuffer to start at the default size of 16 and grow, which requires a lot of copying. For example, if you know the eventual size of the string will be as high as 2K, then: StringBuffer buf = new StringBuffer(2048);will typically result in approximately 100% performance improvement over the original string code. It's far better to overallocate than to underallocate and require a grow operation. Remember, this works only if the string in question is not being assaulted by multiple threads. Manipulating strings, even optimized ones, takes a fair amount of work and code, even if the string length is one. If you're processing a single character, using a char is much faster, so: buf.append('\n'); is significantly faster than: buf.append("\n");
Efficient Use of Lists
ArrayList v = new ArrayList(n);While both Vector and ArrayList use a doubling algorithm that will adaptively grab larger and larger chunks every time the size is exceeded, each time they grow an enormous expense is incurred. As with StringBuffer, it's about twice as fast to preallocate as much space as you'll need than to grow later, even if you overallocate. Last, remembering that object allocation is the slowest activity of all, you can easily see that this list, which must create object wrappers for each int, is vastly inefficient. The following code, using a list class written just for int elements (see my Web site for the code), runs a full four-and-a-half times faster than ArrayList. IntArrayList a = new IntArrayList(n);For scanning through an existing list, ArrayList is the fastest of the JDK list classes; getting an element from an array is a trivial operation, so synchronization dominates the time. Here, LinkedList can be monstrously slow if you use it incorrectly. Since LinkedList is not a random-access data structure, calling get(i) means it must start from element 0 and scan forward until it reaches position i. A loop that scans through the entire list is therefore not an O(n) operation, but O(n2). For a list of 100,000 elements, my computer performed the ArrayList traversal in 3.25 milliseconds. LinkedList traversal took an astounding 113,657 milliseconds, or 34,971 times slower. LinkedList l = l1;The correct way to code traversal through a LinkedList is to use the iterator design pattern: LinkedList l = l1;The lesson to be learned here is that it pays to understand your data structures well. Just choosing the right data structure for your situation can pay enormous dividends. And using one incorrectly, as in the case of LinkedList traversal, can be very costly. Last, if you want to store a list of primitives, the best way would be to have classes designed for the purpose, like IntArrayList. No one wants to go to the expense of writing and maintaining all permutations of lists for all the primitive data types; this is one reason Java needs a high-quality template facility like C++. That's a topic for another day, but one that I hope to revisit in a future article. For now, a friend and I are proposing some primitive list classes to add to the Java library, because when you do want a list of primitives, there's no substitute for a decent data structure.
Maps
Collisions cannot be totally eliminated in the general case, so the design of hash algorithms must allow for them. Therefore, each bin in the HashMap is essentially a linked list for all the keys and values that could hypothetically end up there. This means that every time you add an element to a HashMap, you're once again creating an object that holds the key, the value, and a reference to the next node in case any more values happen to land in the same bin. Object creation is the most expensive operation possible, so I've tried a different approach and have on my site a couple of experimental classes that perform twice as fast as HashMap (FastHashMap) or four times as fast if your key is an int (FastIntHashMap). They do, however, achieve part of their spectacular speed by not checking the size each time a new element is added, so you must allocate the right size table in advance. As with all other Java data structures, if you add too many elements to a Hashtable or HashMap, they grow. This is the worst thing you can do, since growing requires painfully reinserting every element. Hashing requires about 2530% more bins than there are elements for efficient operation. Always preallocate what you think is the right size for your Hashtable, be generous, and check at the end to be sure you were right and that the table did not have to grow. Last, because the hash algorithm for strings looks at every character in the string, avoid hashing large strings if at all possible. The smaller the string, the faster the hash.
Strength Reduction
Having admitted that most strength reductions are things compilers should do, if your compiler doesn't do them (and Java didn't used to), then it's up to you to do them yourself. In doing so, there are a number of issues: Will the resulting code be as simple as or simpler than the original? Gaining a little speed while losing understandability is not a great bargain. Will the resulting code be faster? Programmers often assume they're optimizing, when in fact they're doing the reverse. The kind of clock cycle counting is certainly better done by a compiler, with knowledge of the target CPU and environment if at all possible. The good news is JDK 1.4 now does some strength reduction. It's up to you to decide how much speed you need now. First, what not to do. Multiplications by the constant power of 2 are automatically converted to shifts by the computer: x * 2 x << 1More complex, but not worth it, are multiplications by constants: x * 10 (x << 3) + (x << 1)Divisions are not supported at the moment, but will be soon. If you need the speed right now, the speed of the division itself is five or six times faster. x / 2 x >> 1A much more important strength reduction, and one that the JIT is not likely to detect in the near future, also involves division. Often, programmers want to go around a loop, but do something different every n times. One standard trick is to count and take the counter modulo n, as in the following example: for (int i = 0; i < 100000; i++)This is slow; the following is four to five times faster: for (int i = 0, j = 10; i < 100000; i++,j--) {Similarly, if you have code in a loop like: j = (j + 1) % n;it's much faster to write: if (++j == n)In general, for any positive number x, x % n is equivalent to x & (n-1) if n is a power of two. So x % 8 == x & 7 as long as x is positive. Using the & operator is a lot faster.
Summary
Reader Feedback: Page 1 of 1
Your Feedback
Latest Cloud Developer Stories
Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
|
SYS-CON Featured Whitepapers
Most Read This Week
Breaking Cloud Computing News |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||