Java 8 is on its way, bringing a host of new features to the most widely used language on the JVM. The most talked-about feature is likely lambdas, at which Scala and JRuby aficionados will breathe a sigh of “finally”. Less flashy, but very important to certain classes of multithreaded applications, is the addition of LongAdder and DoubleAdder, atomic Number implementations that offer superior performance to AtomicInteger and AtomicLong under contention from multiple threads.
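The two classes have slightly different APIs: AtomicLong updates a single value with compare-and-swap operations, while LongAdder accumulates into internal cells and produces a total on demand via sum(). A minimal sketch of both (class name ours, for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class AdderBasics {
    public static void main(String[] args) {
        // AtomicLong: one shared value, updated with CAS loops.
        AtomicLong atomic = new AtomicLong();
        atomic.incrementAndGet();
        atomic.addAndGet(5);

        // LongAdder: updates are spread across internal cells;
        // sum() adds the cells up when you read.
        LongAdder adder = new LongAdder();
        adder.increment();
        adder.add(5);

        System.out.println(atomic.get()); // 6
        System.out.println(adder.sum());  // 6
    }
}
```

Note that sum() is not an atomic snapshot while writers are active, which is fine for statistics-style counters, the use case LongAdder targets.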
Some simple benchmarking illustrates the performance difference between the two—for the following benchmarks we used an m3.2xlarge EC2 instance, which provides access to all 8 cores of an Intel Xeon E5-2670.
With a single thread, the new LongAdder is one third slower, but when threads are in contention to increment the field, LongAdder shows its value. Note that the only thing each thread is doing is attempting to increment the counter—this is a synthetic benchmark of the most extreme kind. The contention here is higher than you’re likely to see in most real-world apps, but sometimes you do need this sort of shared counter, and LongAdder will be a big help.
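The shape of that benchmark can be sketched with plain threads. This is a rough standalone version, not our actual benchmark code: the class name, thread count, and iteration count here are illustrative, and it lacks JMH's warmup and measurement rigor.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

// Every thread does nothing but bump a shared counter -- the most
// extreme contention possible, matching the benchmark described above.
public class ContentionSketch {
    static final int THREADS = 8;
    static final long INCREMENTS_PER_THREAD = 1_000_000L;

    static long timeMillis(Runnable incrementOnce, LongSupplier readSum)
            throws InterruptedException {
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            workers[i] = new Thread(() -> {
                for (long n = 0; n < INCREMENTS_PER_THREAD; n++) {
                    incrementOnce.run();
                }
            });
        }
        long start = System.nanoTime();
        for (Thread t : workers) t.start();
        for (Thread t : workers) t.join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Sanity check: no increments were lost.
        if (readSum.getAsLong() != THREADS * INCREMENTS_PER_THREAD) {
            throw new IllegalStateException("lost updates");
        }
        return elapsedMs;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();
        System.out.println("AtomicLong: " + timeMillis(atomic::incrementAndGet, atomic::get) + " ms");
        System.out.println("LongAdder:  " + timeMillis(adder::increment, adder::sum) + " ms");
    }
}
```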
You can find the code for these benchmarks in our java-8-benchmarks repository. It uses JMH to do all of the real work and Marshall’s gradle-jmh-demo for plumbing. JMH makes benchmarking easy by doing all of the fiddly little bits for you, ensuring the resulting numbers represent the state of the art in JVM-based benchmarking accuracy. JMH isn’t amenable to running under perf, though, so we also have some simple standalone benchmarks for that.
More details with perf-stat
We wrote standalone benchmarks so that we could have more control and run them under perf-stat to get some more details about what is going on. The most basic thing is the wall clock time that each benchmark run took. These benchmarks are all run on an Intel Core i7-2600K (real hardware, not virtualized).
While AtomicLong is a bit quicker in the single-threaded case, it quickly loses ground to LongAdder, running nearly 4x slower with two threads and nearly 5x slower with one thread per core. More impressive is that LongAdder’s performance is constant until the number of threads exceeds the CPU’s number of physical cores (in this case 4).
Instructions per cycle
Instructions per cycle measures how often the CPU has work to do versus waiting for memory to load or cache coherency protocols to settle. In this case, we see that AtomicLong has disastrously bad IPC with many threads, while LongAdder maintains a much healthier IPC. The falloff from 4 to 8 threads is likely because this CPU has 4 cores with 2 hardware threads each, and the hardware threads don’t actually help in this case.
The execution pipeline on the processor is divided into two major groups: the front end, responsible for fetching & decoding operations, and the back end, which executes the instructions. There isn’t much interesting happening with operation fetching, so let’s skip the front end.
Activity on the back end gives more insight into what is going on, showing the AtomicLong implementation leaving more than twice as many cycles idle. AtomicLong’s high idle time is consistent with its poor instructions per cycle: the CPU’s cores are spending a lot of time negotiating which of them owns the cache line containing the AtomicLong’s value.
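LongAdder sidesteps that fight by spreading updates across multiple cells, so threads rarely CAS against the same memory location. The idea can be illustrated with a toy striped counter. To be clear, this is not how LongAdder is actually implemented (the real class grows its cell array on contention, rehashes threads on CAS failure, and pads cells to avoid false sharing); the class name and stripe-selection scheme here are ours.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Toy illustration of striping: each thread tends to hit its own
// slot, so contended CAS retries on one hot location are replaced
// by mostly-uncontended updates to separate slots. Reads sum all
// slots. (Adjacent array slots can still share a cache line, which
// is why the real LongAdder pads its cells.)
public class StripedCounter {
    private final AtomicLongArray cells;

    public StripedCounter(int stripes) {
        this.cells = new AtomicLongArray(stripes);
    }

    public void increment() {
        // Pick a slot by thread id -- a simplification; LongAdder
        // instead rehashes a per-thread probe when a CAS fails.
        int index = (int) (Thread.currentThread().getId() % cells.length());
        cells.incrementAndGet(index);
    }

    public long sum() {
        long total = 0;
        for (int i = 0; i < cells.length(); i++) {
            total += cells.get(i);
        }
        return total;
    }
}
```

The trade-off is visible in the API: increments get cheaper under contention, but reading the total requires touching every cell, which is why LongAdder suits write-heavy counters that are read occasionally.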