Java 8 Performance Improvements: LongAdder vs AtomicLong

Java 8 is on its way, bringing a host of new features to the most widely used language on the JVM. The most-discussed feature will likely be lambdas, which Scala and JRuby aficionados will greet with a sigh of “finally”. Less flashy, but very important to certain classes of multithreaded applications, is the addition of LongAdder and DoubleAdder: thread-safe Number implementations in java.util.concurrent.atomic that perform far better than AtomicInteger and AtomicLong when many threads update them at once.
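For readers who haven’t used the new classes yet, here is a minimal sketch of the API difference. The class and method names (AdderBasics, atomicCount, adderCount, doubleTotal) are hypothetical; only the java.util.concurrent.atomic calls are real. The key difference is that AtomicLong.incrementAndGet() returns the updated value, while LongAdder.increment() returns nothing and the total is only computed when you call sum().

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

public class AdderBasics {
    // AtomicLong: every update hits one shared value and returns the new total.
    static long atomicCount(int n) {
        AtomicLong counter = new AtomicLong();
        for (int i = 0; i < n; i++) {
            counter.incrementAndGet();
        }
        return counter.get();
    }

    // LongAdder: increment() returns void; the total is computed on demand by sum().
    static long adderCount(int n) {
        LongAdder counter = new LongAdder();
        for (int i = 0; i < n; i++) {
            counter.increment();
        }
        return counter.sum();
    }

    // DoubleAdder: the floating-point counterpart, e.g. for accumulating totals.
    static double doubleTotal(double[] values) {
        DoubleAdder total = new DoubleAdder();
        for (double v : values) {
            total.add(v);
        }
        return total.sum();
    }

    public static void main(String[] args) {
        System.out.println(atomicCount(1000));                          // 1000
        System.out.println(adderCount(1000));                           // 1000
        System.out.println(doubleTotal(new double[] {0.5, 1.5, 2.0}));  // 4.0
    }
}
```

That missing return value is the trade-off: if each caller needs the post-increment value (for example to hand out unique IDs), AtomicLong is still the right tool, while LongAdder suits counters that are written often and read rarely, such as statistics.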

Some simple benchmarking illustrates the performance difference between the two. For the following benchmarks we used an m3.2xlarge EC2 instance, which provides access to all 8 cores of an Intel Xeon E5-2670.

With a single thread, LongAdder is about one third slower than AtomicLong, but once threads contend to increment the same field, LongAdder shows its value. Note that each thread does nothing but try to increment the counter: this is a synthetic benchmark of the most extreme kind. The contention here is higher than you’re likely to see in most real-world apps, but when you do need a heavily shared counter, LongAdder will be a big help.
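The workload described above can be sketched with the standard library alone. This is a hypothetical reconstruction, not the code from our repository: SharedCounterDemo, hammer, and the thread and iteration counts are illustrative. A CountDownLatch releases all workers at once so that they actually contend, and each worker does nothing but increment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class SharedCounterDemo {
    // Starts `threads` workers that do nothing but call `increment`
    // `perThread` times each, then waits for all of them to finish.
    static void hammer(Runnable increment, int threads, int perThread) {
        CountDownLatch start = new CountDownLatch(1); // releases all workers at once
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    increment.run();
                }
            });
            workers.add(worker);
            worker.start();
        }
        start.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    static long withAtomicLong(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        hammer(counter::incrementAndGet, threads, perThread); // return value discarded
        return counter.get();
    }

    static long withLongAdder(int threads, int perThread) {
        LongAdder counter = new LongAdder();
        hammer(counter::increment, threads, perThread);
        // sum() is exact here only because every worker has been joined;
        // it is not an atomic snapshot while updates are still in flight.
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(withAtomicLong(4, 100_000)); // 400000
        System.out.println(withLongAdder(4, 100_000));  // 400000
    }
}
```

Both versions arrive at the same total; what the benchmark measures is only how long the threads take to get there.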

You can find the code for these benchmarks in our java-8-benchmarks repository. It uses JMH to do all of the real work and Marshall’s gradle-jmh-demo for plumbing. JMH makes benchmarking easy by handling the fiddly details for you (warmup, forking, and guarding against dead-code elimination), so the resulting numbers are as accurate as JVM benchmarking currently allows. JMH isn’t amenable to running under perf, though, so we also wrote some simple standalone benchmarks for that.

More details with perf-stat

We wrote standalone benchmarks so that we could have more control and run them under perf-stat to get more detail about what is going on. The most basic measurement is the wall-clock time each benchmark run took. These benchmarks all ran on an Intel Core i7-2600K (real hardware, not virtualized).
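A standalone harness of this kind might look like the sketch below. It is hypothetical (CounterBench and its argument order are our invention for illustration, not the repository’s code): it picks the counter type and thread count from the command line, times one run with System.nanoTime, and checks that no increments were lost. Because it is an ordinary JVM process, it can be launched under perf, e.g. `perf stat java CounterBench atomic 4 100000000`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical standalone benchmark (not the code in the java-8-benchmarks repo).
// Usage: java CounterBench <atomic|adder> [threads] [incrementsPerThread]
public class CounterBench {
    // Starts the workers, waits for them, and returns elapsed wall-clock nanoseconds.
    static long timeNanos(Runnable increment, int threads, long perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (long i = 0; i < perThread; i++) {
                    increment.run();
                }
            });
        }
        long begin = System.nanoTime();
        for (Thread w : workers) {
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - begin;
    }

    // Runs one benchmark, prints the timing, and returns the final count.
    static long run(String kind, int threads, long perThread) {
        final long elapsed;
        final long total;
        if (kind.equals("adder")) {
            LongAdder adder = new LongAdder();
            elapsed = timeNanos(adder::increment, threads, perThread);
            total = adder.sum();
        } else if (kind.equals("atomic")) {
            AtomicLong atomic = new AtomicLong();
            elapsed = timeNanos(atomic::incrementAndGet, threads, perThread);
            total = atomic.get();
        } else {
            throw new IllegalArgumentException("expected 'atomic' or 'adder', got " + kind);
        }
        System.out.printf("%s threads=%d elapsed=%.1f ms%n", kind, threads, elapsed / 1e6);
        return total;
    }

    public static void main(String[] args) {
        String kind = args.length > 0 ? args[0] : "adder";
        int threads = args.length > 1 ? Integer.parseInt(args[1])
                                      : Runtime.getRuntime().availableProcessors();
        long perThread = args.length > 2 ? Long.parseLong(args[2]) : 10_000_000L;
        long total = run(kind, threads, perThread);
        if (total != threads * perThread) {
            throw new IllegalStateException("lost updates: " + total);
        }
    }
}
```

Keep in mind that perf stat counts the whole JVM process, including startup and JIT compilation, so the per-thread workload should be large enough for the counting loop to dominate.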

AtomicLong is a bit quicker in the single-threaded case, but it rapidly loses ground to LongAdder: it is nearly 4x slower with two threads and nearly 5x slower once the thread count matches the machine’s cores. More impressive still, LongAdder’s performance stays constant until the number of threads exceeds the CPU’s physical cores (4 in this case).

Instructions per cycle

Instructions per cycle (IPC) measures how much of the time the CPU has work to do versus how much it spends waiting for memory loads or for cache coherency protocols to settle. Here AtomicLong has disastrously bad IPC with many threads, while LongAdder maintains a much healthier IPC. The falloff from 4 to 8 threads is likely because this CPU has 4 cores with 2 hardware threads (Hyper-Threading) each, and the extra hardware threads don’t help in this workload.
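For reference, the IPC figure that perf stat prints alongside its instruction count is simply retired instructions divided by elapsed core cycles:

$$
\mathrm{IPC} = \frac{\text{instructions retired}}{\text{core cycles}}
$$

A core stalled waiting for the cache line that holds a contended counter keeps accumulating cycles while retiring few instructions, which is what drags the ratio down.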

Idle time

The processor’s execution pipeline is divided into two major parts: the front end, responsible for fetching and decoding instructions, and the back end, which executes them. Nothing very interesting happens in instruction fetching here, so let’s skip the front end.
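The back-end idle percentage that perf stat reports is derived from its generic stalled-cycles-backend event, roughly as follows (assuming the event is supported, which depends on the CPU and kernel):

$$
\text{back-end cycles idle}\ (\%) = 100 \times \frac{\texttt{stalled-cycles-backend}}{\texttt{cycles}}
$$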

Activity on the back end gives more insight into what is going on: the AtomicLong implementation leaves more than twice as many back-end cycles idle. This high idle time mirrors AtomicLong’s poor instructions per cycle, since the CPU’s cores spend much of their time negotiating which of them gets to own the cache line containing the AtomicLong.
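To see why LongAdder sidesteps this fight over a single cache line, here is a deliberately simplified striped counter. It illustrates the idea only and is not the JDK’s implementation: StripedCounter is hypothetical, and it assumes 64-byte cache lines. The real LongAdder (via its Striped64 superclass) starts with a single base field, creates cells only once it sees contention, grows the cell table up to roughly the number of CPUs, pads each cell with @sun.misc.Contended, and picks cells using a per-thread probe hash that it re-hashes after a failed CAS.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLongArray;

// Simplified illustration of striping; NOT how java.util.concurrent.atomic.LongAdder works internally.
public class StripedCounter {
    // Assumes 64-byte cache lines: 8 longs * 8 bytes = 64 bytes between used
    // slots, so no two stripes can share a cache line.
    private static final int PAD = 8;

    // Hands each new thread the next slot number, round-robin.
    private static final AtomicInteger NEXT_SLOT = new AtomicInteger();
    private static final ThreadLocal<Integer> SLOT =
            ThreadLocal.withInitial(NEXT_SLOT::getAndIncrement);

    private final int stripes;
    private final AtomicLongArray cells;

    public StripedCounter(int stripes) {
        if (stripes < 1) throw new IllegalArgumentException("stripes must be >= 1");
        this.stripes = stripes;
        this.cells = new AtomicLongArray(stripes * PAD);
    }

    public void increment() {
        // Each thread always updates the same stripe, so threads mapped to
        // different stripes never contend for the same cache line.
        int stripe = Math.floorMod(SLOT.get(), stripes);
        cells.incrementAndGet(stripe * PAD);
    }

    // Adds up every stripe. Like LongAdder.sum(), this is only exact once
    // all writers have finished.
    public long sum() {
        long total = 0;
        for (int i = 0; i < stripes; i++) {
            total += cells.get(i * PAD);
        }
        return total;
    }

    // Demo helper: `threads` workers each increment `perThread` times.
    static long countWith(int stripes, int threads, int perThread) {
        StripedCounter counter = new StripedCounter(stripes);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(countWith(8, 4, 100_000)); // 400000
    }
}
```

The trade-off mirrors LongAdder’s: increments stay cheap under contention because they are spread across cache lines, but reading the total costs a pass over every stripe, and the counter uses more memory than a single AtomicLong.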


Posted by Drew Stephens @dinomite

Drew’s expertise lies in high-traffic web services, big data, and building hyper-efficient software teams. At Clearspring, Drew created and supported APIs that handled more than 6 million daily requests, monitored detailed metrics from processing clusters with hundreds of nodes, and delivered thousands of events per second to users in real-time. Drew is skilled in systems administration and takes automation seriously, applying the Unix adage "if you do it twice, you're doing it wrong" ruthlessly. As a certified Scrum Master, Drew worked alongside Ryan to build a kaizen-based development organization at Genius.com capable of delivering high-quality products as-promised and on-time. Drew spends his time away from computers lifting heavy things and racing cars that are well past their prime.

About Palomino Labs

Palomino Labs unlocks the potential of software to change people and industries. Our team of experienced software developers, designers, and product strategists can help turn any idea into reality.

See the Palomino Labs website for more information, or send us an email and let's start talking about how we can work together.