Saturday, December 22, 2012

...Except Some Are More Equal Than Others

(Or NUMA, NUMA yay!)

Not all memory is equal. Some is faster to access that others. And how you access it is also important.

If you read through arrays of different lengths, you will probably get a performance profile that looks something like this:

Figure 1. Number of Java longs read per nanosecond vs. length of array (bytes)
Performance drops off around the 6MiB mark as that's the size of the L2 cache on my MacBook Pro (Intel Core 2 Duo). I know it's 6MiB by running:

phillip:~ henryp$ sysctl -a | grep cache

hw.cachelinesize = 64
hw.l1icachesize = 32768
hw.l1dcachesize = 32768
hw.l2cachesize = 6291456

It looks like what we're seeing is that as the L2 cache fills up with the array, performance drops.

A similar thing happens with the L1 cache which is 32KiB on this machine. (Note, it's the l1dcachesize that represents the memory cache. "L1d is the level 1 data cache, L1i the level 1 instruction cache" [1]).

Running the same code that generated the data in Figure 1 but with a much smaller range of array sizes, we get something like:

Figure 2. Number of Java longs read per nanosecond vs. length of array (bytes) 
Here we see performance dropping off when arrays are about 32k. So much we'd expect. But look at the left hand side of the scatter graph. It takes a while for performance to reach the optimal point. This is nothing to do with something like the HotSpot compiler kicking in as all results in this post were from the second run. (The first run's results were discarded as the performance at start up was orders of magnitude slower than anything that followed).

It appears that the reason it's being slow initially is that an optimization has yet to kick in. That is, "Sequential locality, [which is] a special case of spatial locality, occurs when data elements are arranged and accessed linearly, e.g., traversing the elements in a one-dimensional array." [3]

The Intel docs say:

"The Intel® Core™ i7  processor has a 4 component hardware prefetcher very similar to
that of the Core™ processors. Two components associated with the L2 CACHE and two
components associated with the L1 data cache.  The 2 components of L2 CACHE
hardware prefetcher are similar to those in the Pentium™ 4 and Core™ processors. There
is a “streaming” component that looks for multiple accesses in a local address window as
a trigger and an “adjacency” component that causes 2 lines to be fetched instead of one
with each triggering of the “streaming” component. The L1 data cache prefetcher is
similar to the L1 data cache prefetcher familiar from the Core™ processors. It has another
“streaming” component (which was usually disabled in the bios’ for the Core™
processors) and a “stride” or “IP” component that detected constant stride accesses at
individual instruction pointers. " [2]

It seems reasonable that this prefetching is responsible for the average time to read data from memory initially being low for smaller arrays. That is, memory needs to be accesses in a linear manner for some time before the CPU can assume that it will continue to be accessed in a similar manner. But more evidence will have to wait for my new (Linux) machine to arrive next week where I can there execute the commands to demonstrate what is happening to the machine's cache [4].

[1] What Every Programmer Should Know About Memory, Ulrich Drepper
[3] Wikipedia.
[4] Martin Thompson's blog.

Sunday, December 9, 2012

Queueing Theory and Practice

Being a Brit, I love queueing. However, I hate it when my software does it.

This week, I've been wondering why we're seeing a lot of contention in our application. It comes mainly via JAXP code as it makes calls to the class loader (see YourKit screen grab below).

Quite why the Sun XML classes keep calling loadClass on the Tomcat class loader for the same class is currently beyond me.

[Aside: set the jaxp.debug system variable to see greater visibility on what the JAXP code is doing. I saw this defined in javax.xml.xpath.XPathFactoryFinder. See if your Java implementation also uses this to set a debugging field.]

Anyway, when I thought I had identified the problem, it occurred to me that this contention was only such a major problem when I started profiling. Like quantum mechanics, I was affecting the results by taking measurements.

It seemed that the slowdown in performance caused by attaching monitors to the JVM was enough to push it over the edge - thread contention went crazy. A quick Google gave me the equations to describe a M/D/1 queue (that is, a queue where arrival is randoM, servicing the request is Deterministic and there is only 1 servicer. This seems to model our situation well in that any thread can make a call to loadClass but they should all be serviced in roughly the same time).

I used Mathematica to plot these queue equations. Here is the expected average wait time, g(p, u):

g[p_, u_] := p/(2 u (1 - p))
g::usage = "Expected average waiting time"

 Plot[{g[p, 1], g[p, u], g[p, 10]}, {p, 0, 1.0},
  ImageSize -> {600, 320},
  ImagePadding -> {{25, 10}, {25, 10}},
  Frame -> True,
  AxesLabel -> {"arrival/service rate", "expected average waiting time"}],
 {{u, 2, "Service Rate"}, 0.1, 100, Appearance -> "Labeled"},
 FrameLabel ->
  Style["M/D/1 case (random Arrival, Deterministic service, and one service \
channel)", Medium]

where p is

rate work arrives / rate work is serviced

and u is the rate work is serviced.

The graph looks like this (where u is 1, 10 and floating):

The important thing here is that it is not a linear graph. With a service rate of 2 for our variable graph (the purple line), we see that the number in the queue is about 1.0 when the ratio of work arriving to being processed is 80% but if it goes to 90%, the expected queue size goes to about 2.5.

At this usage profile, a 10% change means the average wait time more than doubles.

This could explain why the system became so badly behaved when experiencing only a moderate increase in load.

Saturday, December 8, 2012

The thundering herd

Last week, I wanted a blocking queue that had all the characteristics of a priority queue so I wrote my own. You can find it in the open source Baron Greenback library here. It decorates a PriorityQueue rather than extending it as I didn't want to violate the contract that PriorityQueue provides regarding being unbounded.

My first cut of the code used Conditions hanging off a ReentrantLock and called signalAll. This can be optimised by instead calling signal instead.

From Java Concurrency in Practise:

Using notifyAll when only one thread can make progress is inefficient - sometimes a little, sometimes grossly so. If ten threads are waiting on a condition queue, calling notifyAll causes each of them to wake up and contend for the lock; then most or all of them will go right back to sleep. This means a lot of context switches and a lot of contended lock acquisitions for each event that enables (maybe) a single thread to make progress. (p303)

The Condition class "factors out the Object monitor methods (wait, notify and notifyAll) into distinct object" and so are analogous to the notify and notifyAll methods.

While Goetz warns that conditional notification "is difficult to get right" (you may wake up a thread that sees it doesn't fit the condition and then goes back to being suspended without waking anybody else up) as a rule one should use notify/signal rather than notifyAll/signalAll on an object that is acting as a semaphore.

Addendum: there is an excellent explanation at StackOverflow here that demonstrates getting the choice between notify/notifyAll wrong can be very dangerous indeed. It uses the example of many consumers and producers popping and pushing a stack. The takeaway point is that notifyAll wakes all threads and all of them will contend for the lock and then check the condition sequentially. Whereas a call to notify wakes only one thread and that thread might be a producer who cannot push any more because the data structure is full and therefore waits - ergo, deadlock.

Wednesday, December 5, 2012

Persistent but not Persistent

When is a persistent object not necessarily persistent?

You might think that persistent means that the object can be stored in some durable data store. But there is another meaning [1].

This other meaning describes an optimal way of dealing with data structures. Instead of mutating these data structures, they are decorated with an object that represents the changed structure.

This is easier for some structures than others. For instance, in a linked list each element only knows about the element in front of it. So, if you wanted to add a new chain to the end of the list, you'd just create a new object that represented that chain and whose head pointed at the tail of the original list.

The important thing to remember is that the original list has not changed. So, there is no need to worry about mutexes or race conditions.

There are downsides. If you want to insert in the middle of a single linked list, you'd have to copy the elements in the old list that followed your insertion point into your new chain.

Also, this technique does not work with some data structures, eg double linked lists.

That said, the Java collection classes employ this technique to good effect. If you look at TreeMap.subSet(...) for instance you'll see a new set is returned that contains a subset of the original by wrapping it.

Note: not only do changes in one set affect the other, the original data structure stays in memory even if all other references to it disappear. We found this the hard way when we thought we'd discard elements of a large list but found we were still consuming large amounts of memory.

[1] Persistent Data Structures.