Designing for high performance

Here’s the thing about high performance: you can’t just bolt it on at the end. It’s got to be baked in from day one. No doubt those of you who are experienced developers are now invoking the venerable Donald Knuth, who once said, “Premature optimization is the root of all evil.” But look at it this way: with very rare exceptions, no amount of performance tuning will turn an average system into a world class competitor.

Of course, high performance is the entire raison d’être for ElectricAccelerator. We knew from the start that parallelism would be the primary means of achieving our performance goals (although it’s not the only trick we used). Thanks to Amdahl’s law, we know that in order to accelerate a build by 100x, the serialized portion cannot be more than 1% of the baseline time. Thus it’s critical that absolutely everything that can be parallelized, is parallelized. And I mean everything, even the stuff that you don’t normally think about, because anything that doesn’t get parallelized disproportionately saps our performance. Anything that isn’t parallelized is a bottleneck.

Here’s an example: command expansion. You know, the bit of code that turns something like this:

$(CC) $(addprefix -I,$(dir $(SRCS))) -o $@ $<

into something like this:

gcc -I. -Isubdir -o foo.o foo.c

There’s no way around this translation. It has to be performed, for every command invoked during the build. Even if the command doesn’t have anything that needs to be expanded.

What happens if the expansion itself is time-consuming? For the sake of demonstration, we can make command expansion slow by sticking $(shell sleep 5) into the command — because the sleep appears inside a $(shell) function call, gmake is obliged to execute the sleep as part of expanding the command. Here’s the makefile we’ll use:

all: a b c d e a b c d e: @$(shell sleep 5) echo $@

Now, if you run this with an ordinary, serialized gmake, you’ll see that it takes about 25 seconds to execute. No surprise there, with 5 jobs that each have a 5 second sleep:

ericm@chester$ time gmake a b c d e real 0m25.092s user 0m0.012s sys 0m0.012s ericm@chester$

Now, see what happens when we run this same makefile with a parallel gmake invocation: it still takes 25 seconds, even though we have specified more than enough parallel jobs that all five jobs should run simultaneously!

ericm@chester$ time gmake -j 8 a b c d e real 0m25.016s user 0m0.012s sys 0m0.012s ericm@chester$

What’s going on here? Let’s take a quick look at the core algorithm in gmake:

  1. Find the next target that’s is runnable (all prereqs up-to-date).
  2. Expand the commands for that target.
  3. Run the command for that target.
  4. If the number of currently running commands is equal to the job limit (1 for serial gmake, N for parallel gmake), wait for a command to finish.
  5. Repeat until finished.

Maybe you see the problem already: gmake is a single-threaded program. Even when you specify -j 8, there’s only one thread executing that core algorithm. Parallelism doesn’t really enter the picture until the end of the algorithm, where gmake decides whether it can go ahead and start working on another target without waiting for the previous one to finish.

Being single-threaded means that command expansion is implicitly serialized: gmake can only expand commands for a single target at a time. Too bad for you if that expansion takes any significant amount of time.

So, could we fix gmake, so that command expansions could be performed in parallel? Well, it’s really hard to take something that wasn’t designed to be high-performance from the start and transform it into something with world-class performance. In this case, gmake’s heritage as a single-threaded application permeates every aspect of its implementation. For example, command expansion is performed using a single global buffer (see expand.c in the gmake sources). While it’s certainly possible to refactor this code, it would be non-trivial to do so.

Command expansion with ElectricAccelerator

In contrast, Accelerator was designed to be multi-threaded from the start, and we have made a deliberate effort to parallelize absolute everything we can, including even the behind-the-scenes stuff that you probably never thought about before reading this blog. Like command expansion. And sure enough, if you try this makefile with Accelerator, the total run time is about 5 seconds:

ericm@chester$ time emake Starting build: 12705 a b c d e Finished build: 12705 Duration: 0:05 (m:s) Cluster availability: 100% real 0m5.523s user 0m0.008s sys 0m0.016s ericm@chester$

Designing for high performance

When truly world-class performance is your goal, you can’t wait until implementation is complete to start thinking about performance. That goal has to inform your decisions at all stages of development, from design to implementation. With Accelerator, our obsessive focus on performance has resulted in an architecture that allows us to parallelize parts of the build process that other tools simply cannot.

One Response to “Designing for high performance”

  1. Shell commands in GNU make « blog.melski.net Says:

    […] The problem with this is that it moves the work into the command expansion phase, which means it can’t run in parallel with gmake. The fix for this one is to just drop the $(shell) […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: