Automating around scarcity by using virtual resources

[posted on behalf of Usman Muzaffar, who is on a long flight with no WiFi]

Here’s a sobering truth that shows up often in software automation: people are way better at sharing stuff than computers are. For example: say you have a scarce resource, like a box with special hardware or a service with serial access. You’re tasked with automating a software build/test/release workflow, and part of it needs to talk to this One Big And Fancy Thing. Do you try to teach your build script good playground behavior, so it automatically knows when to wait politely (and when, as deadlines approach, it should bully its way to the top of the slide), or do you declare this problem out-of-scope, and just provide the hook to let the team manage access manually?

The default on that checkbox is: *don’t automate*, for two reasons. First: letting people handle it means no extra work. More importantly, because we’ve been doing it our whole lives, we’re actually pretty good at adapting to environments where we have to share things, whether that’s roads or restrooms or rack space. A small number of people on the same team with similar goals will usually self-organize around a few ground rules with a minimum of fuss. One clear and crisply delivered directive at a weekly team meeting (“OK guys, the new 32-way sol box is for the full server test suite, so give that priority and check with each other before you use it for other stuff”) is often all it takes.

Second, technically getting the semantics of shared simultaneous access right is a notorious pain in the neck. As in any software automation system, there’s no credit for a partial answer: it’s a net loss if your script still needs a babysitter for the corner cases. So that means your solution needs to take selection and queuing and load into account, and have mechanisms for priority and pre-emption and be smart about busted network connections. More fundamentally, at its core it usually boils down to something awfully close to multithreaded programming, with the usual challenges in that space around semaphores, locks, deadlocks, races. Great stuff in a CS course or maybe your server’s ConnectionPool class — rathole alert in your build and test system!

So, largely with good reason, the automation train comes to a screeching halt right here. It’s just not worth the effort to build a system that’s going to manage the synchronization for parallel access to scarce resources. In other words: when shared resources wind up in the software production system, people show up next to them, and that sucks all the fun (and potential efficiency gains) out of automation. What to do?

One thing worth investigating is tooling that can handle this for you. Solving this was a key goal for our ElectricCommander product. Commander lets you describe your job as a series of command line steps, and each step can be specified to run on a resource. A resource is simply a system that we’ll remotely execute commands on, and it comes with a sack full of the infrastructure goodies you’d expect, like pooling, exclusive reservation, broadcast, security, access control, load balancing, and fault tolerance. As a user of the system, you specify what you want to run and where you want to run it, press the ‘Go’ button, and Commander does the rest, queuing steps when resources are oversubscribed and efficiently scheduling around your other constraints. Nice!

Then one day a customer asked us how they could automatically control access to a piece of hardware that simulated network traffic critical to the product’s system test. This wasn’t a gadget we could install software on; indeed, we couldn’t directly connect to it at all, so Commander couldn’t treat it as a resource. But it soon became evident that we could solve this just as elegantly with a simple tweak to the approach. Fundamentally, we needed the ability to specify that a step 1) needed access, 2) must block when access wasn’t available, and 3) once it acquired access, must hang on to it until it was done. If something could just take care of this synchronization and queuing, the test could connect to the traffic simulator directly and simply execute as if invoked manually.

In other words: the problem called for a *subset* of Commander resources; ignore half the stuff in the goodie sack (remote login, execution, fault tolerance, etc.) and you’re left with a general purpose resource access and acquisition facility. We set up dummy resources (good old 127.0.0.1, always up and ready for this sort of game!), injected them into the workflow and configured the job to hang on to them as long as it was talking to the traffic simulator. It worked beautifully: each test run was guaranteed to get just the access it needed, and for the first time, the customer had safe, parallel end-to-end automation for the full test cycle.
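To make the idea concrete, here is a minimal sketch in Python of the acquire, block, and hold behavior described above. This is not Commander's API or how Commander implements resources; `VirtualResourcePool`, `define`, `hold`, and `run_demo` are hypothetical names invented for illustration, and the whole thing lives in a single process, whereas a real build system would have to coordinate across machines. The point is only that the "resource" is a named token with a capacity, and the thing being protected is never touched by the pool itself.

```python
import threading
import time
from contextlib import contextmanager


class VirtualResourcePool:
    """Hypothetical in-process stand-in for a pool of virtual resources.

    Each resource is just a name plus a capacity. Nothing is ever executed
    "on" it; jobs only acquire and release it.
    """

    def __init__(self):
        self._slots = {}
        self._guard = threading.Lock()

    def define(self, name, capacity=1):
        # Register a named token that at most `capacity` jobs may hold at once.
        with self._guard:
            self._slots[name] = threading.BoundedSemaphore(capacity)

    @contextmanager
    def hold(self, name):
        slot = self._slots[name]
        slot.acquire()          # 1) request access, 2) block while unavailable
        try:
            yield               # 3) hang on to it for the whole step
        finally:
            slot.release()      # relinquish, even if the step fails


def run_demo(jobs=4):
    """Run several fake test jobs in parallel; return the peak number of
    jobs that were talking to the simulator at the same moment."""
    pool = VirtualResourcePool()
    pool.define("traffic-simulator", capacity=1)

    active = 0
    peak = 0
    counter_lock = threading.Lock()

    def system_test(job_id):
        nonlocal active, peak
        with pool.hold("traffic-simulator"):
            with counter_lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)    # stand-in for talking to the real hardware
            with counter_lock:
                active -= 1

    threads = [threading.Thread(target=system_test, args=(i,)) for i in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print("peak concurrent users of the simulator:", run_demo())
```

With a capacity of 1, the jobs start in parallel but take turns on the simulator, and each one connects to it directly once it holds the token. Raising the capacity would model a shared thing that tolerates a fixed number of simultaneous clients.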

More importantly, this design pattern, since dubbed Virtual Resources, opened a whole new realm of possibilities. Once you start looking for them, there are *lots* of shared things in a software system that aren’t compute hosts, and they’re all threatening or overcomplicating automation in some way or another. We’ve used Virtual Resources to manage access to database tables, SCM labels, virtual machines, filesystem repositories, and flaky external systems that don’t like more than one client talking to them, and our customers keep showing us new ways. It’s a great example of how the core of a clean design (a resource is something a job can request and relinquish) was readily adapted to a wider set of problems around Software Production Automation.