Recently, Amazon took a lot of heat for releasing an iOS Kindle update that deregistered users’ accounts and deleted their entire ebook collections, especially once it had to warn users not to download the update. PandoDaily reporter Richard Nieva blamed a “frenzied, always-be-shipping culture,” in which software releases are “expected fast and furiously.”

But I disagree. I don’t think Amazon and fast and furious release cycles are to blame. I believe fault lies with Apple, and with its arcane App Store approval process.

I’m a proponent of a software development philosophy called “Continuous Delivery,” in which developers set up a process to push new code to customers very quickly and very frequently. For example, my company, CircleCI, pushes new code to customers almost ten times each day. But rather than being a callous and dangerous process, Continuous Delivery is part of a movement to create better software with fewer bugs that affect fewer customers. It is used heavily by Google, Facebook, Salesforce, and of course Amazon.

First, some background. Software development is a house made out of straw, and the techniques and collective experience of the industry, for all its high-tech gloss, are actually pretty primitive. Before a piece of software makes it to users, there is no way to guarantee that it will be bug-free. Individual applications are often made of millions of lines of code, with each line a potential source of problems. Many a PhD, including my own, has been spent working on techniques to formally analyse and improve software projects, but after 60 years we still have no practical languages or tools that make it possible to ship perfect products.

What we have, though, is collected wisdom about the process of shipping software. Two schools of thought, the “Lean” and “Agile” movements, center on the idea of using Continuous Delivery to reduce bugs. The idea is simple: instead of shipping code less frequently, we ship it more frequently. Instead of a release every year, we make one every day.

To the untrained, this seems like madness. Bugs are absolutely inevitable, so why propose a buggy release every day instead of one every year? The answer is that by shipping more frequently, we can control the risk of each release very effectively, contain the damage that a single release can do, and have a process that moves quickly to fix bugs and reverse any wrongs caused.

Not every release carries the same level of risk. If I make two changes to my software today and release them, the risks are straightforward and the changes are simple to revert. If, instead, I make a thousand changes over the next 18 months and release them to a million customers at the same time, breakage is utterly inevitable.
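A back-of-the-envelope calculation makes the difference concrete. The sketch below assumes, purely for illustration, that every change independently has a 1% chance of introducing a bug; both the independence and the 1% figure are assumptions, not measurements from any real project.

```python
def release_bug_probability(changes: int, per_change_bug_rate: float) -> float:
    """Chance that at least one change in a release carries a bug.

    Assumes each change is buggy independently with the same probability,
    which is a simplification made for illustration only.
    """
    return 1.0 - (1.0 - per_change_bug_rate) ** changes


# Hypothetical per-change bug rate of 1%.
small_release = release_bug_probability(changes=2, per_change_bug_rate=0.01)
big_release = release_bug_probability(changes=1000, per_change_bug_rate=0.01)

print(f"2 changes:    {small_release:.2%} chance of at least one bug")
print(f"1000 changes: {big_release:.2%} chance of at least one bug")
```

Under these assumptions the two-change release is almost always clean, while the thousand-change release almost never is. And when the small release does break, there are only two suspects to check.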

With small changes, it is easy to see how customers are affected and to track what a single change (or a small set of changes) does to a piece of software. Mozilla Firefox is a case in point.

Mozilla shipped Firefox 4 after an 18-month development cycle in which it struggled to keep bugs under control. After that release, Mozilla switched to a cycle that ships a new version every six weeks. The first of these, Firefox 5, arrived three months after Firefox 4 and had basically no exciting features. It was stable, though, and introduced few bugs.

Since then, Firefox has been released 14 more times, so often that it dropped the version number. Instead of massive changes like those in Firefox 4, Firefoxen 5 through 19 included incremental changes, allowing Firefox to improve at a slow, methodical pace, with bugs that hurt far fewer users in far less unsettling ways.

And as part of Mozilla’s release cycle, small numbers of alpha testers act as guinea pigs. That means a new feature is not just tested by its development team; it’s put through its paces by the first tens of thousands of users, then millions, and only then graduates to the full 500 million users.
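A staged rollout of this kind can be sketched in a few lines. This is a minimal illustration, not Mozilla’s actual tooling: the stage names, audience sizes, and the bug-rate threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Stage:
    name: str
    audience: int  # how many users receive the build at this stage


# Hypothetical stage sizes, loosely following "tens of thousands,
# then millions, then everyone".
STAGES = [
    Stage("alpha", 50_000),
    Stage("beta", 5_000_000),
    Stage("release", 500_000_000),
]


def roll_out(observed_bug_rates: Dict[str, float], max_bug_rate: float = 0.001) -> str:
    """Ship a build stage by stage and return the last stage it reached.

    The build is promoted to the next, larger audience only if the bug rate
    observed at the current stage stays at or below max_bug_rate.
    """
    reached = "none"
    for stage in STAGES:
        reached = stage.name
        rate = observed_bug_rates.get(stage.name, 0.0)
        if rate > max_bug_rate:
            break  # stop here: the larger audiences never see this build
    return reached


# A build that misbehaves in alpha never gets past the smallest audience.
print(roll_out({"alpha": 0.05}))
# A clean build is promoted all the way to general release.
print(roll_out({"alpha": 0.0, "beta": 0.0002, "release": 0.0}))
```

The point of the structure is that a bad build is stopped while its audience is still small.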

Limiting the number of users who are affected by a single push is a key part of Continuous Delivery. Facebook also has a sophisticated process to limit the effects of bad code. Its “feature flags” allow it to turn features on and off for precise groups of users, such as urban women who are engaged, aged 25 to 35, and live in Northern California. This allows Facebook to use Continuous Delivery to incrementally improve the site for its nearly one billion users.
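In code, a feature flag is little more than a named rule that decides which users see a feature. The sketch below is a generic illustration and not Facebook’s actual system; the `User` fields, the flag names, and the `is_enabled` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class User:
    user_id: int
    age: int
    region: str


# Each flag name maps to a rule that decides which users see the feature.
# Flag names and targeting rules are made up for illustration.
FLAGS: Dict[str, Callable[[User], bool]] = {
    "new_photo_viewer": lambda u: 25 <= u.age <= 35
    and u.region == "Northern California",
    "experimental_chat": lambda u: False,  # shipped in the code, off for everyone
}


def is_enabled(flag: str, user: User) -> bool:
    """Return True if the flag exists and its rule matches this user."""
    rule = FLAGS.get(flag)
    return rule is not None and rule(user)


targeted = User(user_id=1, age=30, region="Northern California")
other = User(user_id=2, age=40, region="Texas")

print(is_enabled("new_photo_viewer", targeted))  # inside the targeted group
print(is_enabled("new_photo_viewer", other))     # outside the targeted group
```

Because the new code ships dark and is switched on by a rule, turning a misbehaving feature off again is a configuration change, not a new release.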

So back to Amazon. I believe that the Kindle for iOS glitch was caused not by Amazon’s negligence but by Apple and its antiquated App Store, which relies on an approvals process that prevents Continuous Delivery. This ensures that changes are batched together and hit all potential users at once.

By requiring a long, unpredictable approval process, Apple actually incentivizes buggy releases. That’s because dozens of different features and bug fixes all ship at once, to all users, at an unpredictable time once the new version is approved. Apple’s release process actively prevents planning when a new feature will arrive, and how many users will be affected by an errant line of code.

To protect its users, Apple should revamp its approval process in a number of ways. First, it should allow users to opt in to receiving alpha and beta versions of app releases, creating a small population that weeds out the worst bugs before they hit everyone else. (It might seem that having this population is a bad sign, but all users are de facto alpha testers now.) All releases should begin as alpha, then be promoted to beta, then finally be promoted to general release.

Apple should encourage developers to ship releases more frequently by prioritizing review of small and tiny changes that are much lower risk than larger releases. And it should auto-update apps without user intervention, so users are not overwhelmed by more frequent updates.

Amazon is already one of the biggest users of Continuous Delivery: it deploys new code every 11.6 seconds. At the same time, Amazon is down so infrequently that its last outage made the headlines. By relying on Continuous Delivery, a software engineering best practice, Amazon reduces the risk to its users and greatly improves the reliability of its entire platform.

As for the recent Kindle glitch, don’t blame Amazon. Blame Apple.