The Space Race That Wasn't: Boeing vs. SpaceX
Fixed-Price Contracts and Interminable Delays are a Bad Combination
It seems a little cruel to beat up on Boeing these days. But it’s fun and informative, so let’s do it.
The company’s Starliner spacecraft was supposed to have its much-anticipated and -ballyhooed first crewed flight on May 6, but the launch got delayed because an oxygen relief valve was “buzzing,” or opening and closing rapidly, during the countdown. While we’re waiting for them to replace it, let’s review Starliner’s history to this point. Let’s also see what’s been going on over at SpaceX with its Crew Dragon spacecraft. It’s an apples-to-apples comparison, and one that shows us how much difference there is between today’s upstart geek companies and the Industrial Era incumbents (IEIs) they’re competing with. And, not coincidentally, beating.
Here’s a timeline of some of the relevant events:
If I were a NASA administrator or Boeing shareholder, I’d have some questions about the company’s progress here vis-à-vis SpaceX. Heck, I’m neither of these and I still have questions. The difference between successfully executing 13 crewed space missions in under a decade post-contract and successfully launching none is a real difference. What accounts for it?
This IEI vs geek competition kicked off in September of 2014, when NASA awarded contracts to both SpaceX and Boeing to build a new crewed spacecraft and manage its launches. At that time, America didn’t have any spacecraft capable of carrying humans into space. We mothballed the Space Shuttle fleet in 2011, and had since then been relying on the Russians to get our people into space. This was not a great situation for all kinds of reasons, so NASA launched the Commercial Crew program to remedy it. Under Commercial Crew, private companies would for the first time be responsible for designing, building, and launching spacecraft under contract to NASA. And these were fixed-price contracts (the kind most of us are used to), not cost-plus ones (the kind defense contractors got used to, and the kind where it’s almost impossible to lose money).
Geeks in Space
SpaceX had an advantage here because it wasn’t building a brand-new capsule to transport humans up to space. It was instead upgrading its Dragon capsule, which first flew in 2010 and since 2012 had been getting cargo up to the International Space Station. So SpaceX just had to do an orbital #Vanlife conversion of an existing spacecraft (which we should hope for the sake of its astronauts went better than most terrestrial examples).
But Boeing also had a couple of advantages. For one thing, it got 60% more money than SpaceX to build its capsule. For another, it had a lot of experience with human spaceflight, while SpaceX had none. Along with Lockheed, Boeing was part of the United Space Alliance, which since 1996 had managed first the Space Shuttle program and then the American portion of the ISS. After the Alliance dissolved, Boeing continued as the prime US contractor on the ISS.
SpaceX got to work in its agile way, with rapid cycles and fast feedback. As I wrote here earlier, it also developed a mania for taking extraneous things out — both physical parts and process steps — and assigning specific responsibility for every requirement.
The launchpad mission abort test for Dragon took place in May of 2015. SpaceX then got to work in earnest on the cargo craft’s #Vanlife conversion. An uncrewed test of a crew-ready version of the Dragon capsule took place in March 2019 with the Demo-1 mission. However, things got delayed when the capsule used for Demo-1 blew up during subsequent testing on the ground. A new capsule was used for Demo-2, a crewed test flight in May 2020. Demo-2 went swimmingly, and in November of that year SpaceX started Crew Dragon missions. So far, it’s done 12 of them: eight for NASA and four for private concerns. It’s well-paid work; SpaceX makes an estimated $300 million or so for each crewed NASA mission, which is why the icons for these missions are green in the graphic above.
All in all, SpaceX’s success with Crew Dragon is a tidy example of the iterative, fast-cycle approach to getting things done that’s core to the geek way.
Failingwater
Boeing, meanwhile, approached the work of building a crewed capsule in the style we’ve come to expect from IEIs. This style takes its name from a now-iconic flowchart found in a 1970 paper by old-school geek Winston Royce about managing the development of large software systems, called “Managing the Development of Large Software Systems”:
To some, the above looked like a picture of water cascading down a series of pools, so the project management approach it depicted became known as the “waterfall” method.
The crucial thing to understand about the waterfall method is that it’s bad. Bad in the sense that it delivers lousy results: chronic delays and cost overruns, unpleasant late surprises, and related woes. I spend a lot of time in my book The Geek Way (much of chapter six, in fact) explaining why this is, and why the highly iterative agile approach favored by SpaceX and other geek companies does much better.
Two of the biggest problems with waterfall are that it leaves testing — contact with reality, in other words — until late in the process, and that its lack of iteration and rich communication inhibits the creation and sharing of new knowledge. Clay Shirky’s summary is brutal, but not wrong: “The waterfall method amounts to a pledge by all parties not to learn anything while doing the actual work.”
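To make the late-testing problem concrete, here’s a toy model. It’s my own sketch with made-up numbers, not a description of how Boeing or SpaceX actually works. It assumes a project of ten stages, a flaw that slips in at three of them, and a rule that once a flaw is caught, everything built from the flawed stage up to the test has to be redone. The function name and the stage numbers are hypothetical.

```python
def rework_units(defect_stages, test_points):
    """Count stages of work that must be redone.

    Toy assumption: a flaw introduced at stage s is caught at the first
    test point at or after s, and every stage from s through that test
    point has to be redone.
    """
    test_points = sorted(test_points)
    rework = 0
    for s in defect_stages:
        caught_at = next(t for t in test_points if t >= s)
        rework += caught_at - s + 1
    return rework


TOTAL_STAGES = 10
defects = [2, 5, 9]  # hypothetical stages where a flaw slips in

# Waterfall-style: one big test at the very end.
waterfall = rework_units(defects, [TOTAL_STAGES])

# Iterative-style: a test after every stage.
iterative = rework_units(defects, range(1, TOTAL_STAGES + 1))

print(f"waterfall rework: {waterfall} stages")
print(f"iterative rework: {iterative} stages")
```

With these invented inputs, the single end-of-project test means redoing 17 stages’ worth of work on a 10-stage project, while testing after every stage means redoing 3. The exact numbers mean nothing; the point is that the same three flaws cost far more the longer they go without contact with reality.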
We get strong evidence that Boeing used the waterfall method from an Ars Technica story published on May 6, the day when it looked like a crewed Starliner would at last be launched. Author Eric Berger described the errors that doomed the first uncrewed Starliner orbital test flight, thus necessitating a second one. Here’s a testing snafu:
This uncrewed flight test faced problems almost immediately after liftoff. Due to a software error, the spacecraft captured the wrong "mission elapsed time"… This led to a delayed push to reach orbit and caused the vehicle's thrusters to expend too much fuel. As a result, Starliner did not dock with the International Space Station.
The second error, caught and fixed just a few hours before the vehicle returned to Earth through the atmosphere, was a software mapping error that would have caused thrusters on Starliner's service module to fire incorrectly. This could have caused Starliner's service module and crew capsule to collide. Senior NASA officials would later declare the mission a "high visibility close call," or very nearly a catastrophic failure.
A couple of months after the flight, John Mulholland, a vice president who managed the company's commercial crew program, met with reporters to explain what happened. He acknowledged that the company did not run integrated, end-to-end tests for the whole mission.
But surely Boeing did a lot of other kinds of software testing throughout the project?[1] Surely yes, but it doesn’t appear to have been super high-fidelity testing, the kind that provides a true test of the system. Instead, it looks like at least some of the Starliner’s testing was done to check a box and pick up a check. Berger:
In a fixed-price contract, a company gets paid when it achieves certain milestones. Complete a software review? Earn a payment. Prove to NASA that you've built a spacecraft component you said you would? Earn a payment. This kind of contract structure naturally incentivized managers to reach milestones.
The problem is that while a company might do something that unlocks a payment, the underlying work may not actually be complete. It's a bit like students copying homework assignments throughout the semester. They get good grades but haven't done all of the learning necessary to understand the material. This is only discovered during a final exam in class. Essentially, then, Boeing kept carrying technical debt forward so that additional work was lumped onto the final milestones.
So testing on the Starliner program was pretty bad. What about iteration, communication between parties, and learning? Also bad. Berger:
[A propulsion system] anomaly [that led to a huge fireball during a ground test] was caused, at least in part, by poor communication between [subcontractor] Rocketdyne and Boeing.
"Boeing and Rocketdyne more or less hated one another," one person involved in the test told Ars. "Everyone was in super-defensive mode even before this happened. It had been classified as a risk, but the two sides weren’t talking openly and honestly about it."
…the Boeing and Rocketdyne teams were effectively walled off from one another and did not iterate together toward a more effective propulsion system.
It’s bizarre to me that the waterfall approach (whether de facto or de jure) has been so popular for more than fifty years, especially since the picture that gave it its name was accompanied by a warning about how it could — nay, would — go wrong. In his 1970 paper Royce puts the following text just below his waterfall picture (emphasis added):
The implementation described above is risky and invites failure. The problem is… [that] the testing phase which occurs at the end of the development cycle is the first event for which [important phenomena] are experienced as distinguished from analyzed… Yet if these phenomena fail to satisfy the various external constraints, then invariably a major redesign is required… The required design changes are likely to be so disruptive that… either the requirements must be modified, or a substantial change in the design is required. In effect the development process has returned to the origin and one can expect up to a 100-percent overrun in schedule and/or costs.
The biggest and most consistent difference I’ve noticed between IEIs and the upstart geeks that are beating their socks off in sector after sector is the latter’s repudiation of just about all things waterfall. The Starliner vs. Dragon chart at the top of this post helps me understand why this is. If the IEIs don’t fundamentally change how they manage large, complicated efforts, I don’t see how they’re going to match the geeks’ cadence of innovation and execution.
As I wrote in TGW:
Companies that grew up during the industrial era have to throw away that era’s playbook if they want to stand a chance when the geeks come to town. The idea that companies following the old playbook can fight back effectively against the geek way by doing a major reorganization, embracing a bold new strategy, or shuffling the leadership is laughable. Incumbents did all of these things over the past twenty years; they didn’t halt the disruption, or even slow it down much. The industrial-era playbook yields companies that move too slowly, are wrong too often, miss too many important developments, don’t learn and improve quickly enough, and fail to give their people the autonomy, empowerment, purpose, and voice that they want and deserve. These are insurmountable handicaps once competitors start adopting the geek way.
[1] It’s very difficult for American men of my generation to read that sentence without mumbling to ourselves “Don’t call me Shirley.”