Rocket Lab’s Investigation of the Electron Incident
On July 31, 2020, Rocket Lab announced that the Federal Aviation Administration (FAA) had approved the company to resume launches of its Electron rockets. This announcement came at the conclusion of an investigation into an "in-flight failure" on July 4. The failure was attributed to "a single anomalous electrical connection."
The company's CEO, Peter Beck, posted his own summary of the investigation and its findings to Twitter. As the official Rocket Lab announcement details, the company's Accident Investigation Board (AIB) performed a fault tree analysis and narrowed the possible causes down to the electrical connection. The announcement goes on to say:
- This connection was intermittently secure through flight, creating increasing resistance that caused heating and thermal expansion in the electrical component. This caused the surrounding potting compounds to liquefy, leading to the disconnection of the electrical system and subsequent engine shutdown. The issue evaded pre-flight detection as the electrical connection remained secure during standard environmental acceptance testing including vibration, thermal vacuum, and thermal cycle tests.
Beck and the company deem this to be an aberrant event, to be remediated with extra testing and other unspecified processes.
Anyone with knowledge of the history of US spaceflight probably raised their eyebrows there. The failure itself is not the surprise: rockets are known to fail, and in fact are bound to fail; Beck acknowledged as much in a press conference on July 5. What is more concerning is the style of investigation performed, namely one that relies solely on a linearized model of the event. As Sidney Dekker lays out in his book, Drift into Failure, recourse to such a simple-minded approach alone may undermine future launches, with Rocket Lab none the wiser as to the systemic forces at work.
In his book, Dekker analyzes other spacecraft accidents: NASA's Challenger and Columbia accidents. He contrasts what he calls the "Newtonian-Cartesian" mechanistic explanation of the cause of the latter (foam striking the shuttle's left wing, producing a chain reaction that led the orbiter to break apart during re-entry, killing the crew) with a more complex story. To craft that story, Dekker draws on the work of Diane Vaughan. Vaughan wrote a 1996 analysis of the forces in NASA that produced the Challenger accident, and subsequently wrote a 2005 review of the Columbia event. In her work, she describes what she calls a "normalization of deviance," in which actors within the system incrementally change their understanding of their work over time. The workers performed their work each day and continually modified their understanding of "normal" results and of what constitutes acceptable operations. This deviation over time effected a subtle redefinition that eventually moved the foam strikes they observed during testing from a 'safety' issue to a 'maintenance' issue, with attendant shifts in understanding of its risk to the actual shuttle flight.
Dekker uses those texts in Drift into Failure to emphasize the "cultural, political, organizational (or any other 'social') factors" that make NASA a socio-technical system, and therefore a complex one. Those 'social factors' include elements like production timelines, budget changes, organizational policy, legal codes, and, of course, safety. But safety was only one factor, one goal, amongst many that those working within NASA considered at any given time. The dynamic interplay of these factors over time produced the deviance Vaughan describes, and contributed to what Dekker calls "drift." Sensemaking, he says, is continually produced and reproduced over time in the course of normal work. As the various pressures buffet the workers and nothing seems to 'go wrong' at any given moment, the system may "drift" towards the boundaries of acceptable performance; performance rapidly deteriorates upon crossing those boundaries.
The statement from Rocket Lab hints at those social factors in their organization, for example, when it alludes to "standard environmental acceptance testing" and to creating new processes to address the broken component. But we don't know what that standard testing is, or how those norms may have "drifted" over time due to competing pressures. How might the complex pressures of deadlines, budgets, and more have set up Rocket Lab for this incident? And what can they do to try to address this going forward? We don't know, because they seem to have relied only on a merely complicated fault tree model to assess the situation. That model doesn't take complexity and drift into consideration.
It's important to note, at this point, that Rocket Lab's AIB isn't strictly speaking wrong; or at least, I'm not in any position to say that their fault tree analysis is wrong. At the time of writing, I've not been able to find any significant details about the process beyond their description and subsequent news coverage. As I'm not able to review their work, I can't really call it into question and say that the electrical connection isn't a valid, temporally proximate cause of the incident. What I can do, however, is point out that the fault tree's linearized, mechanistic way of viewing the event isn't the only way to view it. If one takes another point of view, other avenues of investigation appear that may augment their approach: the AIB may consider Scott E. Page's call to employ many models in their investigation. Or, if they want to move beyond a merely complicated technical investigation, they can explore the socio-technical complexity of their organization.
Regardless, Rocket Lab plans to resume launches in August 2020.
Nick Travaglini is a member of the Liberal Studies program at the New School for Social Research and member of NSPDOS. The opinions expressed here are his own.