The Safety of Work

Ep. 119: Should we ask about contributors rather than causes?

Episode Summary

Listen in as we explore the ever-evolving field of accident investigation research, where the shift from pinpointing causes to understanding contributors is gaining traction.

Episode Notes

Today’s paper, “Multiple Systemic Contributors versus Root Cause: Learning from a NASA Near Miss” by Katherine E. Walker et al., examines an incident in which a NASA astronaut nearly drowned (asphyxiated) during an Extravehicular Activity (EVA 23) on the International Space Station due to a water leak inside his spacesuit. The paper introduces us to an innovative and efficient technique developed during Walker’s PhD research.

In this discussion, we reflect on the foundational elements of safety science and how organizations are tirelessly working to unearth better methods for analyzing and learning from safety incidents. We unpack the intricate findings of the investigation committee and discuss how root cause analysis can sometimes lead to the unintended consequence of adding more pressure within a system. A holistic understanding of how systems and individuals manage and adapt to these pressures may provide more meaningful insights for preventing future issues.

Wrapping up, our conversation turns to the merits of the SCAD technique, which champions the analysis of accidents as extensions of normal work. By examining the systemic organizational pressures that shape everyday work adaptations, we can better comprehend how deviations due to constant pressures may lead to incidents. We also critique current accident analysis techniques and emphasize the importance of design improvement recommendations. 

Discussion Points:


Quotes:

“We've been doing formal investigations of accidents since the late 1700s early 1800s. Everyone, if they don't do anything else for safety, still gets involved in investigating if there's an incident that happens.” - Drew

“If you didn't have this emphasis on maximising crew time, they would have been much more cautious about EVA 23.” - Drew

“Saying that there's work pressure is not actually an explanation for accidents, because work pressure is normal, work pressure always exists.” - Drew

“One of the things that is absent from this technique (and they call it an accident analysis method) is that there is no commentary in the paper at all about how to design improvements and recommendations.” - David


Resources:

The Paper: NASA Near Miss

The Safety of Work Podcast

The Safety of Work on LinkedIn

feedback@safetyofwork.com

Episode Transcription

David: You are listening to the Safety of Work podcast episode 119. Today we’re asking the question, should we ask about contributors rather than causes? Let’s get started. 

Hey everybody. My name’s David Provan. I’m here with Drew Rae, and we’re from the Safety Science Innovation Lab at Griffith University in Australia. Welcome to the Safety of Work podcast. 

In each episode, we ask an important question in relation to the safety of work or the work of safety, and we examine the evidence surrounding it. We’ve recorded (I think) more than 10 episodes on accident investigation research. However, it remains a really interesting topic in safety science. 

In some ways, it has long been somewhat central to our understanding of safety and to driving improvements. Many organizations today are still looking for better ways to investigate, learn, and improve in response to safety incidents.

Often, central to this discussion is the actual incident investigation or incident analysis tool itself. Is there a best tool, or at least a good tool, that we can use to understand and improve following safety incidents?

Drew, on the podcast a couple of times, you’ve talked about the state of accident investigation research within safety science. Do you want to talk a little bit about the state of that research and maybe research that looks to come up with new analysis techniques?

Drew: Sure. As you indicated, David, accident investigation is probably one of the biggest topics in safety science. I think arguably it’s the most universal and oldest safety activity that organizations do. We have been doing formal investigations of accidents since the late 1700s–early 1800s. Everyone, if they don’t do anything else for safety, still gets involved in investigating if there’s an incident that happens. 

Inventing new ways of doing it or ways of improving it is obviously going to be something that researchers are interested in. The easy way to make a big splash as a safety researcher, or at least what seems tempting as a way to do that, is to come up with a new explanation for accidents, or a new model or a new method, and have people adopt it.

I actually think it probably works in reverse: people become famous first, and then every famous person feels obliged to invent their own accident model or method. So if you think of any big name in safety, they’ve probably at some stage in their career got some method named after them.

James Reason (I think) actually has three. He’s famous for the Swiss cheese model, but he also came up with this thing called vulnerable system syndrome. Hollnagel got his Functional Resonance Analysis Method (FRAM), Leveson got STPA. It’s a fairly common thing to do.

All of the models and methods have pretty much the same goal, which is we want to get broad learning out of a specific series of events. But it’s actually really hard to come up with a new method because you don’t really have brand new accidents you can test your new method on. 

What people do—and this paper’s going to be an example of it—is they take something that’s already been investigated, and they say here’s how I would’ve done it instead. But it’s a little bit hard because you can’t actually go and do new data collection, new evidence gathering to really test and prove your new method. You’re stuck with the facts that come from the original investigation.

David: That’s a good point, Drew. Maybe we need to think about designing our own accident analysis tool then if that seems to be a rite of passage.

Drew: Yeah, but if we’re going to be robust about it, we’ve also got to cause our own accidents, so we’ve got fresh new data to investigate using our new method.

David: All right, leave it with me. You mentioned a few of those models there from some of the more contemporary safety scientists, like FRAM and STAMP, and Accimap (I guess) was a little bit earlier.

I think one of the criticisms, when accident analysis techniques move from the linear, root cause style of models that we’ll talk about to more complex adaptive systems styles of models, is that the techniques often become larger, more complex, more time-consuming, and more resource-intensive, and need more training for people to be able to run the process and use it as designed.

I guess the paper we’re going to talk about today proposes a new accident analysis method that the authors claim to be very resource-efficient, even less resource-intensive than a normal root cause investigation process.

Drew, do you want me to introduce the paper?

Drew: I think you know all of the authors, so you are probably going to do a better job at introducing them than I will.

David: Well, a little bit. We’ve had some research from the Ohio State University and Professor David Woods’ lab on the podcast before. The authors of this paper are Katherine Walker, David Woods, and Mike Rayo. Mike Rayo’s the current professor and director of the lab, the Cognitive Systems Engineering Lab at Ohio State University in the US.

Katie, or Katherine, is currently a cognitive systems engineer at Mile Two, a software development company, so I guess she’s putting this into practice, and she completed her PhD in 2021. The research that we’ll talk about today was a core part of her PhD, which looked at systemic contributors to accidents. Part of her PhD involved developing this novel accident analysis technique and applying it in certain domains.

The title of the paper is Multiple Systemic Contributors versus Root Cause: Learning from a NASA Near Miss. It was published in the Proceedings of the Human Factors and Ergonomics Society (HFES) Annual Meeting in 2016.

Drew, I first came across this analysis technique in 2017, and again in a bit more detail in 2019, talking with Dave and Mike at the biennial Resilience Engineering Association conference. I know that by 2019, this analysis technique was being applied in domains like air traffic control in Europe and in various healthcare organizations.

I actually haven’t heard much recently about it. I actually came across some hard copy handouts that have been sitting on my shelf for five years or so, and that encouraged me to do a little bit of a literature review to understand where it was at. 

There have been a handful of other applications published, of researchers and practitioners applying this technique in a number of organizations. That’s where it’s come from. Anything you want to add, Drew?

Drew: Just that the purpose of this paper is really to explain the technique. It’s not really an empirical research paper. The way they’re explaining the technique is they’re taking an existing incident; David’s going to talk us through that incident. They’re talking a little bit about what the original investigation discovered, then they’re showing how their own new technique applied to that same incident would work, and some of the principles they’re applying.

They don’t make strong claims for the method. They do indicate why they think their outcomes would be different from what was found in the original. It’s not very robust as validation for the technique, because these are real experts in understanding accidents who are doing this re-analysis, so it’s hard to tell whether what comes out is coming from the technique or from their own insights.

To do a fair test, you’d need to train a bunch of independent people in different techniques, and get them to apply the technique rather than have the researchers do it. But that’s not the point of the paper. The paper’s not really a proof that it’s better. The paper’s an explanation of how it works and the rationale behind it.

David: And there are two subsequent papers which attempted to do just that: train internal health and safety people in the technique and get them to apply it. Let’s talk about the original proposed technique itself.

Drew, what I might do is just explain this incident that we’re going to go through. What the paper did was talk about what the root cause analysis performed by the NASA organization concluded, and then talk about (I guess) the application of their technique to the incident and what that found. 

Nothing I’m going to say here is outside the public domain. This story’s been presented at a number of health and safety conferences internationally. I was involved in bringing one of the deputy program directors out to Australia last year to do some workshops for industry around this, so nothing I’m going to say is in any way not public knowledge.

In 2013, during a spacewalk, one of the International Space Station astronauts nearly drowned inside their own spacesuit. Water rushed into the helmet area, covering the nose and mouth, and led to a situation where the astronaut had to abort the spacewalk and (I guess) just made it back inside the space station before they drowned. It was very, very serious. I don’t know, Drew, if you could call that a near miss. It was not really a miss, but it was a very close call to a very bad outcome.

What had happened is that during extravehicular activity (EVA) number 22, the 22nd spacewalk and the first spacewalk that one of the astronauts had ever been on, at the end of the spacewalk, when they were taking the equipment off the astronaut and helping the person get out of it all, they noticed that there was a bit of water inside the helmet. This was seen to be a bit more than normal, but maybe it was sweat. Oh, it’s actually a bit more than that. What’s going on?

They talked about it and debriefed it as a team, and came to the conclusion that it was a leaking drink bladder. If anyone’s done any bush walking or hiking, you have a drink bladder inside your backpack, a tube, and a mouthpiece, and you can just bite down on that mouthpiece and you can drink. That’s no different to what they have inside the spacesuits. 

After a fairly short amount of time and a short discussion, they came to the conclusion that it was a leaking drink bladder. They did actually replace it inside the suit. They threw the old one out. They didn’t test the old drink bladder to confirm that it was actually the thing that leaked.

There was a bit of miscommunication between the astronauts as a result of different nationalities and different languages. They didn’t confirm whether the astronaut had drunk all of their water or hadn’t drunk any of it. There were a lot of things that were just, okay, this is what happened, let’s move on. That was EVA-22, and nothing really happened.

The very next week, they did the 23rd spacewalk. Same suit, same astronaut outside the vehicle, and they had this situation where a whole lot of water rushed into the helmet. Now, as that was starting to occur, mission control just assumed it was the same problem again, that the bladder was leaking. So they encouraged the astronaut to just keep going and let them know if it got any worse, and to drink the water and make sure there was nothing left in the bag.

It got to the point where they had to abort the spacewalk. Even at that time, they weren’t pursuing the idea that the water was coming from somewhere else. They still just assumed that it was all from the drink bag inside the helmet.

It turned out it was a faulty piece of equipment that had failed in a way that they would never have predicted. Without going into the full engineering analysis that was done over multiple years, that’s probably enough background context for the root cause analysis.

Drew: Thanks for that, David. What did NASA decide were the root causes after their analysis of that incident?

David: Would you like to go through the ones in the paper?

Drew: I think if we’re going to compare the new technique to the old technique, we should have some basis for comparison.

David: NASA formed an independent committee following EVA-23. The committee identified the following root causes. This is from the NASA 2013 report. They really looked at what caused the water, what caused Mission Control not to respond adequately to EVA-22, and what caused Mission Control to delay aborting EVA-23 when the situation was starting to deteriorate.

There are about five key points here. One was around this emphasis on maximizing crew time on orbit for utilization work. The space station needs a lot of maintenance. There’s a focus and a priority on getting that maintenance done, getting the spacewalks done, getting everything that needs to be done, done.

There was a perception in the International Space Station community that these drink bags leak, but you’ve actually got just a normal, off-the-shelf drink bag. You’ve got it in a zero gravity environment, you’ve got it inside a spacesuit, you’ve got a whole bunch of different equipment pressing and moving around on it. All it needs is a bit of pressure on that valve to just release the water into the suit. It was a fairly commonly-held perception that these drink bags leak, so that’s the cause of any water in the suit and the helmet. 

The flight control team had this perception that any anomaly report process on orbit is a very resource-intensive exercise, which made them very reluctant to invoke it. And I guess, Drew, this is a bit like a flashback to the space shuttle program and Diane Vaughan’s work: the tiger teams, and any problem that gets raised which might be a flight safety risk creating a huge process to investigate and understand, and then delaying operations while that’s all going on.

The fourth cause was that no one really applied their knowledge of physics, of water behavior in zero gravity, to water coming from what is called a vent loop. I don’t even know the branch of physics that looks at the movement of water, but if an expert had taken a detailed look at that, they may have realized that it couldn’t necessarily have been the drink bag.

Then there’s this idea of normalizing the minor amounts of water in the helmet. Even though water shouldn’t be anywhere inside the helmet, because it happens regularly, it isn’t seen as a risk or an issue.

Those were the five causes. I don’t know if I really like the way that those are described, Drew, but you can get the picture of the view that the investigation formed.

Drew: It seems a little bit ad hoc, a mix between these very specific things and these very vague things. But I guess the key point that they’re making in this paper is that NASA treated these things like stopping points. You work backwards from the incident and you find various things along the way that are root causes in the sense that if you cut those off, they would prevent the accident happening. 

If you didn’t have this emphasis on maximizing crew time, they would have been much more cautious about EVA-23. If they didn’t have this perception that drink bags leak, they would’ve taken the EVA-22 leak much, much more seriously and interpreted the leak during EVA-23 differently. If they didn’t see the anomaly report process as being resource-intensive, they would’ve invoked that process, would’ve investigated between the spacewalks, would’ve found out what was going on, and would’ve fixed things before EVA-23.

You get the idea that you work backwards to these things, and once you’ve got something which seems hard, concrete, and fixable, that’s your root cause and your stopping point.

David: Let’s talk about some of the main points of the paper then. There’s this root cause approach where we’ve come up with these five points along the way, whether they’re behaviors, human errors, or even pressures, resource constraints, time constraints, whatever these things are. If any of these five things had been different, then we might’ve avoided the incident. This is the idea of a root cause, or root causes. We find these five root causes, then we can try to work out how not to have them in the future.

The main point of the paper is that this focus on root causes basically creates actions that add additional defenses, which can increase the negative pressures for compliance. If one of the things was, oh, okay, there was this perception that drink bags leak, well, maybe we need to do a really big debriefing process after every single spacewalk, and then we need to look at every single little thing. That would add to this pressure of, if there are any problems, it takes a lot of time, so maybe we should just suppress this information. That was one of the main challenges with root cause: it adds extra things which create even more pressure in the system.

Drew: The paper goes through a background discussion of the general challenges and problems with root cause analysis–type approaches. I think the biggest philosophy is this idea that if you just go backwards from the accident, trying to work backwards through the system, it gives this mindset of finding lots of opportunities to add in different defenses. Almost as if you can correct the problems by creating this shield around wherever the problems occurred.

But that doesn’t end up in a good system. That ends up in a system which has got a lot of these bolted-on defenses around where all of your past accidents have happened. And that can actually become a more unworkable system that has more time pressure because there’s more stuff that needs to be done. It just creates new opportunities for things to go wrong by all of these different processes and different responsibilities interacting with each other. 

This is probably a fairly common message for the listeners, but maybe new to some of the audience for this paper. We want to shift the mindset away from thinking about what went wrong, to thinking about how the system generally performs, how it adapts, how people deal with pressure, and understanding why sometimes how they deal with pressure makes things more difficult and brittle.

David: And then, Drew, I think one of the paper’s main points is that organizations aren’t safe one day and unsafe the next day, on the day the incident occurs. These blunt end or organizational pressures on frontline or sharp end work are just normal aspects of work systems. If you look at Rasmussen’s dynamic risk modeling paper, this is just the normal way that work happens. People are always pressured and constrained.

But the problems arise when these pressures push operations further and further into what the paper calls degraded conditions. We talk about drift, drift over time, where we’re eroding safety margins and getting closer and closer to not being able to identify and respond to a major situation.

One of the final points here is that the organization needs to support making these sacrifice judgements: actually seeing where margins of safety are being eroded, and being able to back off the production pressure in real time to reinstate that particular safety margin. It’s supporting the trade-off between safety and needing to get this spacewalk done this week, and knowing when that call needs to be made.

Drew: That’s a really important criticism that this paper makes of existing approaches to an investigation. We might come back to this, David, because I don’t think their technique actually does a good job of solving this problem, but they do do a really good job of explaining the problem. 

The central point is that saying that there’s work pressure is not actually an explanation for accidents, because work pressure is normal. Work pressure always exists. So you are saying, oh, the organization was under pressure to cut costs and therefore the accident happened, but every organization is always under pressure to cut costs. Every organization is always under pressure to save time, to save money, to do things quickly.

The really hard question is why do we usually adapt fairly well to that pressure and manage to get work done safely anyway? And why does it sometimes fail? And when it sometimes fails, how do we avoid saying, well, it was just this one time it failed? How do we go back into the system without going so far back that we are just blaming the vague pressures again? How do we get to that useful middle bit of making the adaptive systems work well and work better, and learn from the accident how to better cope with pressure, instead of just either blaming the pressure or blaming the people who broke under pressure?

David: It’s a good point, Drew. The way this model works (and we’ll talk about the model shortly) is to look at those organizational pressures, how those pressures create conflicts or trade-offs, and how those conflicts and trade-offs then lead to adaptations of work. Real work gets changed as a result. When do those changes erode safety margins and lead to incidents? Trying to track that through the organizational system is what this method tries to do.

It’s going to be hard, and we’re not going to have this diagram in front of us, but do you want to explain the model to people, and then we’ll do the reinterpretation of the incident?

Drew: Sure. Basically, they’re talking about two distinctions, sharp versus blunt, and distal versus proximate. Two distinctions give you basically four quadrants you are working in. You can have sharp and distal, sharp and proximate, blunt and distal, blunt and proximate.

They don’t give precise definitions for any of these things, but basically when they say blunt, they’re talking about management. They’re talking about things that create pressures and priorities. When they’re talking about sharp, they’re talking about specific actions that people take in work. When they talk about distal, they just mean removed in time and distance from the accident. And when they talk about proximal, they mean close in time and distance to the accident. 

If you apply that to this case, anything that happens not on the space station is going to be blunt, because it’s providing pressure and priorities for the astronauts, but it’s not taking direct actions involved in the work. Anything that happens on the space station is likely to be sharp, but it could still be removed from the accident in time and distance. It could be on a previous spacewalk or between spacewalks.

Whereas the stuff that happens on that particular spacewalk is going to be proximal. It’s going to be sharp if you are the person on the spacewalk, it’s going to be blunt if you’re the people at a distance trying to provide advice, set priorities and pressures for the people on the spacewalk. 

David, is that a fair description of the four quadrants?

David: Yeah, and I mentioned that they claim the process is quite resource-efficient. You take that four quadrant model, then you just map with bullet points what those pressures, conflicts, and adaptations are. You draw some lines between them, so you’re trying to show that this pressure on productivity contributed to this particular conflict, which contributed to this frontline or sharp end decision being made or action being taken, and you end up with this pattern diagram.

Actually, that’s where the name comes from: SCAD stands for systemic contributors and adaptations diagramming. The whole accident analysis is almost entirely captured in a single diagram. That’s really the output of the process.
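
To make the shape of that output a little more concrete, here is a rough sketch (our own illustration, not the authors' diagram or tooling) of how the four quadrant mapping could be captured as data: nodes placed by blunt/sharp and distal/proximate, with links tracing a pressure to a conflict to a sharp end action. The node labels are paraphrases of the EVA discussion in this episode, not wording from the NASA report.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    end: str        # "blunt" or "sharp"
    distance: str   # "distal" or "proximate"

@dataclass
class Diagram:
    nodes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (from_label, to_label) pairs

    def add(self, label, end, distance):
        # Place a pressure, conflict, or adaptation into one of the four quadrants.
        self.nodes[label] = Node(label, end, distance)

    def link(self, from_label, to_label):
        # Draw a line: this contributor feeds into that one.
        self.links.append((from_label, to_label))

    def show(self):
        for src, dst in self.links:
            a, b = self.nodes[src], self.nodes[dst]
            print(f"[{a.end}/{a.distance}] {src}  -->  [{b.end}/{b.distance}] {dst}")

d = Diagram()
d.add("Pressure to maximize crew time on orbit", "blunt", "distal")
d.add("Perception that anomaly reports are resource-intensive", "blunt", "distal")
d.add("EVA-22 water attributed to a leaking drink bag", "sharp", "distal")
d.add("Continue EVA-23 despite water appearing in the helmet", "sharp", "proximate")
d.link("Pressure to maximize crew time on orbit",
       "Perception that anomaly reports are resource-intensive")
d.link("Perception that anomaly reports are resource-intensive",
       "EVA-22 water attributed to a leaking drink bag")
d.link("EVA-22 water attributed to a leaking drink bag",
       "Continue EVA-23 despite water appearing in the helmet")
d.show()

Printed out, each line of that output reads like one strand of the pattern diagram: a blunt, distal pressure feeding forward into a sharp end action close to the event.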

Drew: What I think is really novel is not the diagram or even the exact process they go through. It’s the philosophy of working from the blunt end towards the sharp end rather than the other way round. 

Normally, what you try to do is you take the events in the accident and you try to say, okay, what caused those events and what caused those events? What caused those events? That takes you sometimes to things that are distal. It sometimes takes you to things that are the blunt end, but only if you can trace a clear path backwards. 

Whereas this technique says, okay, let’s try to find what are the pressures that exist within the organization normally all of the time. How do those pressures cause work to adapt? How have people adapted? How do they usually adapt to those pressures? And then let’s look at how that affected the events of the accident. 

So you’re not trying to trace down into the roots. You’re trying to find the broad stuff first and then apply it to the accident. I think that shift in thinking is really quite significant, and could apply even if you ignore the rest of the technique.

David: We’ll talk about this model at the end, Drew, but as a spoiler for the practical takeaways, I actually think this model might be a better normal work analysis tool than an incident analysis tool, because it works from what’s normal in my organization today, how that’s shaping work today, and how it might shape work today and tomorrow.

Drew: I’d buy that, but I’ve argued for a long time that we should investigate all of our accidents as normal work investigations rather than as accident investigations.

David: Yeah. Very good.

Drew: There’s a particular process for conducting interviews for collecting the data that feeds into this model. David, do you want to talk through how the SCAD interview would go?

David: It wasn’t in this particular paper, but as I started to pull up the more recent papers on the applied work, what they call the full SCAD process is basically that you need to get the people involved in the event and ask them what was different, just to identify the adaptations.

The way they do that is they ask people, what’s the textbook approach here? What’s the way a new person would do this? How did it happen on this occasion? You’re really trying to get at where the work-as-done has shifted in this particular instance. I think there’s an underlying assumption here that an incident is caused by, or results from, an adaptation to work that erodes a safety margin, and then the event occurs.

What they want to do is actually find what’s different, understand how that compares to the standard work like I just mentioned, and then start to understand why that adaptation occurred based on probing the conflicts and tracing back the sources of pressure. 

It’d be almost like just asking a person why was work happening like this? Is it different to the way it normally happens? Why were you doing it? Because I didn’t have this piece of equipment or I didn’t have enough time. Okay, well why didn’t you have enough time? Because the organization hadn’t planned this project properly. Why hadn’t the organization planned the project properly? Because the organization hadn’t sold enough work this year. We were sending in low bids for everything we were doing. Okay, why? Because we hadn’t read the market appropriately and we hadn’t set appropriate targets for market conditions.

You can see in that how, having probed three, four, five times back into the blunt end, a board setting a target that wasn’t considerate of market conditions has, over a 9- or 12-month period, led to someone performing a task in an unsafe way.

Drew: No, no, David. Just to check my own understanding, because the way you described it then almost sounded like how people do RCA normally: assuming that the incident is something that is deviant and then tracing back to find what caused that deviance.

If I understand correctly, when you are collecting this data for the SCAD approach, you are not necessarily focusing just on what happened differently in this incident. You’re collecting examples generally of when this particular task tends to vary. You might even collect examples from a different spacewalk on which work didn’t go exactly according to the schedule.

You’re looking generally at how this task adapts and what are the pressures and forces that often cause it to adapt in particular ways, which of those adaptations happened once, and which of those adaptations are permanent drift or changes to the task that have happened over time. You’re trying to get this link between broad forces and how the task generally adapts, rather than just seeking out what went wrong on this particular occasion. Otherwise, adaptation just becomes another word for deviance.

David: You’re right. My discussion wasn’t broad enough. They do talk about the triggering event, the situation because it is an accident analysis technique, so they are looking at the specific event. But then the exploration moves quickly to how this type of work is normally done. 

Like you said, it actually turns into this more normal work exploration. What’s the textbook approach to this? How do other people do it? What’s my normal way of doing this? Was anything I did today different from my normal way or different from other people’s normal way? It is very much trying to put the event in the context of how the system normally functions. Is that what you were saying, Drew?
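
As a rough sketch only (our own grouping of the questions David and Drew describe here, not the authors' published interview protocol), those prompts could be organized something like this:

# Illustrative grouping of SCAD-style interview prompts, assumed from the
# discussion above rather than taken from the papers.
SCAD_STYLE_PROMPTS = {
    "establish the baseline": [
        "What's the textbook approach to this task?",
        "How would a brand-new person do it?",
        "How do you and your colleagues normally do it?",
    ],
    "surface the adaptation": [
        "How did it happen on this occasion?",
        "Was anything different from your normal way, or from other people's normal way?",
        "Does this task often vary, or was this a one-off?",
    ],
    "trace the pressures": [
        "What made you do it that way? (time, equipment, staffing, priorities)",
        "Where does that pressure come from?",
        "Is this a change that has crept in over time, or something done just this once?",
    ],
}

for intent, questions in SCAD_STYLE_PROMPTS.items():
    print(intent.upper())
    for q in questions:
        print("  -", q)

The ordering reflects the point Drew makes above: establish how the work normally goes and how it tends to adapt before zooming in on what happened on the specific occasion.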

Drew: Yeah, and I guess I was pointing towards the trap with these methods: once you turn them into a method, you can end up getting right back into how you were previously doing investigations, just using slightly different language and diagrams, because sometimes it’s really quite subtle what the distinction or underlying philosophy is that’s trying to guide you down a different path, as opposed to the specific diagram or method that could just take you down the same path you’ve always gone.

David: So that’s the approach. Let’s talk about the reinterpretation of this EVA incident. Do you want to go through what the paper has suggested here or would you like me to do that?

Drew: I’m happy to go through it. I do just want to acknowledge that when they do this reinterpretation, they’re stuck with the facts that the original investigation found. They’re always going to be vulnerable to the argument that they’re just giving the same facts in a different order. What they say is there was a breakdown in learning between—

David: Can I just add? One of the things that isn’t clear in this paper, and could be a confounder, is that these researchers had a lot of close access to NASA, and NASA learned a lot about this event from when the 2013 root cause analysis was first published through to several years later, when they tracked back and bottomed out some of the engineering requirements. These researchers were close to NASA throughout that whole period.

In terms of your comment there about how it would normally work, in this case the researchers in the lab may have known a lot more about this event from NASA than what the original RCA investigation actually knew.

Drew: Well, I was trying to be charitable to the researchers here, but I think you’ve just taken away their best excuse for the fact that their reinterpretation adds very, very little to the original interpretation.

David: They may have. I’m not sure, but I know that NASA did learn a lot more and the lab was doing some work with NASA through that period.

Drew: Anyway, here’s what they say. They say there was a breakdown in learning between the two spacewalks. The practitioners—the people in the space station—adapted to the production pressure in the system by stopping activities non-essential to the spacewalks, such as updating their failure modes and effects analysis and their critical items list.

Because they didn’t go through the anomaly report process after the first spacewalk (EVA-22), they missed out on an opportunity to discover a failure mode that had never before occurred, but was going to occur again on the very next space walk. The second space walk was scheduled only a week after the first space walk. 

Doing this really resource-intensive paperwork, the failure modes and effects analysis, just for what they thought was a minor issue, when they needed to spend their time instead getting ready for the next spacewalk, probably wouldn’t have been allowed anyway.

We have a learning breakdown which reinforced the original interpretation, which was just that the drink bag was leaking. That was one of the main reasons they continued, during the spacewalk that went wrong, to think that the drink bag was the explanation.

Now David, I don’t think that’s any better than the original analysis. I think it says exactly the same thing, and it’s just a hindsight claim that if they’d done more analysis work between the spacewalks, then in hindsight maybe they would’ve found and fixed the issue or at least understood the issue a little bit better. 

That’s what the original investigation said as well. It said that they hadn’t done this thing between the spacewalks because it was a really resource-intensive activity, and they didn’t have time to do it. I’m not really sure what the contribution of the technique is here.

And we still don’t have any real evidence or insight into what we do about this. We know that there’s always going to be this time pressure. We know that the safety techniques are onerous and that there’s a perception that they’re not going to be able to produce anything useful. We don’t have any good evidence even as to whether it would’ve been a good or bad idea to do the analysis. It’s only hindsight that says that it’s a missed opportunity. We haven’t done any tests of the analysis technique to see whether it’s actually good at finding this thing. 

I’m not certain that the re-analysis has learned anything except to draw this explicit link between pressure and not doing the analysis, which was implicit in the original investigation anyway.

David: Maybe they’ve talked about the proximal contributors in terms of the organizational pressures, which is what the model is trying to do. Some of the things in the original investigation, like the normalization of drink bag leakage, aren’t as prominent; it was more prominent in the root cause analysis and less prominent here. The way I see it, it’s a changed emphasis. The original report probably placed more emphasis on some of the immediate actions of the people, and this is trying to place more emphasis back on the blunt end of the organization. But in terms of new and different insights, there isn’t a huge amount of new and different insight.

Drew: So it shifted the emphasis on who it’s judgy about. The first version is implicitly a little bit more judgy about the sharp end people, and the second version is a little bit more judgy about the management people. But it’s really the same explanation, just with different emphasis.

I think the real helpful question is when you have time pressure, when you have short durations between events, what’s the best process for making sure we’ve learned from the first event before we have the second event? Neither investigation has really managed to answer that. Although I think NASA has done a lot of work in the meantime to be better at answering that.

David: Exactly. I think somewhere in here there’s one example that’s just about assumptions. I know we’ve talked about assumptions in risk assessment, and here we’ve got a lot of assumptions about a particular drink bag. NASA talks a little bit about this big, long administrative process.

It actually wouldn’t take much to get one of the astronauts to fill that bladder up with water, squeeze it, and see if it actually did leak, and to talk to the astronaut and ask a few questions: how much did you drink during the spacewalk, and was the volume of water in the bag the same as the volume of water in the helmet? I know NASA very much felt that if they’d asked about three questions, they would’ve found out in 15 minutes that that wasn’t the issue.

Drew: The whole thing is they wouldn’t have asked those questions unless they’d started an anomaly process in the first place. And the anomaly process doesn’t stop at three questions. Anomaly process stops a month later, and the spacewalk was in two weeks’ time.

David: And they’ve significantly changed their spacewalk debriefing process. At the time it was a very small affair, an exception-only discussion with a very limited group on the ground, often only with the flight director: they’re my astronauts, no one else talks to them. Now they have a very broad, detailed, exploratory debriefing process around these activities to try to find any of these subtle differences that have occurred.

It’s not to say that this technique was designed to pick all of that up, but one of the things that is absent from this technique, Drew, and they call it an accident analysis method, is that there is no commentary in the paper at all about how to design improvements and recommendations.

Drew: No, and I don’t want us to get too judgy ourselves on one paper out of a series, when we know that the same PhD student wrote follow-ups and a whole thesis which answers a lot of the things that we’re raising as shortcomings in this particular paper, which is just one out of several.

David: And I mentioned some of those follow-ups earlier. I don’t know if anyone picked up when we said this came from the Proceedings of the Human Factors and Ergonomics Society Annual Meeting. This is a conference paper, and given the 2016 timing, it is likely to be the very first presentation and publication in the wild of this particular technique. We’re talking about eight years ago, and I know the lab has spent the last eight years working on this technique.

Drew: What I’d really like to get onto is some of the takeaways and contributions which at this stage of the research weren’t well-evidenced, but I think are still useful things for us all to think about when we are thinking about how to do investigations regardless of what specific technique we’re using.

David: Great. Drew, takeaway.

Drew: The first takeaway I had is that this is always our challenge: how do you go beyond the immediate events to find broader system issues and broader learnings? We all want to do that from our investigations. We all get frustrated when we don’t manage to do it effectively.

The particular suggestion that I really liked from this work is not to try to do it just as this back tracing process of, this happened. Okay, why did it happen? Why did it happen? Why did it happen? Why did it happen? That’s always going to be limited and is going to risk hitting these dead ends that end up just blaming other very proximal things.

I like the idea of just jumping straight to the pressures and asking people right up front: what are the pressures guiding this work? How are you adapting in response to those pressures? How do you feel about those adaptations? Do you feel forced to make them? Do you think they’re good adaptations? How did they affect the incident playing out?

If we’d adapted differently, would we have not had this incident? Okay, so how could we have adapted differently? I like just jumping straight there rather than trying to trace a careful network of this-caused-that, which I think often ends before we get to things that are useful.

David: I agree. I like that. 

Drew: The second one, which is fairly related, is just changing that language away from causes to talk about pressures and contributors. You don’t have to have a provable link where but for this happening, the accident wouldn’t have happened, or if we could have prevented this thing, we would’ve prevented the accident. 

Lots of things we can improve still wouldn’t have prevented the accident, but they’re still worth improving. The more we insist on that direct causal link, I think the more we miss out on opportunities for learnings in accidents.

David: And I do like that language shift to talk about pressures and contributors. I think of the number of times I’ve heard practitioners and investigators say, on this day this was different and the incident occurred. I think there’s a generally held view that accidents happen when work changes in a way that is a little bit different to how it might normally be done. This language and this approach are really just trying to find those adaptations and the pressures that are leading to them being made.

Drew: The third one I’ve got, which is not directly about this paper so much as the whole category of papers it’s part of, is that if you read enough papers about new accident methods, you realize that the really smart people who write these new methods still tend to reproduce their own hindsight and their own prejudices about what causes accidents when they apply their new methods.

That’s not them doing something wrong. It’s just that it’s a really strong and inevitable human tendency that when we think we’re explaining something, we’re actually just reproducing our own thoughts about what causes accidents.

So don’t beat yourself up when your own investigations don’t work as well as you’d like. There isn’t a magic method. Smart people have the same cognitive traps. It really is a case of over time building up ideas and principles that tend to make investigations more useful rather than a particular method being particularly bad or a particular method being particularly good.

David: The last one that I just wanted to add was around the comment that I made earlier in the podcast about accident analysis techniques. With any accident analysis, we always struggle with: something bad happened, something went wrong, and we need to make it go well in the future.

It’s not just the hindsight prejudices that you mentioned. Increasingly, we talk about focusing on normal work, and I think some of these techniques are potentially really good for normal work studies: actually going out and, like you said, just trying to look at what the pressures are today, this week. How are they shaping work? What’s being done differently as a result? How do we feel about those adaptations in the context of safety margins and safety risks?

I do like the simplicity of this type of a model, a four quadrant model. A few questions, three or four questions to ask people, and you might get a lot of understanding about your organization as a result. I know that this team were doing that with air traffic control. 

You can’t see it on the podcast, but I’ve got some single-sided A4 bits of paper that they were leaving beside the air traffic control console. For any little adaptation the controllers made, they were trained to just jot it down on that four quadrant model: oh, I did this differently today, and here’s why, why, why. It was almost like an adaptation reporting process in real time. I think it’s worth looking at some of these techniques not just for their usefulness in accident analysis, but for their potential usefulness in understanding work.
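
Purely as an illustration (the actual card the lab used isn't reproduced here, so every field name below is our assumption), a digital version of that kind of real-time adaptation log might capture something like this:

from dataclasses import dataclass, field
from datetime import date

@dataclass
class AdaptationReport:
    when: date
    task: str
    what_was_different: str            # the adaptation, in the worker's own words
    why_chain: list = field(default_factory=list)   # successive "why" probes toward the blunt end
    pressures: list = field(default_factory=list)   # the organizational pressures named
    effect_on_margins: str = ""        # did this erode or restore a safety margin?

report = AdaptationReport(
    when=date(2019, 5, 14),            # illustrative date only
    task="Handover between sectors",
    what_was_different="Skipped the written handover note, briefed verbally only",
    why_chain=["Traffic peak overlapped with the handover",
               "Roster left one position uncovered"],
    pressures=["Keep sector throughput up during the peak"],
    effect_on_margins="Less written trace for the next controller to check back on",
)
print(report)

Collected over weeks, records like this sit closer to the normal work study David describes than to an accident investigation: the adaptations and the pressures behind them are logged before anything goes wrong.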

Drew: This isn’t a takeaway, David, but this is what I love about doing work involving NASA: it really drives home that even the rocket scientists are still vulnerable to the same safety traps that the rest of us are. And none of us have the resources to throw at investigations that NASA has.

If their time would be better spent looking at normal work rather than accident investigation, maybe we shouldn’t be trying to devote so much effort to getting investigations perfect, and just recognizing that there are fundamental limitations when we are trying to do hindsight looking at what went wrong, and just shift the effort away from that activity that is so hard to get right.

David: So, Drew, to put you on the spot, the question we asked this week was, should we ask about contributors rather than causes?

Drew: It probably helps but still doesn’t fix the problem that we are facing with trying to get useful system changes out of investigations.

David: Excellent. And if you’re a safety scientist looking to create your own accident analysis technique, just hope you don’t get Drew as a peer reviewer for your paper. 

So that’s it for this week. We hope you found this episode thought-provoking and ultimately useful in shaping the safety of work in your own organization. Send any comments, questions, or ideas for future episodes to feedback@safetyofwork.com.