The Safety of Work

Ep.87 What exactly is Systems Thinking?

Episode Summary

In today’s episode, we discuss another in our series of foundational papers: “Applying Systems Thinking to Analyze and Learn from Events” published in a 2011 volume of Safety Science by Nancy Leveson.  Leveson is a renowned Professor of Aeronautics and Astronautics and also a Professor of Engineering Systems at MIT. She is an elected member of the National Academy of Engineering (NAE). Professor Leveson conducts research on the topics of system safety, software safety, software and system engineering, and human-computer interaction.

Episode Notes

We will review each section of Leveson’s paper and discuss how she sets each section up by stating a general assumption and then proceeds to break that assumption down. We will discuss her analysis of:

  1. Safety vs. Reliability
  2. Retrospective vs. Prospective Analysis
  3. Three Levels of Accident Causes:
     - Proximal event chain
     - Conditions that allowed the event
     - Systemic factors that contributed to both the conditions and the event

 


Quotes:

“Leveson says, ‘If we can get it right some of the time, why can’t we get it right all of the time?’” - Dr. David Provan

“Leveson says, ‘the more complex your system gets, that sort of local autonomy becomes dangerous because the accidents don’t happen at that local level.’” - Dr. Drew Rae

“If you try to model things as chains of events, you just end up in circles, because real systems always have feedback.” - Dr. Drew Rae

“Never buy the first model of a new series [of cars]; wait for the subsequent models where the engineers have had a chance to iron out all the bugs of that first model!” - Dr. David Provan

“Leveson says the reason systemic factors don’t show up in accident reports is just because it’s so hard to draw a causal link.” - Dr. Drew Rae

“A lot of what Leveson is doing is drawing on a deep well of cybernetics theory.” - Dr. Drew Rae

 

Resources:

Applying Systems Thinking Paper by Leveson

Nancy Leveson – Full List of Publications

Nancy Leveson of MIT

The Safety of Work Podcast

The Safety of Work on LinkedIn

Feedback@safetyofwork.com

Episode Transcription

David: You're listening to The Safety of Work podcast episode 87. Today we're asking the question, what exactly is systems thinking? Let's get started.

Hi, everybody. My name is David Provan. I'm here with Drew Rae, and we're from the Safety Science Innovation Lab at Griffith University. Welcome to The Safety of Work podcast. In each episode, we ask an important question in relation to the safety of work or the work of safety, and we examine the evidence surrounding it.

What's turning into a bit of a series now, Drew, is that we're discussing some of the foundational papers from some of the more popular, or maybe more influential, authors in safety. In episode 74, we discussed a paper by Daniel Katz. I don't think he would have claimed himself as a safety author, but some very relevant ideas for us in safety today. In episode 85, we talked about Amalberti's paradoxes of almost totally safe transportation systems.

In Episode 86, we introduced Jens Rasmussen, who we framed as the intellectual grandfather of a lot of the recent safety theories. Now in this episode, Drew pulled out some work by Professor Nancy Leveson. We're going to make quite a few comparisons between Leveson and Rasmussen from the last episode. So if you haven't listened to that, we'll wait. Go back to episode 86 and take a listen. But if you're still with us or if you've come back and rejoined us, let's jump straight into the paper.

Drew: Yeah. Sure, David. I don't know about you. I went down quite a few rabbit holes preparing this episode, but I forgot to check Leveson's Rasmussen number. Do you know if Leveson ever actually directly co-published with Rasmussen?

David: No, I do not. But she is no more than one removed.

Drew: Yeah, that's pretty much what I was thinking. Often, when people draw a diagram of people who were influenced by Rasmussen, there's this generation that includes Woods and Hollnagel, and Leveson gets included in that set. Let's go straight into the paper and then we might talk a bit about Leveson as we go.

The paper is called Applying systems thinking to analyze and learn from events. It was published in the journal Safety Science in 2011. One thing that Leveson is really good at is she makes pretty much all of her work open access. So you can just directly search for the title of the paper and find a copy of it on her website. It's not available open access through Safety Science, but you can just find a PDF really easily.

A little bit about Leveson. Leveson is particularly well-known in the safety critical systems area of safety science. So if you do work that involves high technology, in particular software or aerospace, you will almost certainly have encountered Leveson's stuff before. If you're on more of the occupational health and safety or the well-being side of things, you're less likely to have directly run into Leveson's work.

Leveson is one of those people who's just a really good writer. I love picking up any of her stuff. She's got a great paper, for example, that just talks about the evolution of steam engines and the different attitudes of UK regulation and US regulation. You'd think that'd be just a dry topic, but she writes really clearly and well. She uses analogies, metaphors, and examples.

With some authors, you'd say don't go and read the original, but Leveson is always easy and fun to read. She's particularly well-known for two books. The first one is a really early one called SafeWare, which is one of the first safety critical systems books. SafeWare was published in 1995, which was back when I was still in university.

It's actually the first book about safety that I ever read. It's one of the things that actually got me into safety engineering as a career. David, did you ever run into Leveson's work before you started your Ph.D.?

David: Yeah, I was familiar with the second book, which you haven't mentioned, so I'll mention it, which is Engineering a Safer World. I had a number of process safety and system safety engineers bring that book to my attention. I probably read Leveson's work before I read any of Sidney Dekker's or Erik Hollnagel's work, because maybe there wasn't as much around. This is pre-2008, 2010.

Drew: We try to be fairly neutral on this podcast about fights between safety authors. It is probably worth mentioning, though, that Leveson's a fairly prickly character, even though she's co-authored papers with lots of people you might have heard of. She's done multiple papers with Sid Dekker. She co-edited one of the foundational volumes of resilience engineering with Hollnagel and David Woods.

She really likes to call out other people in her work. She'll put whole sections in her papers or even has written entire papers just criticizing other people's approach to safety. I probably just should be fair and acknowledge that I once worked for Leveson. I was once fired by Leveson, so we had our own prickly encounter. 

Basically, my own personal position is I really respect her own work. But I think some of the criticisms she has of other people are often mischaracterizations or a bit unfair. So don't use Leveson as a source of what other people mean. Read the other people first then read Leveson's criticism of them.

David: Yeah. I don't know Nancy, but I suppose a number of our listeners would have seen that just recently, this year in fact, she published a paper on Safety-III. So people asked us to do a review of that paper, and we didn't want to do that because we didn't feel it was contributing new ideas so much as trying to write down other people's contributions.

So it didn't really fit with us being neutral, but her work stands alone and even rereading this paper that we're going to talk about today, I really enjoyed just reading through. This is a good one to read. This is a good one for people to pick up and read. It covers a lot of ground.

Drew: Just to characterize overall where Leveson comes from before we get into the rest of the paper, Leveson is similar to the other Rasmussen descendants. She does see problems with traditional approaches to safety, and wants to critique them and provide alternatives. Where she differs is that she seems to mainly characterize Safety-I, the command and control approach to safety, as something that's done badly and in the wrong places.

She doesn't have the same fundamental disagreement that people like Hollnagel have. Hollnagel will often drift into language that suggests throwing out Safety-I. He doesn't always mean it. He often backs off from those sorts of things, but he seems to be taking that very attacking position. Whereas Leveson is much more saying, we can do this right, we can engineer our way towards a safer world, which a lot of the more social scientists don't really believe.

David: I think we have that clash of domains and worldviews between the engineering sciences and the human sciences. Neither is right or wrong; it's genuinely an 'and'. I think it's when one camp thinks that they have all the answers that it gets particularly problematic. You seem to do quite well, Drew, coming from an engineering background and jumping across into social sciences. We almost sometimes forget that you did have an engineering background.

Drew: The interesting thing is that Leveson comes from a psychology background and jumps across to engineering. We had one particularly harsh conversation when she accused me of not thinking like an engineer. In hindsight, that was probably actually fair. I'm an engineer who's gone to the dark side. She's a psychologist who's gone across to the engineering side. The two mindsets really do sometimes talk at cross purposes about safety.

David: Okay, Drew. Let's get stuck into this paper. There's a bit of ground to cover. Do you want to sort of frame up the introduction and then we'll bounce through each of the key sections?

Drew: So if you've done as we asked and you've looked at the most recent couple of episodes, you're going to detect a pattern that almost all of these papers start off in the same way, which is trying to characterize, what is the big problem with safety today that we're trying to address? Often, they say a very similar thing.

Leveson starts the paper by saying that major accidents keep happening. It's really frustrating that they seem to be preventable because they've got these same causes, and it seems like we're failing to learn. So why is this? She suggests that there are three options.

The first one is that in analyzing the accidents, we're not truly discovering the underlying causes. The second possibility is that learning from experience just doesn't work like it's supposed to; this idea that we learn from experience is not all it's cracked up to be. The third is that we are learning, but we're learning in the wrong places.

We're going to come back to this later, because she actually has some good reasons why all three of these things might in fact be happening. More generally, how come it seemed like a lot of this safety stuff was working well previously, that in some industries we were just incrementally getting safer, not just around occupational health and safety but around the rate of major accidents?

Aircraft were genuinely crashing less. Railways were genuinely crashing less. How come that has slowed down and is not continuing to improve? The real answer that she wants to give is that systems are becoming more complex, and the way in which we try to think about that complexity is not helping us. We always need to abstract and simplify. But if we do that in the right way, it helps us. If we do it in the wrong way, it hides the real things that we need to fix.

David: If our listeners listened to last week's episode about Rasmussen, and even the one before about Amalberti, those authors asked similar questions, as you mentioned. I don't mind these papers where authors start with: here's a problem, and here are some hypotheses for what might be causing that problem. I'm not going to go and research it, I'll just lay out my thoughts about what's happening here.

Amalberti saw this persistence of accidents, this inability for us to maybe eliminate or even further reduce accidents and errors. It's kind of inevitable. I liken Amalberti's thinking in some ways to normal accident theory. The systems are so complex.

We'll do a lot to make them safe, but at some point, they'll fail in certain ways. That was Amalberti's (I suppose) premise. But Leveson really sees any remaining accidents that we haven't prevented as evidence that our existing approaches aren't perfect. If we can get it right some of the time, why can't we get it right and engineer it right all the time? Is that a fair way of thinking about the way that these papers are being framed?

Drew: I think that's very fair. I think you can see the subtle difference in engineering versus social science thinking. Someone like Amalberti and I think Rasmussen, but not to the same extent because Rasmussen is a little bit engineery. They're describing an almost inevitable world. That this is just the way things are and my job is to explain it to you.

Why do we keep having accidents? This is why we keep having accidents. Whereas the more engineering you get, the more they're taking personal offense to the fact that we're still having accidents. It's not why we are having accidents, let me explain it to you. It's why are we still having accidents? What are you going to do about it?

I should just explain, because we haven't done a listener survey, so I don't know how many people who listen to our podcast are engineers. But engineering generally has quite a long education that instils some very particular ways of thinking about the world. It's not the same as tinkering.

Engineering isn't about just designing stuff from scratch. Often your engineering lecturers will tell you, this is the difference between doing engineering properly and just messing around. Engineers, before they build something, they predict how it's going to work. They apply science to model the system and to know what the end result is going to be.

It's never perfect, so next time, you create a better model and you have a better understanding of how it's going to be. If you're trying to make a faster plane, a tinkerer just keeps modifying the plane until it gets faster and faster.

Whereas the engineer is supposed to know even before they build the plane, how fast it's going to be. If it's too fast, the engineer gets upset. My model was wrong. I've got to have a better understanding of how planes work so that next time I can predict how fast it's going to be.

From that mindset, if you have an accident, that is proof that your model of how accidents are caused isn't perfect. You don't just have to prevent the accident, you got to go back and fix up your accident model. If you can really understand how something works, then you can control it.

David: I think some people get around this by arguing that if we all followed this engineering approach, we do know how to prevent incidents, we just don't apply this knowledge properly. I think what Leveson is saying is that it's not enough just to make that argument about applying it properly.

She really wants to ask why the knowledge doesn't get applied properly. So she goes through the rest of her paper trying to point out why, in her mind, this knowledge isn't being applied properly: because the way that we're thinking about it isn't allowing us to apply it properly.

Drew: Yeah, and part of that is just extending your model of the system to say, okay, the misapplication of knowledge, that's part of what we need to model. If there's an engineering technique that would stop an accident and we're not following that technique, then we've got to model how we're choosing our techniques and understand that bit as well.

The paper is organized very neatly into a set of sections, and each section has its own theme. Each theme is basically built around an assumption. Leveson spells out what the assumption is and then deconstructs it.

David: I actually like this format. For listeners who pick it up, and hopefully you do pick it up, it's not actually that long of a paper. She starts each of the four sections that we're going to talk through with what she believes is a general assumption, and then outlines an argument and breaks it down. I actually didn't mind that format for a paper.

The first section is safety versus reliability. She starts by saying there might be this general assumption that I'll quote, "Safety is increased by increasing the reliability of individual system components. If individual components do not fail, then accidents won't occur."

Drew: To unpack that assumption, she gives some definitions. In particular, there's a definition of reliability and a definition of safety. Reliability is when something performs its intended function, that is, the intended function with respect to its mission. If we're designing an aircraft, its intended function is to get the passengers from A to B. Whereas safety is the absence of accidents, which is not the same as the intended function.

Safety is like a side effect of the intended function. You could have an aircraft that got the passengers from A to B and they all were dead on arrival, she would see those as two different things. Now for really, really simple systems, the two are closely related.

If the brakes on your car are unreliable, that's also unsafe. As systems get more complex, you can have systems that are made up of reliable components but are getting less safe, even as those components get more reliable. Or as you make components safer, the system gets less reliable.

She gives quite a long example of an accident. I thought it might be a bit more useful to give a couple of simpler examples than to go through the whole accident. A simple example of the difference between reliability and safety is, let's say, the post office delivering parcels. Reliability is getting those parcels to you on time. Safety is about the posties not getting hurt.

You could make a rule that increases safety by saying they're not allowed to drive in the rain. But that will make the system less reliable because your parcels are getting there late. Or you could make a rule that posties have to drive really, really fast, which will increase the reliability. The parcels would more often get delivered on time, but would decrease the safety. There'd be more chances of an accident. Sometimes the two compete, sometimes they mean the same thing.

David: Yeah, I like the core argument here. It's worth us even thinking beyond the four assumptions that we're going to talk about in this paper. A good reflection exercise for any listener to do is: what are the assumptions that are held inside our organization? I've spent a lot of time in organizations trying to challenge just this saying that good safety is good business, because once you accept this idea that safety and reliability are different things, 'good safety is good business' is a very oversimplified assumption that doesn't hold true once you get into operational environments.

Drew: Yeah, there are times when it would be true, but there are also times when it is very, very not true. I think for the sake of clarity, in this section, Leveson starts attacking high reliability organizations theory. But the reason she is attacking it is mainly because of their use of the word reliable, which I think is a case of HRO social scientists misusing an engineering term.

She takes them literally and says, okay, they're using this term reliable, they mean reliable. This is what's wrong with their theory. When in fact, I think a much fairer criticism would be HRO theorists aren’t engineers, they're misusing engineering terminology. They should have called it something else. David, are you okay if we just skip that whole attack on HROs?

David: Yeah, absolutely. Her point is probably that it should be called something like HSO theory, high safety organization, rather than high reliability. Choosing your language carefully is the lesson there, and I think we can move on.

Drew: Okay. But there is a more fundamental disagreement that I think is fairer and more interesting, which is HRO theory says that we build up safety by deferring expertise to the frontline and we make things locally safe and locally resilient. Leveson says that the more complex your system gets, that local autonomy becomes dangerous because the accidents don't happen at that local level.

If someone is laying bricks, sure, they're an expert in laying bricks. They can understand that the way they lay that brick might strain them or get them hurt. They're probably the experts in how to do that safely. But if the bricks that they are laying are part of the shell around a nuclear reactor, then they can't see any of that.

They don't understand that getting the precise angle on those bricks is important, because for them getting their job done, that's irrelevant. You've got to have someone who knows the whole overall shape to know that the angle matters. There are lots of situations where that very Safety-II idea that the worker is the expert in their own work is not in fact true. They are experts locally, but the hazards are invisible to them. They're global.

David: Yeah, Drew. I think that's an important point. Just to reinforce with another example, I've spoken a lot to some of the other authors around this idea that deference to expertise means autonomy at the frontline, and that the workers are best placed to decide how to safely perform their work.

In simple craft-related systems, I think of the original examples: someone who's been a hairdresser for 45 years, or a shoe shiner or something. But in my last organizational role, I always thought, I don't want an operator in a major hazard facility walking around and deciding for themselves which valves they turn, what sequence they turn those valves in, and where they push product through the system. Because they don't have perfect information on the functioning of the whole system. They might not understand all of the technology.

I think what Nancy is saying here is that people can think they're being safe in relation to their own work, their own experience, and their own knowledge of the system, but they can be dangerously changing the dynamics of the way the system works to be overall, much less safe.

Drew: I think this is one of the reasons why safety people talk past each other sometimes. If we're talking about totally different systems, then, of course, we're going to have different assumptions. What tends to happen is people try to adopt techniques that have worked apparently well in fields like aerospace. They're shifting those techniques into other organizations like construction, and we're changing the fairness of that assumption that the real hazard happens at the system level.

Yes, there are some things in construction, like whether the building is going to fall down, where absolutely you don't want the local worker changing the arrangement of the bolts. What seems like a simple, meaningless change to them may in fact change the load on the entire structure. But most of the time in construction, the safety has to do with the immediate safety of the worker and the people around them. The very designed, very top-down approaches that work in aerospace don't work where the worker genuinely is the expert in their own safety.

David: What was Leveson's answer then to this relationship between reliability and safety?

Drew: Okay. This is a little bit of a deep dive, David. I apologize for this, but I don't think I could explain Nancy's stuff without actually unpacking this whole idea about, what do we mean when we talk about systems thinking?

David: Perfect, Drew, hence why I just read the first line of that and went, Drew can take care of this much, much better than me. Hence, the question. What does Leveson have to say about reliability and safety? What does Drew have to say about what Leveson has to say?

Drew: I don't actually really have any criticism of this. It's a bit of a historical thing. When Leveson talks about systems, she's particularly talking about a thing called systems theory, which is itself kind of vaguely defined because it was originally German, and the translation of both 'systems' and 'theory' is a little bit imprecise. So the whole field of systems theory has very gray boundaries.

Leveson is working with a subfield known as cybernetics. She chooses the word systems instead of cybernetics, which is not unfair, but I think cybernetics is more precise. The overarching idea of any systems theory is that there are principles about how systems work that are common to all systems, regardless of what the system is. So basically, humans work the same as airplanes, just with different technology.

Both of them are made up of a bunch of individual components, all doing their own thing, that work together to create these emergent behaviors. With humans, it's little blood vessels, hormones, and cell walls. With aircraft, it's electronic systems, mechanical systems, laminar flow of air. But ultimately, systems theory asks: how do you take all those little components and get things like flight or human movement out of them?

Cybernetics is one particular approach, which says that we look at everything using circular causality—feedback loops. A feedback loop is just where one component sends commands to another component, which produces outputs that go back to the first component. The really simple example that everyone uses is a thermostat. The thermostat, if it's too cold, it tells the heater to turn on. When it gets hot enough, the thermostat tells the heater to turn off again.

You can imagine this little loop with arrows going from the thermostat to the heater back to the thermostat. You can make that as complex as you like because you can put another component monitoring the thermostat, and another component monitoring that component. You can have another component, which is measuring patterns in the whole thing and trying to predict when the thermostat is going to turn on or off, and preemptively manage your heating so that the heating comes on and off before the temperature drops too low.

Eventually, you build up a massively complicated computer that's as sophisticated as a human. That was the dream of cybernetics, basically: that we could build humans out of massively complex arrangements of feedback loops. It never got that far. There are actually some mathematical reasons why it would never have worked. So the field of AI went off in a bunch of different directions.

That's why cybernetics isn't talked about a lot today but appears in all the sci-fi from the 1950s and 1960s. You always hear people talk about cybernetics because people thought the future of technology was mimicking any system using transistors creating feedback loops. But this is where that whole criticism comes from that you hear from people in safety, particularly in new view safety, complaining all the time about linear views of safety.

Ultimately, that comes from this idea of systems theory. A linear system is usually just one that doesn't have feedback. The tricky thing about feedback is, imagine trying to work out the chain of events. If we don't have feedback, then A causes B, which causes C. But the moment you get feedback, C causes A. So trying to unpack it going backward doesn't work, because you just end up in an infinite loop.

That's the whole complaint that Leveson has: you try to model things as chains of events, and you just end up in circles because real systems always have feedback. As a very classic Leveson example, she's got a diagram in the paper that I recommend anyone looks at, because this diagram explains the way Leveson sees the world. The lowest level of the diagram is a physical process. Maybe it's a pump, maybe it's an engine, maybe it's a circuit. It's being controlled by actuators and being monitored by sensors.

That's your basic little feedback loop. But then you keep going up through the diagram and eventually you get to Congress making laws and holding hearings. That's a feedback loop too. They're all connected by different layers of feedback. So Congress is connected to the regulatory agencies, which are connected to higher management in companies, which is connected to middle management, which is connected to engineering processes, which go sideways to the manufacturing processes.

All of which comes down to whether the engine is getting controlled properly by the actuators and the feedback. To understand the system, you've got to understand that whole picture. You can't just say the engine broke.
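To make that circular-causality idea a little more concrete, here is a minimal sketch, in Python, of the thermostat-and-heater feedback loop Drew describes, with a second monitoring loop layered on top. The class names, setpoints, and numbers are illustrative assumptions, not anything taken from Leveson's paper.

```python
# A minimal sketch of circular causality (feedback loops), following Drew's thermostat example.
# All names, setpoints, and numbers are hypothetical illustrations, not Leveson's model.

class Heater:
    def __init__(self):
        self.on = False

    def apply(self, temperature):
        # The heater's output feeds back into the temperature the thermostat will sense next.
        return temperature + (0.5 if self.on else -0.3)


class Thermostat:
    def __init__(self, setpoint):
        self.setpoint = setpoint

    def control(self, temperature, heater):
        # Control action: the thermostat commands the heater based on the sensed temperature.
        heater.on = temperature < self.setpoint


class Supervisor:
    """A second feedback loop layered on top of the first, monitoring how well it is doing."""

    def check(self, temperature, setpoint, tolerance=2.0):
        if abs(temperature - setpoint) > tolerance:
            print(f"  supervisor: temperature {temperature:.1f} is drifting from setpoint {setpoint}")


heater, thermostat, supervisor = Heater(), Thermostat(setpoint=20.0), Supervisor()
temperature = 15.0
for step in range(10):
    thermostat.control(temperature, heater)   # the sensed temperature causes a control command...
    temperature = heater.apply(temperature)   # ...and the command's effect feeds back into the next sensed temperature
    supervisor.check(temperature, thermostat.setpoint)
    print(f"step {step}: heater {'on' if heater.on else 'off'}, temperature {temperature:.1f}")
```

The point of the sketch is just that you cannot unpack this as a one-way chain of events: the heater's behaviour is a cause of the thermostat's next decision, and vice versa.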

David: In this section on safety and reliability, I guess what Leveson is saying is that reliability is typically a lot about individual component reliability. It's sort of a RAMS assessment, kind of. Here she's saying, actually, safety as a property of the system isn't the sum of the reliability of the individual components. Basically, that's the argument behind saying this is not an assumption that we should hold if we're trying to improve safety.

Drew: Yeah. She says, in fact, that we should really just think about them as two separate things. All of our systems are designed both for reliability, which is getting the mission done, and for constraint, which is getting the mission done in an acceptable way. The biggest constraint is usually about safety, but things like environmental laws fit in there as well.

There are only some acceptable ways of doing things. Our systems have to manage both getting the stuff done and keeping it constrained. We've got to understand how all our management processes, our laws, all our cultural influences work to do control and feedback to keep those two things in check.

David: Can we move on to the second section now? You're up for that?

Drew: Yup, let's go for it.

David: Awesome. The second section, Leveson titles, Retrospective vs. prospective analysis. Really, do we analyze what happened or do we try to analyze and predict what might happen? Leveson states this assumption at the start. There's a general assumption that retrospective analysis of adverse events is required and perhaps the best way to improve safety.

That, I think, might be a generally held assumption in safety. Let's look at incidents that occur to see what we need to do to improve. In fact, that's probably one of the most central processes we can have.

Drew: Yeah. I'd say if there was one thing that almost everyone in safety agreed on, it would be that one.

David: It might be. You mean agreed that that is true or agreed that that is a general assumption?

Drew: I would agree that it's true. Even the people who say it's not true still write entire papers where they're analyzing past events. We all try to use bad things as ways of explaining how to do the good things.

David: Even practitioners who don't think the incident investigation is particularly useful still do lots of incident investigations in their role. Got it. Okay. Let's consider that a central logic for safety. How does Nancy break this apart? You did a history of Airbus in this section as well.

Drew: David, I don't know if this is just the way my mind works, but I tried to give an example, and then I worry about whether the example is correct, so I go and have to look up the history of it.

David: Let's frame a few of these points and I'll throw to you for examples.

Drew: Sure. But, just to pull back the curtain, we've got like two sentences in our script here that involved an hour of deep diving into trying to understand accident reports about the Mars Polar Lander.

David: All right. The core theme of Leveson's argument here is that looking at what has gone wrong in the past is never enough. Accident analysis has been pretty successful in fields where the basic technology changes very slowly. She's called out here things like aeroplanes, trains, and nuclear power stations. Feedback loops make the designs better.

If an accident occurs, I can feed forward the lessons from that into the design of model B, model C, model D of that particular technology. She also says it can be a terrible feedback loop, to the point that I just made: if the accidents and investigations are happening much more slowly than the technology is changing, then there's kind of no point. If I learn about something that's going wrong with a model and I've already released the next model into production, then I've missed that opportunity.

Drew: Yeah. A couple of examples here are just how slowly planes change. I don't know who's been on a plane recently. David, I know you've had a couple of flights.

David: I've got one, and fingers crossed, about the time that this episode comes out I may even be overseas. It's a wild three weeks between now, when we're recording this, and then.

Drew: If you go on a plane, there's a good chance that it's going to be a Boeing 747, an Airbus A330, or one of the direct variants of those two planes. Those first flew back in the early 1990s. That's just how long we've had these fundamental designs. Before the A330, the previous one was the A300, which was 1972. So it's going back another 20 years.

Even though there aren't that many aircraft accidents, that's a lot of opportunities to incrementally improve the design of those aircraft to make them super, super safe. Anything that was wrong with the design, we can fix, and then keep flying the plane for 20 years afterwards, benefiting from that fix. It's worth doing a two-year-long investigation.

David: Yeah. I don't know if it's still true now. But I remember, when I was buying my first lot of cars, and you know how with cars every two or three years they'd roll out a new series, and then roll out a couple of models of that, there was always a saying: never buy the first model of a new series of cars. Just wait for the subsequent models where the engineers have had a chance to iron out all of the bugs in the first model.

Drew: Yeah, my variant on that is that aircraft are totally safe on average. But the very first year after a new aircraft comes out, and once they get sort of 20 years old and beyond that, that's not average. That's when the accidents are happening.

David: There’s a sweet spot. All right. Good to know.

Drew: So with planes, that's a really slow evolution of technology. Lots of opportunities to improve from accident investigation. I think most people would agree that even if they've got problems with accident investigations, the places where it tends to work pretty well are things like aircraft.

Compare that to the Mars Climate Orbiter, which is a really fun accident we should talk about some time. It basically crashed into the Mars atmosphere in September 1999. They're immediately investigating it because they've got another spacecraft—the Polar Lander—on its way. It's already been launched. It's already heading towards Mars, possibly going to have the exact same accident.

They rushed the investigation report of the Climate Orbiter. They get the report out in early November. Every section of that report basically says, here's what went wrong, and here's what you need to know about how this might affect the Polar Lander. So they're trying to beat that cycle, get the accident investigation out in time before the technology changes.

The Polar Lander arrives in December, a month after the accident report, and it crashes. It crashes for a totally different reason because it's using a landing system that wasn't on the Climate Orbiter. In fact, it wasn't on any previous spacecraft. There is no way we could have used accident analysis to fix what went wrong with the Polar Lander. The subsequent missions didn't use the same landing system. Even what we learned from that investigation, we couldn't use for future systems, except don't do this.

That's the problem with fast technology. Leveson says, "This is why our hazard analysis can't be based on looking at what we know goes wrong." Because what we know goes wrong is too late. The technology has moved on. We've got to have ways of analyzing systems that we have never seen before with failure modes that we've never seen before and interactions that we have never seen before.

Sometimes that means we've got to eliminate hazards that we can't even identify or predict. We need to have methods that can deal with things that we can't predict, which seems almost like a contradiction.

David: Yeah, it does in some ways. But I think the argument there, and the examples that you've given, are exactly why we can't simply think we're safe because we're not having incidents, or, if we do have an incident, think that by fixing that particular failure mode we're back to being safe again. I think that's the core argument Leveson has got here: that's not enough, because things will always keep failing in new ways.

She talks about hazard analysis, which might be a new term for many people in safety or in engineering. She says this has been used for very dangerous systems for half a century. We can identify the causes of accidents that have never occurred previously so that we can prevent them from occurring the first time. Instead of starting by looking at known failure modes, or at interactions among system components, we start by identifying these hazardous states.

What happens if the aircraft continues to climb? What happens if this state or mode of operation occurs? Then we try to determine how they might be made possible, and then we try to understand how we can adjust the design of that system to make that less possible or eliminate that potential state. As a non-engineer, have I described the hazard analysis type process there, or have I made a [...] of that?
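As a rough illustration of that framing, here is a small sketch that starts from a hazardous state and works back to candidate scenarios and design constraints. The hazard, scenarios, and constraints are hypothetical examples of the general idea David describes, not Leveson's STPA method itself.

```python
# A small sketch of hazard analysis framed the way David describes: start from a hazardous
# state, ask how it could be made possible, then ask what design constraint would prevent it.
# The hazard, scenarios, and constraints are hypothetical illustrations, not Leveson's method.

hazard = "Aircraft continues to climb past its commanded altitude"

# (candidate scenario, design constraint that would eliminate or reduce it)
scenarios_and_constraints = [
    ("Autopilot receives a stale altitude target after a mode change",
     "Altitude targets must be re-confirmed on every mode change"),
    ("Pilot and autopilot each assume the other is levelling off",
     "Responsibility for levelling off must be unambiguously annunciated"),
    ("Climb-rate feedback is delayed, so the controller overshoots",
     "Climb commands must be bounded whenever feedback is stale"),
]

print(f"Hazardous state: {hazard}")
for scenario, constraint in scenarios_and_constraints:
    print(f"  How it could happen: {scenario}")
    print(f"  Design constraint:   {constraint}")
```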

Drew: No, I think you've got that spot on, David. To move towards how Leveson implements this, I think we need to move on to the next section and talk about the causation model. Leveson says that in any sort of accident investigation, we've really got three levels that we look at. The first level is the basic, what she calls, the proximal event chain. I really like the term proximal. It gets away from 'root cause' or 'main cause'; it's just proximate.

The proximal event chain is the things that happen directly associated with the accident. Then we've got the conditions that allow those events to occur, and then we've got the systemic factors that contribute to both the conditions and the events. I'd be interested in your view about how fair this is, David.

She says most accident analysis techniques are pretty good at identifying the proximal chain of events, and they're pretty good at identifying the conditions underlying those events. But that's because they rely on an assumption that cause and effect are directly related. So as long as you can see the cause is happening, you can see that this causes that, you can get the chain of events very easily, and then you can get the conditions fairly easily because you can draw a direct arrow between each of the conditions, and one or more of the events that happened.

David: Yeah, and I think the use of the word cause there might be particularly problematic. But I think what has been described there about accidents is a reflection of what typically happens. I also like the word proximal as opposed to a sharp and blunt end sort of distinction. You hear sometimes in an incident investigation where people will say, oh, we went right back to the start of a shift and looked at everything that happened from the start of the shift all the way through until hour number six when the incident occurred, thinking that they've done this very long broad understanding of what happened. But that's a snapshot in time of the system functioning as a whole.

Then there are people who go, okay, turned up to work, then did this, and then that contributed to this, then that contributed to this, then this decision was made, and then this and this and this. Okay, we know exactly what happened. We know the conditions that were present at the time, therefore we know why it occurred, then this big lead to, and therefore we know how to make it not happen again.

Drew: Leveson says the reason systemic factors don't show up a lot in accident reports is just because it's so hard to draw a causal link; they're too indirect. She didn't really talk about the psychology of investigators, but I think what matters here is that people feel uncomfortable, or they feel that they might be challenged, or even just that their techniques don't support this idea of putting in a cause that you can't directly link to the accident.

The example that I was thinking of is why, when applying things like [...], people so often, when they go to organizational factors, just put down things like supervisory actions, because you can make a causal link between the supervisor and the work. Even though it might be true, it's very indirect to draw a link between the CEO and the work.

David: Yeah. Actually, I might do that after, Drew. Do you want to just talk about Leveson's critique? It's probably a kind word—critique. Leveson takes a real shot at the 5 Whys method. Do you want to uncover what that looks like?

Drew: Yeah. I always get frustrated when Leveson does this to people that I like, or things that I think she's being unfair about, but she doesn't mince words. She just chooses something that she doesn't like, directly calls it out, directly names it, and directly criticizes it. She calls out certain techniques, in particular the 5 Whys method. Why is 5 Whys bad?

Just for anyone who doesn't know, the idea of 5 Whys is you see something happen and you're just, okay, sure that happened, but why? The idea is you do that five times, the five being a bit of an arbitrary number. It doesn't have to be five, but it just takes you backward, and it's supposed to lead you to more distant, more systemic causes.

She said, each time that you ask why, you don't find the whole possible stretch of answers, you just find a small number of answers to that why question. Your answers are going to be different for different people, so they're not repeatable. More importantly, why do different people come up with different answers? Because people say what they already know or they already assume is important.

We're not actually spreading out to all of the answers to the why. We're just having one or two answers, which takes us down a narrow path. The result is that even though we think we've got a technique that's supposed to help us get towards systemic answers, actually, it doesn't. Actually, it only helps us with what we already know of it, which tends to be very strictly causal and misses out on the full range of indirect causes.

David: I think with incident investigation like this, we have had a lot of podcasts on this safety work idea, in terms of organizations just wanting a simple, clean answer. So that's why there's still this desire for the root cause. You know this idea of root causes: if you look at an incident, say you pick Chernobyl, I know it's been claimed that the root cause of Chernobyl was someone not following procedure, or that the root cause of Chernobyl was a complete failure of the economic, political, and energy approach throughout Europe.

The example I was going to give before was my own involvement in an incident where the root cause was someone used the wrong tool. When you actually went in and tried to really get underneath that, the person selected the wrong tool because that tool was available to them, because they wanted to get their job done quickly, because time was becoming increasingly important, because the company had just put in place a new scheduling system to monitor productivity.

They had put the wrong timings for certain jobs into that system and were encouraging some jobs to be done much faster than they could ever be done. That was being done because of the financial position of the organization, due to it having just been acquired privately from a government organization, and due to the need for shareholder returns from the parent company.

So a person using a tool was related back to transfer of ownership and financial targets within a shareholder company. I'm not saying that's a direct causal chain, but if I then put an investigation report on a table that says the owner of this particular company needs to change their financial targets, that's a strange conversation for an organization to have in response to a safety incident. That's why these things never quite get there. But that's the principle Nancy stated at the start, which is that we're never actually fixing the causes of what's actually happening in the business.

Drew: Yes, and particularly, she says, we're focusing on the wrong causes. There's a direct quote that I think I'll just give the exact quote because it says it better than I can. She says, "Accidents are often viewed as some unfortunate coincidence of factors that come together at one particular point in time and lead to the loss. This belief arises from too narrow a view of the causal timeline… Systems are not static. Rather than accidents being a chance occurrence of multiple independent events, they tend to involve a migration to a state of increasing risk over time. 

A point is reached where an accident is inevitable (unless the high risk is detected and reduced) and the particular events involved are somewhat irrelevant. If those events had not occurred, something else would have led to the loss."

She's directly saying the particular events don't matter. Any number of those particular events could have happened. If we try to focus too much on those individual events, we miss the things that would have led to other events. She also says in the paper that we should sometimes consider what else would have happened if we'd put in place our recommendation: sure, our recommendation would have stopped this, but then something else would have happened instead.

David: All I was going to say is that we encouraged our listeners at the start of this episode to go back to our previous episode on modeling dynamic risk from Jens Rasmussen. We said we were going to draw some distinctions. If you close your eyes and listen back to that quote that Drew read out, it could easily have been lifted straight out of Rasmussen's paper that we discussed last week.

Drew: Yeah, David, I think you made the claim that Rasmussen made a fundamental change in how we think of accident causation, and no one else has ever shifted us back away from that. I think we can say here that this is the Rasmussen causation model. Leveson is just lifting it directly and applying it.

David: I think that's really interesting. When we think about some of the authors that we speak about quite often, I think they would all put their names to that quote that was read out. They maybe just emphasize different points and have different ways of trying to get there.

Drew: We've speculated before about different ways of teasing out different safety beliefs or safety assumptions. I think this particular model would be a really interesting litmus test to just throw in. I don't know if you've noticed how much LinkedIn recently has just been overpopulated by polls. If anyone I know likes a poll, it immediately comes up on my feed even though it's irrelevant to me.

This would be a great thing to just do a yes/no poll on. Here is a model about how accidents happen. Do you think that is the way the world works or don't you? I'd be just curious how many people completely buy into this and how many people have reasons for rejecting it.

David: We've got over 3000 followers on the LinkedIn page. If you write the poll question and the answers, I'll post it up before we publish this episode, and it'll be interesting. We might have the results beforehand.

I thought where you're going to go with that poll was to ask people yes/no. Are the events involved in an incident relevant for making improvement? Yes or no?

Drew: I think you got to give the full quote for context there. I think often we do dumb things down to such simple statements that we cause false controversy in safety.

David: Very good point. I was being quite antagonistic with that question. Good.

Drew: You had something else in our notes here about mental models that I thought might be worth talking about for a bit.

David: We've always got to remember when things get published. This was published in 2011. There's a model there that says, here's the actual model, here’s the actual system, which is, I suppose, from an engineering point of view, as built, this is what the system actually is.

Then there's a model, which is the designer's model of the system, so what's in the head of the designer. Then there's the operator's model of the system, what's in the head of the operator, and she talks about the gaps between those three mental models or three representations of the system. I like the way Woods talks about this, where he says things like, the system always does what it was designed to do, it's just not what the designer intended.

I'm thinking that even the language that she's using, we would now interpret as very closely aligned to work-as-imagined and work-as-done. How does the designer imagine it's going to happen? How does the system actually work? Those perspectives.

I just thought that was good because it's 2011 before some of those things became popular. She hasn't actually used any of that language, but she's trying to make the readers aware that there are these different perspectives of the system or something. So I just thought it was a useful tie-in back to other things that we talked about a lot in safety.

Drew: Yeah. Just a slightly indirect point there, David. One thing that sort of curses safety science a little bit as a field is we tend to cite less than other people do. If you look at most cited papers, I think some of the really important papers in safety are cited less than they would be if they were in a niche medical field or something like that.

I'm always curious with things like this. How much people have independently come up with the ideas versus how much they have read each other and been influenced by but just not directly cited versus how much they're just sort of drawing on the same deep well of ideas?

A lot of what Leveson's doing is drawing on a deep well of cybernetics theory. I wonder if maybe some of those other authors like Rasmussen are influenced by that, whether it's because Leveson has directly read lots of Rasmussen and influenced by it, or whether she's actually going to be sitting in a room having a conversation with Hollnagel and they both then go away and write different papers that express that conversation in different ways.

David: Yeah, or standing in front of a whiteboard and genuinely trying to come up with something new or trying to figure something out for herself. A good point about citations is that in safety we can't just assume that because a person hasn't referenced someone, they weren't influenced by their ideas.

Drew: On the other hand, maybe this is the way the world works and just two separate people discovered it.

David: Yes. Let's do some practical takeaways. Do you want to get it started?

Drew: Okay. The first one is, Leveson in this paper isn't talking directly about her most popular technique called STAMP. All of the ideas in this paper are what gave rise to STAMP as a technique. If you're looking for, how do I do all this stuff that Leveson is talking about? She does give much more direct instructions elsewhere. If you like the ideas, it is worth trying out the STAMP modeling technique.

I've given lots of students this technique to do and students generally find it a bit harder than other techniques. But I think it's simpler than things like FRAM, which are even harder still. So it's a good middle ground between very difficult modeling techniques and very simple modeling.

The great thing is that even if you don't do the technique exactly the way she says to do it, I still think you get something out of trying to map out the feedback loops inside your organization's processes. I've often said a safety management system is supposed to be a set of feedback loops. If you can't draw it like that, then maybe the system is not working the way it's supposed to be working.
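For anyone who wants to try that, here is a minimal sketch of what mapping a safety management system as control-and-feedback loops might look like. The controllers, control actions, and feedback channels are hypothetical examples; this is not the STAMP notation itself, just the "can you draw the loops?" check Drew mentions.

```python
# A rough sketch of mapping a safety management system as a set of control-and-feedback loops.
# The controllers, control actions, and feedback channels are hypothetical examples; this is
# an illustration of the idea, not STAMP notation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ControlLoop:
    controller: str            # who or what issues the control action
    controlled: str            # the process or group being controlled
    control_action: str        # how control is exerted (rules, permits, design changes)
    feedback: Optional[str]    # what information flows back up; None means no feedback channel

sms_loops = [
    ControlLoop("Regulator", "Company management", "licence conditions", "audit findings"),
    ControlLoop("Company management", "Site supervisors", "safety procedures", "incident reports"),
    ControlLoop("Site supervisors", "Work crews", "permits to work", "pre-start checks"),
    ControlLoop("Work crews", "Physical process", "valve operations", None),  # loop with no feedback
]

# Drew's point: if you can't draw the feedback arrow, that part of the system probably
# isn't working the way it's supposed to.
for loop in sms_loops:
    if loop.feedback is None:
        print(f"No feedback from '{loop.controlled}' back to '{loop.controller}' "
              f"for control action '{loop.control_action}'")
```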

David: We used the word STAMP there; people can look that up and there are lots of papers. I'm just thinking that maybe, wherever we get to with this little run we're having on some of these foundational papers, we could do kind of a 'what is' series: what is STAMP, what is FRAM, what is a learning team. Maybe we could go back to the source and actually explain what these things are.

Drew: Yeah, that could be fun. I'm looking forward to trying to describe a diagramming technique through an audio podcast.

David: Medium is important, but we can paste some pictures somewhere. Drew, your second takeaway.

Drew: The second one is that investigations should focus on fixing the part of the system that changes slowest. If you're designing railways, then probably, it is the technology that changes slowest. You should be doing investigations that can come up with improvements to railway design. But more often, Leveson says, it's the broader company system that changes slowest, and the technology moves really fast.

So changing the technology is not going to help you. You got to change the system that produces the technology. I think you can do that within any single accident investigation. If Joe falls off a ladder on Tuesday while changing a light bulb, you got to say, what's the longest constant here?

Is it Joe? Is it the fact that we're changing light bulbs? Is it the fact that we were using a ladder in this particular case? Pick the part of it that changing it is going to change it for a broad class of people for a long period of time, not that's going to change this particular event that's already happened.

David: Yeah, I think that's a good reminder for our response to incidents in all organizations. The last point here that you've got is, again, playing back to that quote earlier: for dynamic systems, the exact frontline events, and intervening in those exact frontline events, really doesn't matter that much for improving safety. I think that's a fundamental mindset shift.

Drew: I quite like Leveson's approach to this, which is to spell out the three areas. We've got the immediate proximal events. Sure, in the accident report, describe them. We might not even need to evaluate them, we might just need to tell the story. This is what happened on the day. This is what happened during the week. This is what happened in the month.

Then have a section that talks about the conditions that allowed that to happen, and then have a section that talks about what are the broader systemic factors that we need to be aware of. Don't worry too much about proving the precise links between those things because proving the precise links just weakens our ability to talk about these systemic factors. Just treat them as separate topics that are worth thinking about when you're thinking about how to prevent accidents.
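If it helps to see those three levels treated as separate sections rather than one causal chain, here is a small sketch of a report structured that way, reusing Drew's ladder example from the takeaways. The field names and example entries are our illustrative assumptions, not a template from the paper.

```python
# A small sketch of an investigation write-up that keeps Leveson's three levels as separate
# sections rather than forcing them into one causal chain. Field names and example entries
# are illustrative assumptions, not a template from the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class InvestigationReport:
    proximal_events: List[str] = field(default_factory=list)    # what happened, told as a story
    conditions: List[str] = field(default_factory=list)         # what allowed those events
    systemic_factors: List[str] = field(default_factory=list)   # broader factors, no causal proof required

report = InvestigationReport(
    proximal_events=["Joe climbed a ladder to change a light bulb", "The ladder slipped"],
    conditions=["Uneven floor in the plant room", "No second person available to foot the ladder"],
    systemic_factors=["Maintenance tasks are scheduled without considering access equipment",
                      "Routine jobs are done alone because crew hours were cut"],
)

for section, items in vars(report).items():
    print(section.replace("_", " ").title())
    for item in items:
        print(f"  - {item}")
```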

David: Perfect, Drew. Thanks. Great takeaways. Thanks for the idea. We might just run off and create a new model of incident investigation so that companies can replace some of the existing systems they're frustrated with, like [...], TapRooT, 5 Whys, and others. So look out for that in 2022. Side hustle for Drew. Perfect. Drew, the question that we asked this week was, what exactly is systems thinking?

Drew: At its broadest, systems thinking is just adoption of that Rasmussen causation model. That is, that the accident arises from a change in risk over time, which then gives rise to the specific events. But what we need to look at is what's causing that change in risk over time, not necessarily those specific events, which could have happened in any number of ways, to any number of people, any number of times.

Leveson's particular brand of systems thinking involves two specific things. One of them is to extend what counts as your system as broadly as possible to include your design processes, your regulatory processes, your management processes. Secondly, understand that system as an interacting set of control and feedback loops, trying to do two things to maintain the mission and to constrain the way you achieve the mission.

To do Leveson's style of systems thinking, you've got to do all of that. More generally, it's really just the Rasmussen model.

David: Excellent. Great. That's it for this week. We hope you found this episode thought-provoking and ultimately useful in shaping the safety of work in your own organization. Send any comments, questions, or ideas for future episodes to us at feedback@safetyofwork.com.