The Safety of Work

Ep. 63 How subjective is technical risk assessment?

Episode Summary

On today’s episode, we discuss the subjectivity of technical risk assessment.

Episode Notes

As risk assessment is such a central topic in the world of safety science, we thought we would dedicate another episode to discussing a facet of this subject. We loop back to risk matrices and determine how to score risks.

Join us as we try to determine the subjectivity of risk assessment and the pitfalls of such an endeavor.

 

Quotes:

“The difference between an enumeration and a quantitative value is that enumeration has an order attached to it. So it lets us say that ‘this thing is more than that thing.’”

“I think this was a good way of seeing whether the differences or alignment happened in familiar activities or unfamiliar activities. Because then you can sort of get an idea into the process, as well as the shared knowledge of the group…”

“So, what we see is, if you stick to a single organization and eliminate the outliers, you’ve still got a wide spread of scores on every project.”

“We’re already trying pretty hard and if we’re still not converging on a common answer, then I think we need to rethink the original assumption that there is a common answer that can be found…”

 

Resources:

Are We Objective?

Risk Perceptions & Decision-Making in the Water Industry

Feedback@safetyofwork.com

Episode Transcription

David: You're listening to the Safety of Work Podcast episode 63. Today we're asking the question, how subjective is technical risk assessment? Let's get started. 

Hi, everybody. My name's David Provan and I'm here with Drew Rae. We're from the Safety Science Innovation Lab at Griffith University. Welcome to the Safety of Work Podcast. In each episode, we ask an important question in relation to the safety of work or the work of safety, and we examine the evidence surrounding it. Drew, what's today's question?

Drew: David, today, we're going to ask a question about risk assessment. Risk assessment, as you know, is a topic that's dear to my heart and has come up before on the podcast. 

A couple of episodes to mention: way back in episode 8—one of our more popular episodes—we talked about risk matrices. Then, in episode 40, we talked about one of our own papers on expert judgments of risk. In both of those episodes, we referred to some other research that compares what happens when you get different technical experts assessing the same risks and whether they come to the same answer. That's the question that we want to look at directly in today's episode. 

The broad problem is that if we say that risk is objective—or the term we usually use is positivist—we're saying that out there in the real world somewhere, there's a true value for the risk. It's just waiting for us to go and find out what that true value is. 

On the other hand, if you say that risk is subjective or constructed, then we're saying that risk is something that exists inside our own heads so we can't go and find out what risk is. We can just negotiate some sort of social consensus.

Now, normally that's just a sort of philosophical question. It's something that scholars of risk like to talk about down at the pub, at their conferences, or in their journals. It's also a really practical question that we can look at when we've got the same risk and multiple people are independently assessing that risk. 

If risk is really objective, then you'd expect people to be able to find out what that answer is. You'd expect them not to be perfect, but at least most of the time, come up with the same answer. If the risk of a particular thing is low or 10 on a scale of 25 then you'd expect most people assessing it to get an answer that’s low or an answer somewhere around 10. Maybe some people will give 8. Maybe some people get 12, but you wouldn't expect the answers to be spread all over the place. 

The paper we're looking at today is directly going to be looking at that idea—looking at whether if we give people individually the same problem to solve or the same situation to assess the risk for, do they give us the same answer?

David: Drew, when we talk about risk assessment, that's a word that I suppose a lot of our listeners and some professionals use every single day, but it's a very broad term. It can sort of talk about a whole lot of different practices and different ideas. 

Now, there's a couple of things I just want to test before we dive into the paper. The paper uses the word technical risk or quantitative risk sometimes. In the paper we're talking about today, when they're saying that, they're really just talking about risk matrices where you've got a category label, you replace it with a number, you add or multiply those numbers together. Then, you convert it back into a category label of risk. Now, I've always thought of this as semi-quantitative risk, not actual like probabilistic risk assessment.

How do you see those differences between a risk assessment you do with the risk matrix and what you might do as a risk engineer with an actual QRA?

Drew: That's a really good question to ask. I think the starting point here is that there is no such thing as semi-quantitative. It's a word that, I admit, people use, and I understand why they use it, but there's no such thing as something that is sort of a number. Risk is never a number. You can't do a purely quantitative risk assessment. 

The reason risk is never a number is because what we're always talking about, the underlying reality, if we think that risk is objective, is that there is a probability distribution of particular outcomes. What we're trying to do is describe that probability distribution. The question is how precisely do we want to describe it? 

When people talk about a quantitative risk assessment, what they're talking about is generating either point probabilities or generating that entire distribution as accurately as they can for a particular set of described outcomes. 

When people talk about semi-quantitative, really what they're talking about is treating risk as an enumeration. Now, the difference between an enumeration and a quantitative value is that enumeration has an order attached to it. 

It lets us say that this thing is more than that thing. That's the reason why people use risk matrices. It's what they're doing in the paper we look at today: we've got a whole bunch of projects and we're trying to put them in order from least risk to most risk. Somewhere in the middle, we're going to start drawing lines and treating risks differently based on where they are in that order. 

Now, it's not probabilistic, but it has lots of probabilistic questions underneath it. Very often people are talking about what they think the likelihood of particular outcomes is. They're just very seldom being explicit about those numbers.

David: Thanks, Drew. I think that's a really good short overview lecture for our listeners on quantitative and non-quantitative risk. Sorry to put you on the spot with that question. It was just, we talked about how important language is when we say risk assessment. The reason that I ask that question will hopefully make sense as we go through this paper.

What I don't want is some of our listeners who are probably more in the risk engineering world who go, yeah, but the people who are in this study weren't experts and the risk assessment method wasn't even a reliable risk assessment method. Of course, the outcome is what it was.

Drew: I'm glad you asked it. We've had a couple of people asking us to do an episode on qualitative risk assessments. Again, I think when we talk about qualitative risk assessments, usually we're talking about enumeration, in that we're trying to put things into an order so that we can then categorize them as acceptable or unacceptable. 

Now, interestingly, you're talking about a sort of expertise in risk assessment. There are very strong arguments that that is exactly what people are doing, even when they think they're doing very quantitative risk assessment. But really, the underlying process is really one about categorizing risks into acceptable or unacceptable. They’re just using numbers to describe those decisions.

David: Drew, I don't know how close this is to the topic of this week or risk in general. Since I've got your expertise on that as well, I’ll just be selfish and ask you another question. 

Say, we're thinking about how aligned we can get around the risk assessment. I think one of the challenges that we face in organizations and it'll talk a little bit to the method in this project, is that we're trying to assess the risk of a particular activity on a particular day. Really, that can only have two kinds of outcomes: it can go well or it can maybe not go well. 

If we think of driving, for example, we know in much of the developed world, or if we take the Australian example, that something like 1 in 10,000 of the population will be fatally injured in a motor vehicle accident every year. The statistics work out to however many hundred people across a population of 20-something million every single year—give or take, around 0.01%.

For one driver on one day with one journey, trying to create some sort of assessment of that particular activity becomes a very different proposition than risk at a population level. 

I'm just curious as to your thoughts—if we think about the question today, maybe you can answer now, maybe you can give your thoughts at the end—is it ever realistic to expect people to align around something that's so singular and so specific, as opposed to just aligning around these sorts of broad population-level risks?

Drew: I think if we are trying to get people to make point estimates of tiny probabilities, then we know that humans just aren't very good at that. If we're trying to get people to align on that sort of fundamental question of whether a risk is acceptable or unacceptable, then the people who talk about risk as socially constructed would say, that's exactly what we do. 

The question isn't what is the particular risk of this particular driver on this particular day. The question is, generally speaking, do we want to let this sort of person drive? As a society, we've come up with processes to make that decision. We've come up with licensing systems to make that decision. We've come up with fines and court prosecutions if we think people are getting that decision wrong. That's how we socially negotiate what is and isn't an acceptable risk of driving. 

I think when it comes to things like determining the approval of a project, determining the acceptance of a tender, or determining whether a particular product should be allowed onto the market, that there is a reasonable expectation that those decisions shouldn't come down to just the particular attitude of a particular person. 

People generally believe and act in a way consistent with this general belief that it's the process itself which comes to the answer, that a reasonable person following that process will get the same answer each time, and two reasonable people will come roughly to the same answer. 

Sure, there might be small individual differences, but that's because they're making slightly different assumptions or have access to slightly different information. There shouldn't be fundamental differences. 

You can imagine a meeting where we've got 10 projects on the table. Which of these projects are we going to approve? We don't want every single person in the meeting to look at every single project. We would rather have a process that if someone else has classified this as high risk, that we can trust that classification. Otherwise, what's the point?

David: Yeah, I like that. What you just said there, it reminds me, we haven't talked about our literature review that we wrote on the safety profession, but we did make a statement in there which has been quoted a few times, which is, “Safety isn't a standard to be achieved. It's a point of consensus among stakeholders.” Maybe that's true of risk as well.

It really doesn't matter whether it's high or medium—what matters is whether it's acceptable or not. Even in our risk matrices episode, episode eight, I think we said that at the end of the day, a risk matrix might just need to have two cells in it: one cell to list all the risks you don't need to do anything more about, and one cell to list all the risks that you agree you do need to do more about. Getting hung up on whether something is high, medium, low, extreme, or otherwise might not be as important as what you then go and do about it.

Drew: We said that a little bit facetiously, although I'm willing to stand by it as a conclusion. I do understand and accept that it's a reasonable question to then ask, well, how do we know which of those two boxes something goes into? Can you give us a process, please, for what goes in one box, what goes in the other box, and what the things in the middle are? Then suddenly, you've gone from two boxes back into high, medium, and low.

David: Let's assume that a risk matrix is one process people can choose to decide which of those two boxes a risk might go in. 

What we're going to test in this episode is whether we can get people in similar roles to individually assess the same risk and come to the same answer. If that process is working, then similar people looking at the same risk should come to a similar conclusion. That would be, I suppose, the hypothesis for the process. 

Drew, would you like to introduce the paper that you've chosen for today?

Drew: Okay. The paper is called Are We Objective? A Study into the Effectiveness of Risk Measurement in the Water Industry. Some background details: it was published in the MDPI journal Sustainability in 2019. I think, David, this is the first time on the podcast I haven't said "and this is a reputable journal." Sustainability is a rapid-publication journal and it's got a very broad scope. I'm cautious about the quality of peer review, let's put it that way. 

Remember that the quality of a journal doesn't really say anything positive or negative about any individual paper. It's just a background calibration for how closely you need to scrutinize the details. You're likely to scrutinize a paper in a good journal a bit less and trust it a bit more. When the journal is less reputable, then you've got to look more closely at the technical details. 

The authors of this paper are all from the University of Melbourne. They're Anna Kosovac, Brian Davidson, and Hector Malano. Dr. Kosovac is currently a research fellow and this paper comes out of her PhD project. I think you can take it as a given that the work was good enough to be awarded a PhD from the University of Melbourne. I think we should probably link the full PhD thesis in the show notes because it's a really fascinating look at how you've got these technical risk processes, you've got individuals with their own risk perceptions, and then you're feeding both of those things into decision making. 

The project is all focused around project approval type decisions in the water industry—with projects ranging from replacing pumps and pipes to deciding whether we should drink recycled water or put fluoride into the water supply. 

The other two authors, professors Davidson and Malano, were the supervisors for Anna's PhD. They both have deep expertise in water infrastructure, including on the business side. 

When it comes to how you should treat this paper as a source of evidence by default, I would assume that it's a trustworthy source of information about the water industry and take anything it says about how the water industry works at face value. I'm more inclined to read carefully when it talks about risk and the particular methods it used to test how people assess risk. 

David, do you think that's sort of fair from the paper?

David: Yeah, I think that's fair. I had a quick look through the reference list, Drew. I saw your research pop up in the reference list. The literature review didn't have a really deep discussion on risk—it was quite a high-level overview. I mean, it did cover some good history of risk matrices in organizations, individual and social perception of risk, and things like that. It was probably light on risk content.

Drew: Yeah. David, I think you've sort of uncovered my secret method of finding some of the papers for the podcast. I look up who has cited my work and then go and read it—if it cites my work, it's likely to be interesting to me. That is actually how I found this paper.

They didn't just cite us. They also cite most of the big names you would expect in general risk assessment, but not some of the people who were really critical about what risk assessment means or what risk is philosophically. It's sort of a very practical literature review about risk assessment.

David: The method that was undertaken in this research: they had 77 professionals from four different water authorities in Melbourne—the different local government or state government area water authorities—so obviously there was some good industry collaboration around this project. All of those 77 professionals were people who make decisions about water projects within their organization and they have experience following their organization's risk assessment process. 

Drew, I wouldn't call any of those participants risk experts or necessarily risk managers. I think this is a little bit important when we talk about some more of the methods. The selection criterion for participants was just that they had to have experience undertaking a single risk assessment on a project and to have decision-making responsibility for projects. They're not risk engineers, they're not necessarily safety professionals. They're project managers, project engineers, site managers, and people in other sorts of operational roles on projects. 

Now, that doesn't matter one way or the other. I just think it's important for listeners, when they think about the findings we talk about, to think about who the people were who were making these assessments.

Drew: That's an interesting one, David. When we're evaluating any process, I think it's important that we evaluate it with the people who are using that process. If this were a study of how risk matrices work in the water industry and they'd picked 30 students from the University of Melbourne, then it would be reasonable to say that the students were disagreeing because they weren't familiar with the process. They're not the people who actually use it day-to-day. They don't understand the projects. 

This is not a process which has been designed for use by risk experts. This is a process that is designed and is used by exactly the sort of people who are involved as participants in the study. I think that is actually the right set of participants. We can ask later whether those are the people who should be making the risk assessment decisions. That's another discussion we can have.

In terms of how you evaluate a method, absolutely—the perfect way is to get the people who are supposed to be the ones using the method and get them to do it. Each participant was given seven hypothetical projects and they were asked to score the overall risk of each project based on likelihood and consequence. 

I have to admit, this is not a task that I've ever had to do myself. It's more of a project manager style risk assessment. I'm much more used to assessing the risk of particular outcomes, whereas this is much more about classifying the overall project: what's the likelihood and severity of the worst thing that could happen on the project? 

They were asked to use their own organization's methods. The participants are in four groups from four organizations, each one using a slightly different method that the authors claim is basically coming from the same standard. For three of the organizations, this means that they're using a 5x5 risk matrix. Refer back to our episode 8 for why you should not do that. That's the standard that they were following. 

Then, for those three organizations, the total score is just the likelihood score multiplied by the consequence score. The lowest possible score is 1, the highest possible score is 25. There are some numbers in between that are impossible to get because you can't make them by multiplying those numbers together—a score of 17, for example, is impossible.

For the other organization—organization number four—they do it a bit differently. They get their risk score by adding together the likelihood and the consequence. The researchers included them by taking that added score and scaling it up so that it's spread between 1 and 25. 

They then took the square root of that final score from each person and used the square root as their main value. We'll talk about that one again in a moment. The end result is that for every 1 of the 7 projects, they have 77 separate scores coming from 77 different people. Everyone is using the same information. Apart from the organizational differences, they’re all using the same methods. If the process is working correctly, you'd be expecting them to come to roughly the same answer.
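To make that scoring arithmetic concrete, here's a minimal sketch in Python. It enumerates the scores a multiplicative 5x5 matrix can actually produce (confirming, for instance, that 17 can never occur), and shows one plausible linear rescaling of an additive 2-10 score onto the 1-25 range—the paper doesn't spell out the exact rescaling formula, so that part is an assumption.

```python
from itertools import product

# Multiplicative scoring (three of the organizations): likelihood x consequence, each 1-5.
mult_scores = sorted({l * c for l, c in product(range(1, 6), repeat=2)})
print(mult_scores)        # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 20, 25]
print(17 in mult_scores)  # False -- some values between 1 and 25 simply can't occur

# Additive scoring (the fourth organization): likelihood + consequence, giving 2-10.
# One plausible linear rescaling onto 1-25 (an assumption -- the paper doesn't give the formula).
def rescale(additive_score: int) -> float:
    return 1 + (additive_score - 2) * 24 / 8

print([rescale(s) for s in range(2, 11)])  # [1.0, 4.0, 7.0, ..., 22.0, 25.0]
```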

David: I think just in terms of these seven different projects, to give people a sense of what the participants were actually assessing, they gave them four familiar projects. These are things like pipe replacement along a busy road, which is a task that these project managers and water authorities would do all the time—excavate, replace a pipe, manage traffic. Then, they gave them the three unfamiliar projects you mentioned, like whether we should put fluoride in the water or drink recycled water, or, for example, a hypothetical implementation of a new radiation-based water treatment method.

I think this was a good way of seeing whether the differences or alignment happened in familiar activities or unfamiliar activities. Then, you can get an idea into the process as well as the shared knowledge of the group and what might have been affecting it.

Drew: David, if you don't mind, we've talked about the positives of the method. You mentioned that they're using realistic participants, projects that the participants are mostly familiar with, and methods that the participants are familiar with. I do want to take a couple of minutes just to nitpick some problems with the methods. 

The first one is that you can read the entire paper and they don't actually say what the project descriptions are. I had to go back and find the original thesis to find the project descriptions. They were all pretty thin. They're just a couple of paragraphs about each project. That is a big problem because the less information people have, the more inconsistent they're going to be. They're going to be making all sorts of assumptions.

I reckon there are some people for whom, if you give them a very thin description, just the uncertainty attached to it being so thin is going to cause them to give a particularly high risk score.

David: This is one that I realized after you'd called that out in the notes about the original descriptions. If you had to come up with one number for a project—like I just mentioned earlier, replacing a section of pipe along a busy road—you've got to think about the excavation. You've got to think about the pipe handling. You've got to think about all the traffic management. Then, you've got to think about all the controls that you think you'd have in place in your organization, how many times you do that task, and then just basically put one number on the table for that project with a couple of paragraphs of information. Yeah, there's a lot open to interpretation.

Drew: The other thing is that the statistical methods that they've used are a little bit dodgy. Just a couple of examples of that. They say in the paper that when you multiply two numbers from 1-5, the results are going to be a little bit positively skewed, which is true. They then say, that's why they took the square root of the results, which is nonsense because the square root of a positively skewed distribution is still positively skewed. 

I have to admit, I actually sat down with an Excel spreadsheet and tested this out. Yeah, there's no way that taking the square root of the results adds anything except that it makes it look to the human eye a little bit more like a normal distribution. 
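For anyone who wants to repeat that spreadsheet check, here's a minimal sketch using only the Python standard library. It's an idealised version—it treats every cell of a 5x5 multiplicative matrix as equally likely rather than using the study's actual scores—but it illustrates the point: taking the square root shrinks the skewness, yet the distribution stays positively skewed.

```python
from itertools import product
from math import sqrt

def skewness(values):
    """Moment-based skewness: E[(x - mean)^3] / sd^3."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / var ** 1.5

# Idealised assumption: every likelihood x consequence combination is equally likely.
raw = [l * c for l, c in product(range(1, 6), repeat=2)]
rooted = [sqrt(v) for v in raw]

print(round(skewness(raw), 2))     # about 0.85 -- positively skewed
print(round(skewness(rooted), 2))  # about 0.29 -- smaller, but still positively skewed
```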

There are a few things they've done throughout that make it look like they're doing an analysis that's more sophisticated than what they're actually doing, but that don't improve the validity of the results. I think somehow they got it into their heads that they needed things that looked like normal distributions. They've worked really hard to make the data look normal when none of the tests they do actually need normal distributions for the results to be useful. 

On the other hand, they have some extra tests that they really should have done, which would have genuinely helped us out with the results that they didn't do. I'll talk about those as we go through. There are some questions that naturally arise that they could have answered with a little bit more of an analysis. 

I will say, though, that none of these statistical problems compromise the results. They're just annoying. They're the sort of thing that happens when you submit your paper to a journal that doesn't give it thorough peer review—the sorts of things that peer reviewers would find and fix.

David: The original aim of this paper was to see whether we are objective—it's in the paper's title. How they would define objective: these scores, which come out on a range of 1-25, would then be categorized as low, medium, high, or extreme, which our listeners will be very familiar with. It's typically a green, yellow, orange, or red colored box on your matrix.

What the researchers were looking for was whether the individual assessments would all come out as, say, high, or all come out as medium, or all come out as extreme. They weren't necessarily expecting the individual numbers between 1 and 25 to be the same—all 8, 10, or 12. What they were hoping to find—or maybe not hoping to find, but one of the possible outcomes—was that the majority of assessments for each individual project would align in one of those four categories.

Drew: Yeah. You would hope that at least people could agree that the risk is low, or you have a mix of low and mediums, or a mix of medium and highs, or a mix of highs and extremes.

David: Drew, before we jump into the results, for our listeners who are listening through now, the minimum possible score is one. If it's a likelihood of one and a consequence of one, multiply them together and it's a one. The highest is 25, which is obviously a certain likelihood and a multiple-fatality event—the risk score is 25.

Have a think about familiar projects in your organization. Think about what range within there would you expect to find? Whether it's 8-12 or 10-20, have a think for yourself, if you gave this task to 20 people in your organization, what range of values would you expect?

Drew: And, what would you consider to be like a nice tight clustering? What would you consider to be a way too spread out? Have we stalled long enough, David?

David: Yeah, Drew. It just still baffles me.

Drew: Yeah. For three of the projects, the range of scores went from 1-25 which means quite literally, we have a project which is going onto a busy street to replace a stretch of pipes. One person thinks that the worst thing possible is almost impossible and will be insignificant. Another person thinks the worst thing possible is almost certain and is going to be catastrophic. 

One person would be comfortable doing it every day without much precaution. The other person thinks the company should never, ever, ever do anything like this particular project. That's three of the projects. 

Two of the projects had a much, much more conservative range—their scores ranged from 2-25. Then, for the others, the range was from 1-20, so there was no one up in that catastrophic range, but all other scores were given. 

This means that for all seven projects, for the exact same project, depending on which assessor you got out of the 77, you could get a score that either said the risk is negligible or the risk is extreme. The rest of the paper is just testing out, what are the possible reasons that this could happen?

David: Drew, this is a really good discussion. I'm hoping or expecting, at least I was, for our listeners to be going, gee, I think I've got an answer for why that might have happened, maybe it was this, or maybe it was that. 

I think we should talk through some of these tests, because the researchers did do a pretty good job. When they got those results and analyzed them, there's a couple of pages where they try to see if they can explain that range. Even if the researchers didn't expect a clean answer to "are we objective?", I still think the results they got would have surprised them.

Drew: The first and, I think, the most obvious possible explanation is that it's just outliers. You interviewed 77 people. One of them is going to just give one, one, one, one, one. Someone else is going to give five, five, five, five, five. The most obvious treatment is to cut off the most extreme scores. They did that in this study using standard deviations—sort of, how many are within one standard deviation? How many are within two standard deviations?

Given that they're assuming a normal distribution, all this just means is you take either the middle 95% of answers or the middle 68% of answers. Take the most conservative one, just the middle 68%: we cut out 30 of the people who responded and consider those to be too extreme. 
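As a rough illustration of that trimming step—with made-up scores, not the study's data—here's a sketch that keeps only the scores within one standard deviation of the mean (roughly the middle 68% if the scores were normally distributed) and reports the range that's left.

```python
from statistics import mean, pstdev

# Hypothetical risk scores from 20 assessors for one project (illustrative only).
scores = [1, 3, 3, 4, 5, 6, 6, 8, 8, 9, 10, 10, 12, 12, 15, 15, 16, 20, 20, 25]

mu, sd = mean(scores), pstdev(scores)
trimmed = [s for s in scores if abs(s - mu) <= sd]  # roughly the middle 68%

print(f"all scores: {min(scores)}-{max(scores)}")                       # all scores: 1-25
print(f"within one standard deviation: {min(trimmed)}-{max(trimmed)}")  # 5-16
```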

The project with the greatest consensus—the tightest grouping—still ranges between 3 and 10, which is between low risk and medium risk. Then, all the others are considerably worse: between low and high, or medium and critical.

David: I think, for our listeners: so far, for none of the projects that were assessed—even taking just the scores within plus or minus one standard deviation, that middle 68%—did they get those people to cluster within the same level of risk.

Drew: Yeah. Basically, this means that even when we exclude all the people who are most disagreeable, we still can't agree on which band the risk is going to go in. 

Now, what we don't know—this annoys me because the researchers should have checked this—is whether it's the same people every time. Are there some assessors who are consistently giving very low scores or very high scores? Or is it that everyone is varying their scores project to project? Sometimes one assessor is out of the ballpark, sometimes it's another assessor.

If it was consistent, if it's the same group of people who are always very low or very high, then you can explain that, you can train it away. You can prevent those people doing risk assessments. You can fix the problem. That would be really good to know. 

The second possibility is you could say, well, maybe it's because they're for different organizations. Maybe the reason they can't agree is that each organization does different types of projects, has different risk tolerance. 

When you look at the organizations in isolation, you do get a slightly tighter range of scores, and each organization assesses each project a bit differently. That's what you'd expect if you randomly took subsets of the data anyway. When you randomly take a subset, you expect it to have a smaller spread.

Again, there are statistical tests that they could have done here to see whether the organizations are systematically different that they didn't do. Even without that testing, you can see that the differences between the organizations aren't enough to explain the differences between the individuals.

David: I think this finding is important because our listeners are probably unlikely to care about what happens across four different organizations. They're more likely to care and be interested in what happens within their own organization. And that range still existed when you had people who possibly had a much more closely aligned set of assumptions. 

Even if they're referring back to, what process does my organization follow for this project? What risk controls do we have in place for this type of activity? What's our incident history associated with this type of project? Possibly a much closer set of assumptions. You still can't tighten up that range of assessment outcomes.

Drew: Yeah. What we see is that if you stick to a single organization and you eliminate the outliers, you've still got a wide spread of scores on every project. 

The third possibility is that it's the type of project—that's why they gave some very mundane projects and some slightly left-field projects. I think what they were expecting is that people would be very consistent on the mundane projects and disagree a lot on the novel projects. What they found is that people think the novel projects have higher risk, but they don't have any more agreement about that. They've got a higher average score, but the spread is still the same.

David: I think this is, I would have thought, somewhat to be expected. If people are assessing something they're familiar with, they might have some overconfidence in how it's performed—they may not be hurting people every day doing these tasks. When you start asking people about radiation treatment methods and things like that, they have to stop and think about how they might actually do that project. There might be things that they think their organization doesn't have controls for.

Just putting myself in the shoes of participants and putting myself through that thought process, this was something that I could see how those scores might be higher on average. But, when you've got a 1-25 spread, it's just as wide anyway, isn't it?

Drew: Yeah. It's not really possible to get worse than a 1-25 spread when you move from the mundane to the novel projects. David, at this point I really think we're left with two possible explanations for the results. 

The first possibility is that the assessors genuinely are that bad at coming to agreement so that people genuinely, routinely in the real world, when making these sorts of assessments, are being very inconsistent with each other.

The other possibility is that there's some factor about real-world projects and real-world risk assessments that makes them perform better than they did in this particular study. For example, maybe there's some information that normally goes into those scores and the researchers have left that information out. The participants have had to guess, they've all come up with different guesses, and that's why the results are spread. 

David, I realize you don't have information, you're basically guessing, but if you had to guess, which one would you say it is? Do you think it's an experimental effect or do you think that this is a real world thing that's happening?

David: Look, I don't want to sit on the fence too much, but it could be a combination of both. I think there's an experimental effect here. I think asking people to come up with a number about a job where they get a couple of paragraphs of information means that there's a lot of guessing and assumptions. There's a lot of individual thought processes that are very hard to pin down with the experimental design. 

We still see it in organizations in practice all the time. You might just be talking to a group about working at heights and you'll have some supervisor somewhere that thinks, it's a high risk. Then, you'll have some supervisor who thinks it's a low risk with a very specific situation, the same sort of task, the same training, the same fall arrest, same whatever it is. I do think there's some real world individual differences in how people perceive risk.

Drew: Yeah, I think the same. I do wish that the researchers had done a little bit more to interrogate the real world effects of this. I think they had access to the organizations, had access to the people. They could have explored the reason for the outliers a bit more directly in the study. 

Also, I know that there have been previous studies that have done that extra step and it hasn't fixed the problem. There have been people who've done similar work where they've tried to work out what are the different assumptions that people are making when they get different results. Then, they've got the participants to agree on the answers to those assumptions and redo the risk assessment. It hasn't really helped.

I think the theory that the differences come from different assumptions is partly true. But there's also a problem that's bigger than just a lack of information causing the different risk assessments.

David: Maybe you don't know what you're going to find, so you don't always know the questions to ask when you're doing research. If you're doing a PhD, the objective of doing a PhD is to get the PhD, not necessarily to solve every problem that you come across through the process. Maybe they didn't expect to get this range of results, so they didn't have all of the design that they might have needed to understand it. 

You're right. Yeah, it would have been fascinating to go back. I would love to go straight to the 77 people and give them a very specific risk. Like, okay, what is the risk of a work group being hit by a car—and give them the road, the time of day, all of the controls that are in place, the speed the cars are going, how many cars are going past per half hour, how many people are in the work group, what task they're performing, and what the weather and visibility are like. 

Just give them a really, really specific situation. Ask them to assess that risk with almost no assumption left to be made, and almost do it like you said, clarifying every single question as they assess it. That would be a really interesting study. Maybe, Drew, if these studies have been done, you would know about them. It would be interesting to see whether the clustering gets tighter in that sort of situation.

Drew: Yeah, I hear absolutely what you're saying—if you're doing a PhD, at some point you just have to graduate. You can't just keep doing extra studies to explain your results. You try to explain the results and you find even more things. 

The next study that I would like to do is actually just to tell everyone their own scores, tell them what the distribution was, and just ask them to talk about it. You gave 25. Other people gave answers ranging from 1-25. What were you thinking? Why do you think your score was so different from the others? Just find out what they think the reason is that they disagree with other people.

David: That would be a really fascinating research question because the answers you would get would be an understanding of the thought processes that people take. If you can cluster those around 3, 4, 5, or 10 common thought processes that people adopt when they do risk assessments, then you can think about how your process can push people into one of those thought processes so that you actually know how they're approaching the particular assessment task. Or you can start to understand and interpret the differences that you get based on some of those common ways that people think about this.

Drew: Yeah, I guess underlying this, I've always got the suspicion that the person who scores things as 1 is the person who knows that if their project has anything higher than low risk, then they're going to have to do extra paperwork. The person who scores it as 25 is the person who just wants that project to be canned anyway. They think the easiest way to get rid of it is to mark it as high risk.

David: Yeah. That's not even cynical anymore. We know that's often the way that people use organizational processes to achieve all sorts of means and to achieve all sorts of goals.

Drew: Let's move on to the conclusion by the researchers. I've got a fairly large block quote here. Maybe we won't read it all out. Fundamentally, they're saying that the risk rating is “dependent upon the person who undertakes the assessment, despite the risk assessor being provided with identical information to other assessors and using the same organizational risk assessment process.”

What that means in practice is that if you're then using those scores to make decisions about allocating funding, then it's pot luck. If you get the assessor who gives you a low score, you're going to get funded. If you get the assessor who gives you a high score, then you're not going to get funded. That's not to do with fundamentals about the project; that's to do with who got asked to do the risk assessment. The researchers, quite rightly, consider that a bit of a problem when you're trying to allocate taxpayer funds using a supposedly objective process.

David: Yeah, Drew, that's a reasonable conclusion. The researchers seem to spend their time in the built environment, engineering sort of world, so they were particularly looking at how projects get approved, if you like—this idea that if a project has a high risk, then it may not get approved. But there are some other real-world problems in the safety space, not just about which projects get approved or not, that they didn't really make any conclusions about: what does this actually mean for safety practice within organizations, for understanding and allocating resources towards the mitigation of different safety risks?

Drew: That sounds like a good prompt to move on to takeaways. Just before we do, the final thing I wanted to say about the paper itself is that I see this result very much as putting the ball into someone else's court. Regardless of the limitations with what the researchers have done, they have got a reasonably valid answer to the question. 

The answer is these processes are not giving you consistent results. You can pick holes in that, you can pick holes in their methods, you can find limitations. But that's the answer that's on the table. The ball's now in the court of someone who wants to use these methods to demonstrate that, no, they can use the methods more reliably, more consistently than the researchers have shown. It's good enough that we should now accept this as the default answer. Unless someone proves otherwise, this is how inconsistent risk assessment is. 

David, I am really interested in your thoughts about what's the appropriate takeaway. I'm a bit disappointed given how good the rest of the research is that the authors went down the very predictable path. They found this really interesting thing. Their response is to say that, well, we've got to work out how come it's this bad and we've got to fix it. They've seen this process which has an international standard for risk assessment that has very standardized processes. It's very technical, it's very supposedly objective. It's very heavy on the process. Still, it's giving this wide variety of results. 

Their answer is, well, we've got to fix the process. We've got to put more process in. We've got to make it more objective. We've got to fight more to give people consistent answers. 

My personal opinion is that this misses the whole point. We're already trying pretty hard. If we're still not converging on a common answer, then I think we need to rethink the original assumption that there is a common answer that can be found and just accept that no, this is like proof or at least a demonstration that risk is not this objective thing that we can use very intensive processes to go out and find.

David: Yeah, I agree with what you said there. I like the way you're challenging this very basic assumption for all of our risk assessment. That there is a level of risk that's out there that's positivist or objective. Like you said, a true level of risk or real risk. We just need to keep searching for methods to get closer and closer to that true risk. 

Now, if we think risk is something more constructivist—that it doesn't actually exist out there, it only exists in our heads or in the collective heads of people in an organization—and if we can agree on that, then it becomes less about what the actual value is and more about what the alignment, the uncertainty, or the consensus is around a particular risk issue. 

Now, that might not sound practical for people, but I think it becomes quite practical quite fast, because risk should be a central idea for everything that we do in safety practice. Fundamentally, everything we do comes down to how important we think things are in our organization or how likely we think things are to hurt people. We've got to think about what we're basing all of those safety management decisions on in our organization.

In your organization, you've got these critical risks, and you've decided what your critical risks are based on where they sit in your risk assessment matrix. You do risk assessments at the start of every project to decide what you need to do. 

If risk assessment is a central part of your safety management practice, your risk matrix is at the core of that, and you have people doing these assessments, then this research should really make you rock back in your chair and ask some really big questions about how do you know you're doing the best things in your organization to manage safety.

Drew: Thanks, David. I think that probably leaves us in a good position for practical takeaways. We've each inserted a couple of things into this list. I'll go first, if you don't mind. 

The first one I've got is that this study shows that, at their very best, risk scores are a proxy for something else. If we're trying to use risk to help us make decisions and people are coming up with different risk scores, then really what's making those decisions is whatever underlying reason is causing people to have those differences. 

We need to have robust conversations about what's causing the disagreement and make our final decisions based on that instead of using risk as this Cold War type thing. 

There's one person who doesn't want to go ahead with the project and someone else who does want to go ahead with it, and we're making it a conversation about risk instead of a conversation about why one person thinks it's worthwhile and the other person doesn't.

David: Drew, I'm happy for you to keep going with your practical takeaways. I think I’ll throw one in at the end.

Drew: Okay. The second one is we should be ready for the fact that maybe the reason for these differences genuinely is personal. That when we're talking about making decisions under uncertainty, that is a fundamentally very human process. We've got humans, they're going to make different decisions and there's no magical process that turns those different humans into robots or cogs in a machine that produces the same answer. 

It really comes down to three choices. We trust the decision makers to use sound judgment; or, if we don't trust them, we get them to make the judgment together so that it's a consensus process, not an individual process. The worst case is where we force them to express their judgment in artificial terms like these risk tables. That doesn't give them better judgment or worse judgment—it just turns their differences into inconsistencies. 

Following on from that, my third one is that one of the reasons we use risk assessments, particularly in this sort of project, is to improve the transparency of decision making. These are public water companies or at least water companies where ultimately they're funded by taxpayer money. They're trying to make their decisions in ways which are transparent, that people can audit them, look at them, and see that the money is being used appropriately.

What this research shows is that risk assessment doesn't add transparency. It takes it away, because we're taking these differences between human judgments and trying to force them into risk categories. We're saying we've made the decision because the risk is low or because the risk is high. This study just shows that those are purely arbitrary categories. They're not giving us more information about how and why the decision was made the way it was.

David: That takes me back into episode 8 where you said, when you use a risk matrix and you give a risk a score like high, medium, or low, you always end up with less information than what you started with. You're dumbing down that information.

The practical takeaway I had is two related words: one word's uncertainty and the other word's assumptions. I think they follow each other. When you've got uncertainty in the information when you're trying to do a risk assessment, then you have to make some assumptions. 

In my opinion, the most important column in your risk register is the one that lists all of the assumptions that have been made in the risk assessment. If you don't have an assumptions column in your register template, then I'd probably say practically insert one and teach your organization how to use it. 

The only way to know why a person assessed something as high or low is to know what they were thinking. You need to know what they were thinking when they filled in the gaps around the information that they didn't have. 

When you get, as you say, a risk assessment across your desk from a project manager, if you don't have that information in front of you and you're having a conversation with them, then the questions to ask are: What information didn't they have that they had to fill in for themselves? What are they uncertain about in their assessment? What assumptions did they make that aren't listed in the controls?

Without that dialogue, there's almost no point caring what the number is based on this research. It could be low, it could be high, but there's no reason to trust what that number is unless you know more information.

Drew: David, I'll agree with that and go a step further to say that the conversation we have around that column is more valuable than anything else we're doing in the risk assessment, because it's in that conversation that the differences come out. Half of those assumptions are things that one person is assuming will be put in place to keep the task safe—that's why they've given it a low score. The reason the other person's given a high score is that they're assuming those things aren't in place.

When we have that conversation, that's when we are more likely to actually change the way the task happens to fit in with the safer set of assumptions.

David: Drew, the invitation for our listeners this week is a bit of a challenge.

Drew: Yes. Normally, we ask listeners about things that we'd like to know, and we'd love to hear from you. This week, it's more of a mission. Go out and find two or more risk assessments in your own organization that are as near to identical as possible. It might be a risk assessment for a particular hazard—like finding two JSAs that have the same hazard listed—or two risk assessments for similar projects. Just compare them and tell us what you find. Do you find that people assessed the same risk in the same way, or do you find the divergence that this study found?

David: Yeah, I'm really looking forward to those answers. I'm disappointed that I'm not inside an organization anymore and I can't just go pick up a couple of JSAs from different work groups about the same tasks or something. Please share what you find.

Drew, today we asked the question, how subjective is technical risk assessment? The answer?

Drew: Varies. Scores ranging from 1-25 out of a possible range of 1-25. 

David: That's it for this week. We hope you found this episode thought-provoking and ultimately useful in shaping the safety of work in your own organization. Join us in the discussion on LinkedIn or send any comments, questions, or ideas for future episodes to us at feedback@safetyofwork.com.