The Safety of Work

Ep. 104 How can we get better at using measurement?

Episode Summary

Welcome to our first episode of 2023. In this episode, we’ll discuss the paper entitled, “Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them” authored by Prof. Jessica Kay Flake and Assoc. Prof. Eiko I. Fried.  It was published in 2020 in the journal Advances in Methods and Practices in Psychological Science.

Episode Notes

You’ll hear some dismaying statistics around the validity of research papers in general, some comments regarding the peer review process, and then we’ll dissect each of six questions that should be asked BEFORE you design your research.

 

The paper’s abstract reads:

In this article, we define questionable measurement practices (QMPs) as decisions researchers make that raise doubts about the validity of the measures, and ultimately the validity of study conclusions. Doubts arise for a host of reasons, including a lack of transparency, ignorance, negligence, or misrepresentation of the evidence. We describe the scope of the problem and focus on how transparency is a part of the solution. A lack of measurement transparency makes it impossible to evaluate potential threats to internal, external, statistical-conclusion, and construct validity. We demonstrate that psychology is plagued by a measurement schmeasurement attitude: QMPs are common, hide a stunning source of researcher degrees of freedom, and pose a serious threat to cumulative psychological science, but are largely ignored. We address these challenges by providing a set of questions that researchers and consumers of scientific research can consider to identify and avoid QMPs. Transparent answers to these measurement questions promote rigorous research, allow for thorough evaluations of a study’s inferences, and are necessary for meaningful replication studies.

 

Discussion Points:

 

Resources:

Link to the paper

The Safety of Work Podcast

The Safety of Work on LinkedIn

feedback@safetyofwork.com

Episode Transcription

David: You're listening to The Safety of Work Podcast episode 104. Today we're asking the question, how can we get better at using measurement? Let's get started.

Hey, everybody. My name is David Provan and I'm here with Drew Rae. We're from the Safety Science Innovation Lab at Griffith University in Australia. Welcome to The Safety of Work Podcast.

In each episode, we ask an important question in relation to the safety of work or the work of safety, and we examine the evidence surrounding it. Drew, this is our first episode for 2023. You picked this paper, so tell us a little bit about how it came about recording this one.

Drew: Sure. Happy New Year, David. Happy New Year, listeners. David, I think you mentioned in the last episode that we both have services we use that track and recommend papers to us, a bit like Amazon's "if you liked this paper, you might like to read this paper, too."

This one, to be honest, just came across my desk and I liked the title. It was during the Christmas break, so I had enough time to do some reading. I thought it might be interesting for us and interesting for the listeners.

David: Yeah, great. I think I've mentioned on a few podcasts, one of the things I really enjoyed about co-authoring with you is the effort that you put into titles of papers.

Drew: And I appreciate it when authors do that. No one likes reading boring stuff. Attention to the craftsmanship of producing the work, so that people can actually read it and enjoy the reading experience, is (I think) always valuable. I think this paper definitely delivers.

David: I think it's a great paper. Do you want to introduce it and end the suspense?

Drew: Okay. Our paper is called Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them. It was published in 2020, so pretty recent. It was published in a journal called Advances in Methods and Practices in Psychological Science.

As you might guess from the title, this is really a journal by researchers for researchers about research. It's pretty clear that most journals are really made to be read only by other researchers, but this journal is specifically talking about doing the research. 

I quite like these journals not just for myself, but I think they're good for non-researchers, because it's where people reveal the dirty truths of research. It's often very helpful in just understanding how research works: getting some ideas on how to critique research, getting some ideas on what works well and what doesn't work well, getting a sense of how research actually happens.

This particular paper actually grew out of a couple of workshops. The authors have obviously been developing these ideas in conversations with other researchers about what's going well and what's not going well in the way we're currently doing research, in this case in psychology.

David, I don't know if you've had a chance to look up the authors. The first author is Professor Jessica Kay Flake from McGill, who appears to me to be a mid-career academic. She actually focuses on measurement as her research.

The first couple of sentences of her research bio ask: how are we measuring what we think we're measuring? This is a foundational question for psychological scientists, because we often study phenomena that are difficult to observe and measure. That really appeals to me as a safety scientist. I think it's a foundational question for safety scientists, because we often study phenomena that are difficult to observe and measure.

The second author, as usual, I'm going to mangle the name. This is associate professor Eiko Fried from Leiden University. He's got a mixed body of work. It's half about mental health and half about how we measure stuff in order to research mental health.

Both are already widely published on this idea of how we measure and what makes good measurement. In this paper, they're playing method police: pointing out what goes wrong with measuring things and the common problems they see in other papers, giving advice to authors on how to improve their papers, and giving advice to readers on how to understand and critique papers.

David: As much as it is a paper by researchers, for researchers, about research, reading through this, I think it's very true of how practitioners approach their role inside organizations and what they measure. I guess we're going to talk about quantification of the phenomena that we're trying to study.

They give the example that if you're standing next to someone, you can quite objectively measure how tall they are. If you want to understand the extent to which they are anxious, depressed, or some other psychological characteristic, trait, or other aspects, you need a way of actually measuring that. We then go to surveys and we then go to quantification. 

I think this is a really important paper because it's not just researchers that use these types of research methods. Organizations use them all the time.

Drew: Yeah, that's absolutely true. Cards on the table, David. My answer to problems with measurement is usually don't measure. I'll probably have a few snarky comments to make at the start of this paper because they often go into it just with the assumption that measurement is the only way to do research, it's the only way to do investigation.

I would say, yes, measure the person's height. If you want to ask how someone's feeling, don't give them a quantified survey. Ask them how they're feeling, actually listen to what they're saying. There are times when in order to do the things we want to do, in order to answer the questions we want to answer, we do need to turn how they're feeling into a number so that we can do things like measure.

Does that number go up or down when we use certain treatments? Does a drug make that number go up and down? Does therapy make that number go up and down? Those are things that we can't do with qualitative research, we have to have measurements for.

David: We'll have this conversation today in three parts. First, what do these authors say the problems are in relation to measurement? Clearly, by the title of the paper, they see some real problems with what's going on, particularly in psychological research. Second, why do these problems matter? Why should these problems with measurement matter to researchers and practitioners? And then they lay out six questions that we can ask in relation to any research, and that researchers should ask and have good answers to themselves.

I think you and I agree that these are quite a good guide, not just for reading research papers, but anytime you see a measurement and a conclusion drawn from that measurement about some aspect of your organization, these six questions are good questions to have answers to.

Drew: They're all questions that, if an author doesn't answer them in the paper, peer reviewers really should ask in their peer review. If a reader sees a paper that doesn't have answers to these questions, then there are multiple layers of either sloppiness or lack of transparency that have gone on.

People have been making decisions, they might be good decisions, they might be doing the research well, but they're just not being clear and transparent about what they're doing and why they're doing it. That always leaves the possibility for things to slip through the cracks.

David: To load you up for maybe your first snarky comment, the authors introduce this idea of questionable measurement practices and go into the paper saying that a foundational part of science is defining and measuring what is being studied. Before one can study the etiology of something like depression, or the efficacy of an intervention to treat it, one must define the construct under investigation and build instruments that can measure it accurately.

Drew: David, I think that's a fairly accurate characterization of one particular approach to science. This is how some people approach things. We start off, we observe a phenomenon. We then put boundaries around it and labels on top of it. Then we measure it, and then we look at what makes the number go up or down. What causes it? How does that relate to other measurements?

I think that description puts measurement at the center of things, whereas I'm not very happy with the idea that that's the only way science happens. Particularly, it underestimates the amount of work that's necessary in that defining bit. 

They talk about defining and measuring as if they're the same thing, whereas I reckon there can be multi-year research programs just in understanding and defining something before you get to the point of being ready to put measurements on it. Maybe we'll talk a little bit later about how that particularly applies in safety. Certainly, measurement is a common and important part of science.

David: I think people maybe see this narrower definition of the scientific method, which is: what are we studying? What do we think is going to happen? How do we create a measurement to know whether that's happening? I didn't measure anything in my PhD. It was entirely descriptive. More often than not, there were peer review comments back from the journals saying there's no measurement in this study.

Drew: Yup. Just to illustrate the strengths and weaknesses of this, if we talk about our own safety clutter research, I know you in your consulting have tried to do a little bit of clutter measurement, but I think it'd be fair to say that there's currently no generally accepted way to measure clutter. There's no universal measure of how cluttered your organization is on a scale of 1 to 10.

It might be great to have that measurement. Particularly if you're going to say, okay, hey, is our organization getting more cluttered or less cluttered? You might like to have a measurement. You don't have to have a measurement for clutter to be a real concept or for clutter itself to be real and important to think about and talk about.

There are lots of ways of understanding clutter other than by counting numbers of pages or measuring how cluttered people think they are. Those measurements might in fact be deceptive. They might decrease your understanding of clutter if you don't get the measurement right.

David: I think the authors also say that these quantitative measurements are a useful tool, but they aren't the only tool, and then they go on to just exclude that whole qualitative inquiry approach to research from this paper. Everything we're talking about in this paper is specifically in relation to where a researcher does quantify a measurement in relation to their phenomena.

Drew: I really hate the way they start the first page of the paper by talking up how important measurement is. But let's take that for granted and say, okay, we've agreed that we want to measure. What are the issues that we need to worry about? That's the point at which I first started agreeing with what they were saying.

They say that collecting evidence that the instruments scientists build actually measure the constructs scientists claim they measure is a difficult and necessary part of the research process. That's one of those truths universally acknowledged, but so often glossed over and shoved to one side as people rush to find a survey they can use. It's genuinely difficult and genuinely important, if you're going to measure anything, to make sure that the way you're measuring it actually measures what you want to measure.

David: Do you want to lead us into the next part of the paper now then?

Drew: Okay, let's go through some of the main points they make in the early parts. The first thing they say is that this is really common. This thinking about whether you're measuring what you think you're measuring is just left entirely out of lots and lots of papers.

They cite a few meta-studies. One says that between 40% and 93% of measures used in educational and behavioral research lack any evidence that the measure is valid. I haven't read the underlying paper they're citing there to see how they got the range between 40% and 93%, but I would have thought that 40% was pretty bad, given that there's no good reason why every measure shouldn't have validity evidence.

Another one said that of a whole bunch of measurement instruments in a review of emotion research, 69% had no reference to where the measurement had come from: no prior research into the measurement, and not even any process the authors themselves had followed to develop the thing they were measuring.

They were all just like, hey, here's a scale we're going to use to measure anger, I've invented some questions. Here's the scale we're going to use to measure jealousy, here are some questions, with no mention of how they came up with that set.

Another study looked at research about the relationship between technology use and wellbeing. Those are things like, does social media make you sad? Does playing video games make you violent? It found that researchers pick and choose between questionnaires, making the constructs more of an accessory than a guide for analysis. People are just spouting their own ideas and then plucking a survey that they can use to claim that what they're doing is a little bit more rigorous than it actually is.

The authors say that when you don't have this validity, then how do you know anything about the actual conclusions of the study? If you're measuring the relationship between social media use and depression, and you don't have a trusted way of measuring social media use, and you don't have a trusted way of measuring depression, then nothing you say about the relationship between the two of them can be trusted either. Having good measurement is necessary just for any validity.

They say there are lots of reasons it could come from. It could just be underreporting: people do the stuff, they just don't bother to tell us. It could be ignorance, students who don't know better; negligence, people who should know better; or misrepresentation, people who aren't doing the right thing and are trying to hide that they're not doing the right thing.

They say, look, we don't have to judge what it is. We don't have to say it's evil. We don't have to say it's ignorant. We don't have to say it's negligent. All of those problems would be solved if people were only transparent. If people at least told us what they're doing, then we don't have to worry about the underlying causes of why they're making particular choices, so long as they explain to us why they're doing what they're doing.

They throw in this term, questionable measurement practices, which is a riff on the idea of questionable research practices, a label that's been stuck on the general problem of validity in psychological research. They're taking that general problem and focusing on this particular category of what and how we choose to measure. David, do you have thoughts about that initial setup?

David: Clearly, this is a really important research aspect. I guess the conclusions of any paper stem from the analysis of the measurement. Often, I think there's just an assumption that the research method matches the research question and can answer the research question.

I guess what I've learned in the last 104 episodes or so is that the peer review process helps, but it doesn't stop bad research from being published. I think what we're talking about here is one of those things that actually turns research into bad research.

Drew: That's fair enough. The authors then have a section on why you should even care about this problem. This is something that I personally struggle to communicate adequately on things like social media. You get a lot of people who think that weaknesses in research are just something that research method geeks care about. It would be nice if the research were done better, that might earn you more academic points, but what we've got is what we've got, and at least it's good enough. It's an indication.

The point they make is that without this fundamental validity, the answer could be entirely backwards, and you wouldn't know. We shouldn't just think of lack of validity as a theoretical problem. When we don't have validity, the answer could be the opposite of the one the researchers have published.

It's not nearly good enough. It's not partially good enough. It's not a hint towards the right direction. It's just plain a black box. We don't know what the answer is unless we've got enough validity to trust the answer. If a study has poor validity, you shouldn't treat it as tentative, you should treat it as no information.

They referenced something here, David. Have we talked about The Garden of Forking paths on the podcast before?

David: Not that I recall, no, until reading it in this paper.

Drew: Can I spend a little moment to explain this? I find it interesting, and I think it is important. This is a theory that was generated to explain why research can be bad without being negligent or unethical. The idea is that you're walking through a garden. Every time you come to a junction, you can either turn left or right. When you come to the next junction, you can turn left or right, next junction, left or right.

That's what designing research is like. You're going through a garden, you're making lots and lots of decisions, and the answer you get to your research question, is that coming because it's fundamentally part of the data, or is it coming because of all of those little decisions you've made along the way?

You could come out of this garden in lots and lots of different places. If all of the paths converge to only one exit, then maybe you can trust it. But if there are 10 exits, and your decision of the exit comes down to the decisions you made along the way, then you can't really claim that that's the correct exit from the garden. That doesn't have to be that you're being unethical because every researcher has to make all these decisions.

Just to give you a quick example, say we're trying to measure whether safety climate affects injury rates. You have to make lots and lots of decisions to study that. What time period? When do you start collecting data, when do you stop collecting data? Are you going to include injuries to and from work? Are you going to include subcontractors? Are we going to survey people directly, use the company's reporting system, or use the data from WorkCover?

Do you include every single record that's in the system or just the ones where you can confirm that there was actually a serious injury? What do you do if there are duplicate records? Do you count both or just one? What do you do if a record is incomplete? Do you count it or throw it out? Do we measure management workers together or do we separate them out for our analysis?

Of the dozens of different safety climate measurement surveys, which one are we going to use? Are we going to break it up into dimensions or are we just going to use a total count from the survey? If you're going to do the research, you've got to make every one of those decisions. The answer to, is there a relationship between safety climate and injury, could be different depending on how you make those decisions.

If you're an unethical researcher, what you actually do is you try every combination of decisions until you get the answer you want, and then you publish just that one. If you're a naive researcher, you just make every decision, get an answer and assume that it's the correct one, and just ignore all the other possibilities.
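
To make the garden of forking paths concrete, here is a minimal sketch (not from the paper or the episode) that simply counts how many distinct analysis specifications fall out of the kinds of decisions Drew lists. The decision names and options below are hypothetical illustrations.

from itertools import product

# Hypothetical analysis decisions for a "does safety climate affect injury rates?" study.
decisions = {
    "injury_data_source": ["company reports", "direct survey", "workers' comp data"],
    "include_subcontractors": [True, False],
    "include_commuting_injuries": [True, False],
    "duplicate_records": ["count both", "count one"],
    "climate_survey": ["survey A", "survey B", "survey C"],
    "climate_scoring": ["total score", "per-dimension scores"],
}

# Each combination of choices is one path through the garden.
specifications = list(product(*decisions.values()))
print(len(specifications))  # 3 * 2 * 2 * 2 * 3 * 2 = 144 defensible ways to run "the same" study

If the yes-or-no answer depends on which of those 144 specifications you happened to pick, then the measurement and analysis choices, not the data, are doing the work.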

The way the authors of this paper put it, some researchers may intentionally explore every forking path in the garden to find a significant result. Those are the unethical ones. Whereas others may simply get lost, taking a few wrong turns. They're the ones who just got a result by chance as a result of the decisions they've made.

This is why the choice of measurement instruments is so important because if the answer is going to be yes there is a relationship, or no there's not a relationship, and that answer is going to depend on which measurement tool you used, it's really important that you're clear about why you picked that particular measurement tool.

David: I think that's a really helpful example, and just to make it directly practical, I was thinking a lot about the KPIs and measurements that we use in safety, the so-called leading indicators. The research problem we're talking about is the practical equivalent of counting the number of leadership visits that happen in your organization, seeing that count go down at the end of the month, and saying, well, our safety leadership has gotten worse over the last month.

If you want to understand safety leadership, counting the number of safety leadership visits that management does probably doesn't give you any insight into the construct that you're trying to understand. I think that's the practical equivalent of using a measurement approach which doesn't actually match what you're trying to understand.

Drew: In this paper, I think they're going a step weaker than that. They're complaining about people who say they measured safety leadership and don't bother telling you that the way they measured safety leadership was by counting the number of leadership visits, or why they thought that was a good way to measure it. The catch is, someone tells you that leadership is going up or down, and you think, oh, that's a problem, until you learn that all they did was count the number of visits. Okay, maybe I don't trust this claim anymore.

David: Are we ready to step through these six questions?

Drew: Yes, we've got six questions. Let's go through them in order.

David: Kick us off.

Drew: Okay, the first question is, what is your construct? A construct is just a research way of saying, what is the actual thing that you're trying to measure? This is something that you and I actually talked a fair bit about in our manifesto for reality-based safety science, because it's something that definitely happens a lot in safety. People are trying to measure things before they've even really pinned down what it is that they're trying to measure, and they don't have a good theoretical definition of it in the first place.

What we said in the manifesto, if you don't mind me reading it as a direct quote, "Most quantitative research in safety involves the measurement of attributes of phenomena that are interesting because they're believed to be related to safety." In fact, we said safety researchers usually adopt the jargon of behavioral psychology by referring to these measurements as constructs. Safety scientists measure constructs relating to leadership, safety climate, safety culture, worker behavior, individual perception of behavior. But most of those constructs, we started measuring them before we actually had good theories about them, at least good theories within safety.

We've got these surveys, but if you compare the survey to the theory, they don't really line up. If you read a paper that just tries to qualitatively describe what safety culture is, it talks about things like cultural artifacts, symbols, representations, what meaning we attribute to safety. Then you'll get a safety culture survey and it asks questions like, how good is the safety management system? It's theoretically incoherent to ask that question based on the definition.

The concern they have is just unless we've adequately described what we're trying to measure and made sure that our measure matches that thing, then it's not a good measure. Maybe going back to your example, if you've described a good leader as someone who shows up on site often, then it makes total sense to measure leadership by the number of site visits. If you describe a good leader as anything else, then that's a bad measure.

David: Yeah, perfect. This first question, what is your construct, is really: what is the thing that you are interested in? Just to build on that last example and reinforce the point, I think we should do a lot more of that inside our organizations as well: really clearly defining what we're trying to understand before we start just slapping measures around.

Drew: Actually, my organization, Griffith University, has gone through a process of describing desirable performance attributes at different levels. What does a lecturer do? What does a senior lecturer do? What does an associate professor do? What does a professor do?

We haven't got to the point yet of attaching numbers to those things. What that does is at least create conceptual clarity about what the expectation is. We've been having a year-long conversation as an organization about whether, even without numbers, these are reasonable expectations. Is that in fact the job description? Is that in fact what we expect from people? Is that what we want?

If someone was to actually just go by the letter of the performance expectations, would they actually be doing the job we want them to be doing, or are they missing things, or would they be focusing on the wrong things? If we're going to measure performance, then we have to have that conversation first.

David: Assuming now that we understand the thing that the researcher was trying to investigate, what's the next question?

Drew: The next question is simply, why and how did you select your measure? Typical example there, we've decided that our construct is safety climate. Why did we pick a particular safety climate measurement survey? Why that one?

In something like safety climate, there are dozens and dozens of different survey instruments for safety climate. Why did you pick that particular one? Is it because it aligns with a particular theory? Is it because it's been previously tested in a particular way that makes it better than the other ones?

One of the things they point out is a lack of conceptual clarity. In other words, if we don't really understand that underlying concept, then we just get lots and lots of different measures. That's definitely what's happening with safety climate. All these measures of safety climate seem to be measuring almost totally different things. Why did you pick this particular one and call it safety climate?

David: They give some examples, obviously drawn from psychological research since it's a psychology research journal. They said there are 280 scales for measuring depression and 65 different scales for emotions. Like you said, we've probably got 30, 40, or even 50 safety culture or safety climate tools.

They also mentioned that some of them come in five or six different lengths, like a 15-item scale, a 25-item scale, a 42-item scale. Some translate better or worse into different languages. In some of these areas, you really need to have a good answer for why you chose the measure that you chose.

Drew: One example I really want to call out in safety is that lots of researchers in safety tend to use these different things from psychology as part of their work. It's definitely a problem in safety research that people seem to just pick one survey at random as if that's the first one they've looked at. Somebody in safety wants to measure anger, so they find an anger survey. They don't even seem aware that there are 19 different surveys that are specifically devoted to measuring how angry you are.

The one that really annoys me is people who want to measure personality. They use Myers-Briggs, but they never say why they picked Myers-Briggs. There's no good answer to that question, because no one who has done their homework would use Myers-Briggs. It is simply that they've picked one. They would be transparent if they said: here are the ones I considered, here are the different measures of validity, here's the theory behind it.

I happen personally to be a deep subscriber to the early 20th century theories of Jung, and that's why I picked Myers-Briggs. That's about the only honest reason for picking Myers-Briggs, but they never say that. They just say, oh, this is a commonly used survey, this is a popular survey, so they picked it.

There's one particular concept here. This is actually the first time I've heard this language, but I think it's beautiful, so I'm going to use it repeatedly. They call it the jingle-jangle fallacy. Jingle is when you think two instruments measure the same construct because they have similar names. You might put safety climate and safety culture into that category, or even two different safety climate surveys.

Just because they've got the same name on them, you think they're measuring the same thing. Whereas one is like a measure of individual attitudes towards safety, and the other is a totally different measurement of how good the company's safety management processes are or leadership commitment to safety. But they've been given the same name, so people just think they're both surveys about the same thing because people call that safety climate.

Jangle is assuming that two measures assess different constructs because they have different names. I think we do that as well sometimes. Some people talk about prosocial safety and some people talk about citizenship behavior. You look at the items in the surveys and they're exactly the same, but we don't consider them the same.

We haven't done our homework because we were just looking for other papers on pro social safety. We never thought to look at citizenship. They say that basically transparency helps readers, reviewers, and consumers of research spot jingle-jangle. I might have called that formally an equivocation fallacy, but jingle-jangle is a much better way of explaining it.

David: I like that description. It's nice to have some new terms. I think it's so common, I'm just thinking of all the papers that I've read, which is we're measuring safety climate, we've gone with this generally, widely used, accepted, and available tool. That's it with no other information about the validity and the alignment to the construct.

I think these researchers point out that you rarely get any of the detail, like what the actual questions were. Even if you know the title of the instrument, you rarely ever get any of the detail inside the paper of what the actual individual questions were.

Drew: That is, in fact, the next question they have. What measure did you use to operationalize the construct? Which is a really fancy way of just saying, tell us the damn questions. You did a survey, why the heck didn't you publish the survey as a table in your paper, as an appendix in the paper, and supplemental materials in the paper?

Not just the questions, but did you ask them with a five-point Likert scale? Did you ask them electronically? Did you ask them in an interview? How many items did you have? What language was it in? These are basic things that should be really easy to report.

David: I don't know if it actually is often, but it feels to me like it's quite common in psychological research for there to be batteries of measures as well: I want to understand the relationship between multiple things, so I'll put together a little battery of three or four different measurement instruments and maybe deliver them in one go. Fifteen questions about anger and then 15 questions about personal circumstances, all put together.

The authors even ask: what order did you sequence that battery in? Did you prime any responses to subsequent instruments with the instrument that had just been completed? I don't recall ever seeing that level of detail about how the construct was operationalized.

Drew: No, not at all. Back when Twitter was still working, there was a trending hashtag called #OverlyHonestResearchMethods, which asked: what would papers look like if people were transparent about why they genuinely selected things? Why was this the best way of measuring anger? I asked my supervisor, and my supervisor said, here's an anger survey.

There were 19 measures, but 18 of them were copyrighted and our library doesn't subscribe to them. There's a 19th one with a web page where I could download it from. I suspect those would often be the honest answers to why we used this particular measure. How did we operationalize it? We cut and pasted it.

David: If you're reading papers and you're seeing conclusions in relation to a particular construct or issue, what we're talking about now is just understanding the measure before you run off with those conclusions. Otherwise, you're just putting blind faith in the researchers' decisions around all these things we're talking about.

Drew: That's a good practice when you're reading a paper. If someone says they've done a survey, the first thing you should do is just skip to the part of the paper where they list out the survey questions. Just read the questions, then you know what they asked. If you can't find the questions, toss out the paper and find another one that's a little bit more transparent about what they did.

The next question gets down into the details a little now: how did you quantify your measure? Just because you've asked a survey doesn't mean you know how someone actually processed it.

There are three different things they say matter there. There are transformations of the responses: things like reverse scoring, where some items are scored backwards. If you've got one question, how angry are you, and a second question, how calm are you, you've got to score one of those backwards for them to mean the same thing.

Then there's how you select which items or stimuli form the scores. Often, what people do is a first pass where they exclude certain questions because the results on those questions weren't consistent, so explain exactly how that was done and why. Or they group items together and say these three items measure the same thing, so I'm going to add them together.

There are different ways you can add them together. You can average them, you can standardize them, you can group your participants into categories, you can put them in order and pick the top 10% and the bottom 10%. Lots of different ways of doing it, but it's important to know what methods were used. How did you turn this survey into a score or a set of scores?
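
As a minimal sketch of the scoring choices Drew is describing (not taken from the paper), here is how reverse scoring and a simple item average might look. The item names and responses are hypothetical.

def reverse_score(response, scale_min=1, scale_max=5):
    # Flip a Likert response so an oppositely worded item points the same way.
    return scale_max + scale_min - response

# One participant's hypothetical responses on a 1-5 scale.
responses = {"angry": 4, "irritated": 5, "calm": 2}  # "calm" is worded in the opposite direction

scored = {
    "angry": responses["angry"],
    "irritated": responses["irritated"],
    "calm": reverse_score(responses["calm"]),  # 2 becomes 4
}

# One quantification choice among many: average the items into a composite score.
# Summing, standardizing, or splitting respondents into top and bottom groups would
# each give a different number, which is why the choice needs to be reported.
anger_score = sum(scored.values()) / len(scored)
print(round(anger_score, 2))  # 4.33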

David: I did my undergraduate psychology degree in the 90s. That's the last time I played around with quantitative research and statistics. I just see tables and tables of numbers in these types of research papers, and it's your job as the researcher to explain that to me. It doesn't mean anything to me.

I am one of those people who tends to go to the findings and discussion part of the paper and go through the researchers' conclusions. I've done that a number of times since we've been podcasting. I'll send you a paper because of some really interesting conclusions, and you'll come back to me and say, this is rubbish research, we're not talking about this.

Drew: I think people reading a published paper should have a right to be able to do that. Peer review has lots of weaknesses and things that it can't pick up. But basic things like the summary and conclusions of the paper should be a fair representation of what the data actually said. You should be allowed to take that for granted. You shouldn't have to dive into the weeds of the table.

I think some of these questions give you an alternative approach rather than having to actually check the numbers, because someone who has been transparent about each of these things, you then know, okay, the peer reviewers have had a chance to clearly read these things as well, and they've done the checking for me. Whereas if you try to answer these broader questions and the paper doesn't have answers, then you know that the peer reviewers haven't really done their job either.

The next one is, did you modify the scale? If so, how and why? This is something that happens a lot in safety. I think it's a good thing, it has to be done. You think of a survey that was intended for measuring locus of control around illness, and you want to change that to measure locus of control around wearing PPE.

You pick a standard survey for locus of control and you just change all the words from my health to my PPE. There's nothing inherently wrong with that, as long as you've got a good reason to do it and as long as you're transparent about what you've done. The problem is when you don't explain what you've done.

I know I've read a lot of papers where they say, oh, I used a standard locus of control instrument and I adapted it for safety, but that's all. They don't tell you what the changes were, how they adapted it, or why they made the changes they did. If they're transparent, you can just look at what they've done and say, yeah, well, that makes total sense. I think that's really all we need to say about modification.

David: The last question, I guess, is: did you create the measure on the fly? Did you just make up your measurement?

Drew: Okay. This is the simplest one and the one that I get most angry about. The authors say, sometimes there are no scales available to study a construct of interest, or there may be good reasons not to use existing scales. Fair enough. Sometimes there are no ways of measuring what you want to measure. Sometimes all the existing ways are crap and you need something else.

How often does that actually happen? One of the most common things I have to say when I'm reviewing a paper, including reviewing someone's entire PhD, is: why the hell did you make up your own way of measuring safety climate? Was it that you weren't aware that there were already dozens of safety climate surveys? Did you go through every one of those existing ones and have good reasons for rejecting all of them, so that it was absolutely essential you make up your own?

I'm asking those questions in the knowledge that the answer is no. You weren't even lazy. You did one type of work that was a lot of work because you were too lazy to do the other type of work, which was finding out that someone had already done it. Ninety-nine times out of 100, with an existing measure, they've already made all of those decisions. You're not at risk of the garden of forking paths; you're not at risk of your results being determined by the decisions you've made.

Ninety-nine times out of 100, they've already validated it. They've already screened the items. They've already done the work to make it theoretically consistent. If you're going to make it up yourself, you got to do all of that work for yourself. You have to theoretically define the construct. You have to have a set of test items, then screen those test items, then rerun the survey, then validate it again, then test that survey, and then roll it out to measure what you want to measure.

You've got to explain all of that in your paper so that we understand how you came up with this new survey. By the time you've done that, you've written two other papers before your current paper: one proposing the measure, another validating it, and the third is the one where you actually use it. An entire PhD could be to develop a measure, validate the measure, and use the measure.

David: With a literature review at the start explaining why there are no existing measures that do what you need them to do.

Drew: Yeah, perfect. There's the design of someone's PhD, if no one else has already done all that work in your field. But if someone has already done it, the literature review should find it, and then it's a change of plans.

David: And use it, yeah. It's a great point. I guess it does happen in research that you make up a measurement. If we take these researchers' advice, when there genuinely is something new that you want to understand, they basically said you need to go through this whole set of six questions and explain how you met each of them in the design of your own measurement instrument.

Drew: I think this one is very important for organizations. For almost anything an organization wants to measure, academics have already done the work for you to have seriously well-developed and rigorously validated measures. You still need to do a little bit of work to check that it's actually suitable, that the construct it measures matches what you as an organization care about.

You're still probably going to have to pay to use the survey to get a license for it, but all of that is way less than the cost and effort of developing something new. Or worse, paying a consultant to pretend to have done the work, when all they've done is just make up a set of questions without doing all of the validation.

David: Or end up using something that isn't actually helpful for what you're trying to understand, and making decisions based on that.

Drew: I just realized, having said that, that you are a consultant. I'm very well aware that you have developed new measures, and I've seen the amount of work you have put into developing and validating the measures before you use them with your clients. So you know I'm not having a go at all consultants, #notallconsultants.

David: No need. I feel exactly the same way.

Drew: I think what annoys me is seeing the effort you put in, and then seeing other people who clearly haven't who would claim to do exactly the same thing.

David: For me, it's the lesser of two evils than having to get a real job inside an organization again. 

I think these questions matter because, in quantitative research, we seek to measure something that usually isn't readily quantifiable, so we need to have some kind of instrument that gives us an insight into what we're trying to understand. I think this paper really highlights that.

When you read through this paper, it's really accessible. The language is really accessible, and I think it's open access as well. It's a good paper to read and just ask, how critical are we as consumers of research? If we translate that into our organizational practice, what does it mean for the things that we measure inside our companies? How critical should we be of some of those things? Do you want to move into some practical takeaways?

Drew: Sure. Just before we do, should we give a list of those six questions again? The six questions to ask are: What is your construct? How and why did you select the measurement instrument or survey, being careful of jingle-jangle? What did you use to operationalize the construct, that is, what exactly are the items you used, how many items, long form or short form? How did you quantify your measures: how did you group them, average them, include or exclude items? Did you modify the scale, and if so, how and why? And did you create the measure on the fly? We're looking for answers to those six questions. Takeaways.

David: The first one, for researchers: the paper concludes along the way that we don't know whether people are ignorant, misrepresenting, or something else. But if we all just became more transparent around these six questions, then people consuming the research could make up their own minds about their faith in the conclusions. I think the practical takeaway for researchers is to expand out the method sections in your papers a little more than you currently do.

Drew: I would add to that, based on some feedback I've had about our own manifesto: when you've got a paper like this, which is about common pitfalls to avoid in research, ask the questions and think about them before you design the research. Don't pick up a paper like this when you're finished and then use it to post-hoc justify what you did. You should be transparent about this, but if you're honest with yourself, the answers to some of these questions should cause you to go back and do it differently.

David: It's more of a checklist for research design rather than a checklist for research write-up.

Drew: Exactly.

David: Perfect. I guess as consumers of research, which we all are in one way or another, either directly or indirectly: as much as the peer review process of research publication helps with a level of quality, with quantitative measurement processes, particularly around individual and social phenomena like we're talking about in psychological journals, we can't do what I said earlier. We can't just go to the abstract and the findings and take the researchers' conclusions at face value. We need to understand this level of detail about the measurement process and the associated validity of the measurement.

Drew: What I like about this is validity here, we're not talking about understanding some statistical analysis of the results. We're not talking about statistical validity. We're talking about read the sentence where they say why they picked that particular survey and don't accept an answer from the researcher that you wouldn't accept from your own employee.

If the answer is, this is the first one we came to or this is common, other people use this a lot, just don't accept that as an answer. Look for good answers about, what were they trying to measure? Why was this a good survey? They should be able to explain that to you in plain English, not using maths.

David: The final takeaway for practitioners—I guess many of our listeners are practitioners—is just thinking about the things we've spoken about today in this paper and thinking about how you research and report measures in your own organization. Every measurement that you have in your organization in relation to an aspect of safety or work and then how you represent that measurement in your organization, think about whether you can answer these six questions for each of those.

Do I know what I'm seeking to understand? Why do I think that this measurement is a good way of doing it? Am I clear with everyone in the organization about exactly how we operationalize it and how we analyze it? And in doing all of that, are we just making stuff up ourselves?

Drew: If you're too embarrassed to share those results, that's probably a good clue that you maybe should have put some more thought into it.

David: The question we asked this week was, how can we get better at using measurement?

Drew: Normally, we need a sentence to answer this and I take a paragraph. The authors in this paper actually have a single word answer for us that occurs throughout the paper, which is transparency. If we're honest about what we're doing, that's the starting point for all other improvements.

David: Excellent. That's it for this week. We hope you found this episode thought-provoking and ultimately useful in shaping the safety of work in your own organization. Send any comments, questions, or ideas, for future episodes to feedback@safetyofwork.com.