Today Drew and David scrutinize ChatGPT's ability to deliver safety recommendations, questioning the fairness of expectations placed on the AI and considering appropriate benchmarks for its performance. Their analysis is framed by the article “The Risks of Using ChatGPT to Obtain Common Safety-Related Information and Advice,” published in the journal Safety Science in November 2023.
From discussing mobile phone use while driving to the challenges of giving advice to older adults at risk of falls, this episode covers ChatGPT’s responses to a wide range of safety topics - identifying biases, inconsistencies, and areas where ChatGPT aligns or falls short of expert advice. The broader implications of relying on ChatGPT for safety advice are examined carefully, especially in workplace settings. While ChatGPT often mirrors general lay understanding, it can overlook critical organizational responsibilities, potentially leading to oversimplified or erroneous advice. This episode underscores the importance of using AI-generated content cautiously, particularly in crafting workplace policies or addressing complex safety topics. By engaging with multiple evidence-based sources and consulting experts, organizations can better navigate the limitations of AI tools.
Quotes:
“This is one of the first papers that I've seen that actually gives us sort of fair test of ChatGPT for a realistic safety application.” - Drew
“I quite like the idea that they chose questions which may be something that a lay person or even a generalist safety practitioner might ask ChatGPT, and then they had an expert in that area to analyze the quality of the answer that was given.” - David
“I really liked the way that this paper published the transcripts of all of those interactions with ChatGPT. So exactly what question the expert asked it, and exactly the transcript of what ChatGPT provided.” - David
“In case anyone is wondering about the evidence-based advice, if you think there is a nearby terrorist attack, ChatGPT's answer is consistent with the latest empirical evidence, which is: run. They go on to say that the rest of the items are essentially the standard advice that police and emergency services give.” - Drew
“[ChatGPT] seems to prioritize based on how frequently something appears rather than some sort of logical ordering or consideration of what would make sense.” - Drew
“As a supplement to an expert, it's a good way of maybe finding things that you might not have considered. But as a sole source of advice or a sole source of hazard identification or a sole position on safety, it's not where it needs to be…” - David
Resources:
The Article - The Risks of Using ChatGPT to Obtain Common Safety-Related Information and Advice
DisasterCast Episode 54: Stadium Disasters
The Safety of Work on LinkedIn
Drew: You're listening to the Safety of Work podcast Episode 125. Today, we're asking the question, does ChatGPT provide good safety advice? Let's get started.
Hey, everybody. My name is Drew Rae. I'm here with David Provan, and we're from the Safety Science Innovation Lab at Griffith University in Australia. Just to answer the question we get most often by email, I am not recording from an aviary. I'm in Tamborine Mountain, Australia, in the middle of a forest instead. David, I believe, is down in Melbourne. Welcome to my mountain and David's garage.
David: Yeah, it is a bit like a garage, but there's not as much birdlife here as where you are, Drew.
Drew: Welcome all to the Safety of Work podcast. In each episode, we ask an important question in relation to the safety of work or the work of safety, and we look at the evidence around it. Today, David, we're going to, I think for the first time, talk about any form of AI, but particularly we'll be talking about generative AI and ChatGPT.
David: Yeah, I guess a little bit late to the party with doing an episode on ChatGPT and AI. There's still a bunch of papers coming out. I saw Ben Hutchinson just reviewed another one only a week or two ago. I'm really looking forward to this discussion today, Drew. I personally have probably experimented as much as most with AI, just some minor applications related to the work that I do. How about yourself, Drew? How much have you had to play with generative AI?
Drew: We've been pretty concerned with its use for academic misconduct. We do a lot of playing whenever we write an assessment. We try to work out how you would generate an answer to that assessment using ChatGPT, and then how we redesign the assessment so that you can't do that.
In the safety space, I think one of the reasons we haven't talked about it is that ChatGPT 3.5 came out around November 2022. That's when the big hype of it was. There was this whole rush of people racing to get out AI papers, but they were all just really low-quality, conference-style "hey, we did a little play around with it and this is what we found" papers. There haven't been that many really solid looks at the generative side of AI and what it can and can't do. This is one of the first papers that I've seen, David, that actually gives a fair test of ChatGPT for a realistic safety application.
David: I think this is a fun paper, Drew, with an interesting method, not just the findings, but also the way that the paper's written. I found it quite helpful. Should we get stuck into it? Would you like to introduce the paper and then we can talk about the study?
Drew: Yeah, let's get right into the paper. The title of the paper is The Risks of Using ChatGPT to Obtain Common Safety-Related Information and Advice. A nice, clean paper title. Not exciting, but the paper does exactly what it says on the tin. I'm not going to read all the authors. This paper's got a lot of authors, and we'll explain why in a moment. The lead author is Oscar Oviedo-Trespalacios. Oscar's a professor at Delft University of Technology, which is one of the international powerhouses of safety science.
Prior to Delft though, Oscar was right here in Brisbane at CARRS-Q, the Centre for Accident Research and Road Safety - Queensland. His specialty so far has been mainly in things like mobile phone use while driving, but with some really interesting branches out into broader concepts in safety, relying on his very specific expertise. The reason why there are so many authors, though, is what they've done in this paper. They've got nine different authors, each of whom is a specialist in some particular aspect of safety. They're going to use these experts as both the writers of the paper and as their expert participants in conducting the study.
The date on the paper is November 2023. In this case, the journal is Safety Science, which doesn't matter that much, but the date is really important. November 2023, late 2023, based on the cycle of publishing, is about the first time you'll really start to get papers that have been through the full writing and review process after ChatGPT came out. Stuff from that early generation will keep coming out over the next couple of years, but this is one of the first things to come out that is also solid and has gone through the proper peer review cycle.
David: Drew, I think there are actually 15 authors by my count. I think there are a few other colleagues that I recognize from Delft and elsewhere. A lot of authors, but Drew, should we talk a little bit about the method and what the research was?
Drew: Sure. I'm wondering now what half these authors did because we know exactly what nine of them did. The idea is they took nine experts. These are experts who have got direct experience and leadership in a particular safety concern. Basically, the test here appears to be that they have published multiple papers directly on that topic. All of these authors, they're practicing researchers. Each expert went into a conversation with ChatGPT about a particular topic. By conversation, we just really mean here one main question followed up with a couple of follow up questions if the main question didn't elicit really the right response to analyze.
Just to explain the questions they asked, they were trying to ask questions which are about an issue that a lay person might ask about and something that is talked about in the media and talked about in safety science research. They're going to analyze those responses based on the expert's own knowledge and what the expert considers to be reputable advice and a full answer to the question.
To be precise, they're using ChatGPT 3.5 as it was available on the public OpenAI interface in January 2023. For those who aren't familiar, ChatGPT regularly gets updated. It goes through big version number iterations, though, so it typically jumps in half number increments. ChatGPT 4 is substantially different from ChatGPT 3.5, which is substantially different from ChatGPT 3. But there are lots of minor iterations along the way. They might, for example, put in particular rules or filters that change its behavior in between those major version numbers. David, what do you think of this as a reasonable use case and test of ChatGPT, asking it the question and getting an expert to evaluate the answer?
David: Yeah. It's an interesting study design. It's not one that I would know what to call. Normally, when we start getting the opinions of experts on things, it's more of a Delphi study. I think there are a lot of limitations to trying to get experts to agree on something and calling it research. In this case, I actually quite liked testing ChatGPT on what its intended use case is. Generative AI's intended use case is that people can ask a question and get a good answer, a useful answer. Maybe not necessarily a good answer, but a useful answer.
I quite like the idea that they chose questions, which may be something that a layperson or even a generalist safety practitioner might ask ChatGPT, and then they had an expert in that area to analyze the quality of the answer that was given. I actually don't know what to call it from a research methodology point of view, but it seemed like a good way to test exactly what they were trying to understand.
Drew: Yeah, I'd agree with that. The only thing that occurred to me as we were getting deeper into the paper is I would like there to have been some fair comparison. I thought at various times that they were being a bit harsh on what they expected out of ChatGPT. I was thinking, okay, so what would be fair to compare this to? Would it be the Wikipedia article, would it be the first answer that comes up on Google, or would it be a random user on Reddit? I would have liked them to think a little bit about, if you weren't using ChatGPT, what you would be using instead, and how well is ChatGPT performing? Because it'll never be perfect, but nothing will ever be perfect.
David: Yeah, and I think because these researchers are academics, their basis for comparison was like a state-of-the-art systematic literature review on the topic. In the way that this paper is published, it's almost as if each of the nine experts were told, write a 500-word analysis for this, because even in the way that they structured their own analysis and feedback on each of the nine different cases, it felt like nine different people writing their response of what they thought about it.
Drew: Yeah, I actually quite liked that style because they all noticed different things. It did seem sometimes as if they were comparing it to: have you included everything that would be included in a systematic literature review, and have you written something that is as concise and clear as a government advisory notice? That's a very high standard to meet.
David: The last couple of things that I want to mention and get your thoughts on as well. One thing that I was concerned about with asking an expert to evaluate a generative AI answer, given that people may be wanting to use generative AI to replace the need for experts, is that the expert might be overly critical of the AI. I think, like what you've mentioned, in a lot of ways sometimes they were. But in case we had a concern that the expert was maybe misrepresenting certain aspects of what ChatGPT actually provided, I really like the way that this paper actually published the transcripts of all of those interactions with ChatGPT, so exactly what question the expert asked it and exactly the transcript of what ChatGPT provided. You could actually see the raw information and where they were drawing their conclusions from. I really like the ability to have that transparency.
I also think, as we soon talk through the nine different cases, why nine? I don't know why nine. Like what you said there about the different experts noticing different things, I continued to learn more about the nuances and the limitations of ChatGPT in each of the nine cases. I think if this paper had only had two or three examples, we wouldn't have been able to feel as confident in the conclusions that the paper draws.
Drew: Yeah, I love the use of nine very different topics with a range of experts looking at it and the full transparency. One tiny little caveat, which is that ChatGPT is context specific. If they started those transcripts after asking it a couple of questions, that would have primed the results. They don't specifically say in the method that they used a totally fresh ChatGPT session before asking their first question. That's a tiny little niggle. The amount of transparency here is way above what I've seen in similar papers, and it really gives you confidence that they're not being unfair in their characterization or in their criticisms.
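For listeners who want to see what a "fresh session per topic" protocol looks like in practice, here is a minimal, hypothetical sketch. It is not how the authors ran the study (they used the public ChatGPT web interface in January 2023); it just illustrates Drew's point that each topic should start from an empty conversation so earlier questions cannot prime later answers. The model name is a stand-in, and the two example topics reuse questions quoted later in the episode.

# Hypothetical sketch only: one fresh conversation per safety topic, so nothing
# asked earlier can prime the later answers. Assumes the OpenAI Python SDK (v1+)
# and an API key in the environment; the paper itself used the ChatGPT web
# interface, so this approximates rather than reproduces their method.
from openai import OpenAI

client = OpenAI()

topics = {
    "mobile_phone": ["Is it safe to use a mobile phone while driving?",
                     "How can I use my phone while driving safely?"],
    "child_drowning": ["Are children at risk of drowning?",
                       "How can I keep children safe around the water?"],
}

for topic, questions in topics.items():
    history = []  # empty context for every topic, no carry-over between topics
    for question in questions:
        history.append({"role": "user", "content": question})
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # stand-in for the ChatGPT 3.5 web interface
            messages=history,
        )
        answer = response.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        print(f"[{topic}] {question}\n{answer}\n")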
David: They're fun examples, so should we talk for a few minutes on each of these now?
Drew: Yeah. Let's just take one each and go through the nine.
David: Lead us off.
Drew: The first one is mobile phone use while driving. I'll give you the example of the exact question they asked here just to get a sense of how they prompted it. David, do you have it immediately in front of you?
David: No, not right in front of me, but they did ask two questions. (1) What are the safety risks? (2) How can I do this activity safely?
Drew: Yes. The first question they ask about mobile phone use is, is it safe to use a mobile phone while driving? Then ChatGPT gives an answer, and then they say, how can I use my phone while driving safely? ChatGPT gives an enumerated list of suggestions. This is what the expert evaluation says of this advice. Basically, it is reliable and reasonably complete. It doesn't say anything that's stupid, and it picks up on some things that other sources might not pick up on.
The expert did have a few niggles. The first one is they noticed that it immediately uses a US source for how dangerous mobile phone use is without any caveats or considerations. It uses the source as evidence of the absolute risk of mobile phones. It's also overly confident on a couple of nuanced things. When talking about whether it's bad to text while stopped at red lights, the literature on this is currently ambivalent, whether there's actually evidence that it's dangerous. ChatGPT just says, no, flat out, it's dangerous.
On the other hand, there's a lot of recent literature on different apps that are used to manage phone use while driving. The expert considered that to be an important topic that you really should cover in a modern suggestion about mobile phone use, and ChatGPT didn't pick up on that. Overall, good. Overly generalized to the US, one thing that it was confident where it should not have been confident, and one subtopic that it should have talked about that it missed.
David: Drew, when we think about the safe use of mobile phones while driving and what the generative AI is actually doing with the body of information available on the internet, you can imagine there are quite a lot of things written about mobile phone use and driving. It's obviously not just academic literature. There'd be stuff from regulators all through that. There'd be general articles in magazines, papers, and things like that. I think for me, some of that response was predictable, but the nuance is really, really important there.
Drew: Yeah, I think it's fairly important that this is not a topic where there's a lot of misinformation. It's a topic where there's lots of things telling you exactly what ChatGPT told you. It's successfully synthesizing something that is talked about a lot all in the same direction, but it's missing where there's a significant academic contribution to the topic.
David: Yeah, it's like not texting while you're at a red light. I'm sure all over the internet, there's very strong advice from regulatory authorities that under no circumstances is it safe to do this. That's what's giving ChatGPT its confidence to be absolute in its advice around that without obviously understanding the credibility of any of those sources.
Drew: Just quickly before we move on, David. We'll come back to this later in the recommendations. If you were a safety person tasked with creating a mobile phone use while driving policy for your organization, this would not be up to scratch. It would be basically telling you what you could find from your first search on the internet, doing it not inaccurately, but you're not adding value as a safety professional here if you're using this answer. You're missing out on the latest research, you're missing out on some new interventions that could be quite useful. If you're a teenager saying, can I get away with using my mobile phone, it's probably enough.
David: Which is what the internet's for. The second scenario, we'll call them scenarios or cases, was about how to keep children safe around water and the risk of children drowning. Basically, ChatGPT was queried about, how can I... I'll just get the question right here.
Drew: The first question was, are children at risk of drowning? And then the second question is, how can I keep children safe around the water?
David: Yeah. Drew, do you want to pick up from there? The expert picked up on some of the things that ChatGPT mentioned that other sources often miss, like not just talking about swimming pools, but talking about things like natural waterways. ChatGPT started to get into a broad drowning risk conversation, but then, I think the expert said, it really stopped short of going as far as the expert would go on things like flood situations and a bunch of other scenarios or contexts where children may be at risk of drowning.
Drew: Yeah. What it seems to be doing is tracking where there is the most stuff published. There's obviously a lot of stuff about pools, so it picked up on pools. There's a separate bunch of stuff about natural waterways, which it also picked up on, where people who are just talking about pools tend to forget about natural waterways. But there isn't as much written about drowning in floodwater, so it didn't prioritize talking about flood-specific advice.
Also, most of the stuff that's published is about high income countries, particularly the US. Most of the stuff that's published talks about pools and on prevention methods that really only work in an enclosed body of water like a pool, and that's what ChatGPT chose to talk about. The question didn't mention pools, but that's where ChatGPT went because that's where most of the stuff is. If you were in a country where the biggest risk is actually big bodies of open water, which is the case for most low and medium income countries, it wouldn't be wrong. It would just not be the most applicable and useful advice.
David: Again, Drew, like you said there, the point I just want to emphasize is that ChatGPT is putting credibility on things that are said most often and isn't really able to distinguish the accuracy or validity of the information. Drew, the third one I'm going to throw to you to talk about. I'm confident that there's a DisasterCast episode on stadium disasters to do with crowd crush incidents, so I'm going to let you talk to this third one.
Drew: Okay, you put me on the spot, and I'm not going to be able to give you the DisasterCast episode number. There was an episode where I talked through several different stadium disasters. I like the way they framed this question. They first asked, if I go to a concert, am I at risk of being crushed in the crowd? To which ChatGPT said yes, which I think is probably an overstatement of the risk here. But then they actually asked for advice.
If I'm caught in a crowd crush, what can I do to survive? This is where ChatGPT is clearly crowdsourcing crowd crush advice. Its advice ranged from the valid to the useless, to the actually incorrect. Its most valid advice was stay calm. Its second piece of advice was stay upright, to which the expert response was, well, that's easy for you to say. The problem in a crowd crush is that you can't control whether you stay upright. Its next advice was keep moving forward, which the expert rated as contrary to safety. And shout and wave your arms, which the expert noticed is entirely contradictory to the stay calm advice.
Basically, none of this advice was really about stuff that would help you. It was missing the advice on how to notice danger signs, how to notice increased crowd density, how to notice accessibility of exits, and how to get out of the situation while you are still able to navigate. It was missing the one piece of advice that does actually work if you're actually caught in a crush, which is don't try to go sideways; look for opportunities to escape vertically by climbing or getting yourself up above the crowd.
The general conclusion was that it's accurate about the problems, but not at all accurate or specific in giving advice. While you're in the middle of a crowd being crushed, don't pull up ChatGPT and try to follow its instructions.
David: I think we can start to see, and this is where each of the nine case studies is going to give us an extra bit of nuance, because in the first two examples, where we've talked about mobile phone use while driving and water safety for children, there's a lot of information on the internet that is mostly consistent. But now we're starting to get into quite a nuanced question about crowd crushing and advice. I think there'll be less generally available information, and ChatGPT is probably drawing from adjacent information that's not really about crowd crush but might be about just walking around busy streets or something like that.
Drew: I had an interesting question here, David. The question about child drowning was, how can I keep children safe around the water? The question they asked for the crowd crush was, if I am caught in a crowd crush, what can I do to survive? I'm wondering if there's a nuance in the prompting there, that if we had asked, how can I prevent crowd crushes, or how can I keep crowds safe or keep people safe in the event of a crowd crush, it might have led to different advice. The attempt to personalize it might actually have taken ChatGPT down an area where it doesn't really know how to reinterpret the prompt to give you what you actually want. It's more giving you directly what you asked for, where there isn't a lot of stuff.
David: Yeah. We're going to talk in the recommendations and conclusions about a lot of the safety advice being quite individual, quite behavioral, and maybe some of the prompting specifically drove that kind of response. Another point that we'll look at with our ninth example is the different responses when we use different language that a layperson might use to mean the same thing, but ChatGPT sees them as very different words. Again, what I like in this study is that they've actually given you the transcripts. If we're curious about those things, we can be curious about them, and that can do what all good research should do, which is generate the next research question that we might want to go and research.
Drew: Absolutely. This isn't a criticism of the research. The method has enabled us to notice these nuances.
David: There was a part B to this crowd crush scenario which I've called case study 3B, which was where ChatGPT was prompted to give recommendations about the best possible response in the case of external danger like a terrorist attack, a mass shooting event, or something like that which might generate a crowd crush.
Drew: Yeah. In case anyone is wondering about the evidence-based advice, if you think there is a nearby terrorist attack, ChatGPT's answer is consistent with the latest empirical evidence, which is: run. They go on to say that the rest of the items are essentially the standard advice that police and emergency services give.
The interesting thing was when they asked follow-up questions, ChatGPT didn't stay consistent. When pressed to give more information, it wasn't able to elaborate without providing really quite contradictory things. It basically ended up saying, oh, both stay still, be calm, and run. It didn't notice that those two things are not possible at the same time.
David: Yeah, it was telling you to run, it was telling you to hide, and it was telling you to lock or block an entry and stay where you are. It was telling you to call for help and be quiet at the same time. If you look at the six, I'm just looking at the transcript of the six things to do to increase your chances of survival, it's hard. I think that was one of the things we'll talk about in the recommendations. If you just lift up the response and try to do it as it's provided, you'll run into problems. Case study four, Drew.
Drew: Case study four was older adults at risk of falling at home and what interventions are available. The exact question here was, is walking inside the home safe for older adults? How can I reduce my risk of falling and fracturing a bone as I get older? Again, the questions are steered a little bit towards giving personal advice, but that was the brief that the experts were given, is what might a person actually type in.
David: Interestingly, Drew, when you look at the 100 or 150 or so words that ChatGPT generated, it's very biased towards tripping hazards around the home. I don't know the research and the literature in this space, but it would be my assumption that a lot of falls around the home are in wet areas due to slippery floors, getting out of bathrooms, getting up off toilets, or moving around kitchens and things like that. There's no mention of wet areas, slippery floors, or anything like that. ChatGPT just saw "fall" and thought "trip over".
Drew: Yeah. We're not the experts, but they had an expert reviewing this who tended to say that there are things other than the immediate environment that affect fall risk, like talking to your health practitioner and having a suitable exercise program to safely maintain your fitness, resilience, and walking ability. A lot of it is also location specific, that there are particular areas around the home, David, as you said, like bathrooms, where you're going to want more specific advice rather than just "avoid trip hazards".
David: I like the fall prevention program too and education on fall prevention. I can see a person in their own home putting posters up on the wall. A bit of a stand down for fall prevention month.
Drew: The way they summarized it is that although ChatGPT could synthesize most of the information accurately, it didn't really have a rationale for the order of suggestions or any prioritization. Again, they had that same conclusion that it seems to prioritize based on how frequently something appears rather than some logical ordering or consideration of what would make sense. It's a bit as if you asked it, give me, in order, the things which are most commonly mentioned in fall prevention guidelines, and it gave you this list.
David: I want to pick up on an interesting note that you mentioned there in our notes, about how ChatGPT finished its response to "what can I do to reduce the risk of me falling around the house". The last sentence of the ChatGPT response was that taking steps to reduce your risk and maintaining your independence improves your quality of life. It finishes on quite a positive tone. The expert said that's consistent with research on how to present information to older people in a way that motivates behavior change. We've talked about it missing some nuances in the content of the answer, but ChatGPT in this case has picked up the nuance of "I'm providing advice to an older adult about their behavior" and has framed the response in that way.
Drew: That's where the reviewer gives it credit and says, this is well-structured advice in that format. I think it's actually a coincidence here, because that's just how ChatGPT tends to structure its answers. It starts off generic and bland, goes specific, and then finishes generic and bland. That's the way ChatGPT works, and that's the way you spot a ChatGPT-written answer. I think in this case, the generic and bland was these things about, oh, it's good for your quality of life if you can do these things.
David: Again, it's just an interesting follow up research question about, did ChatGPT actually mean to do that or not?
Drew: We've actually got an honors student at the moment who is studying disaster warnings, like if a hurricane is coming, looking at what makes a good communication message and whether ChatGPT outperforms a human writing those messages. It is actually a really interesting and important research question. If people are using these answers, then how well-written they are does actually matter, and whether they follow a clear format that lets people take action. That's what this reviewer is commenting on: the expert is saying that when there are minor inconsistencies, that could actually really confuse an older person who is trying to read and follow these instructions. The fact that ChatGPT doesn't have an internal consistency check on its own answers is a problem, but its general formula might actually be really quite helpful in motivating people to follow the advice.
David: Okay, Drew, number five, exercising outdoors near road traffic if you've got asthma.
Drew: If you're worried about exercising, is exercising near traffic bad for you? The conclusion is that what it provides is credible and authoritative, pretty much what you'd get from a scientific organization. The difference, which the experts consider to be really important, is that it doesn't state the evidence underpinning the various claims that it makes. It doesn't include its citations, and most things on the internet don't, but that is important if ChatGPT is also not picking up on recent evidence. If it doesn't tell you the evidence, it is missing evidence, and it always sounds very confident, that's a problem.
David, they don't talk about this in the paper, but I'm immediately worried about this world where safety professionals start writing their advice using ChatGPT. It starts off as really good advice because it's simply copying advice that's out there. But once we get five years down the line and we're using ChatGPT to copy ChatGPT-written answers, they're just going to get more and more out of date, because it's going by volume, not by credibility or recency, and it's not able to synthesize the updates to its own advice.
David: Yeah. Also, this was one of the first times we started seeing an answer from ChatGPT along the lines of, according to the World Health Organization, which seems very credible. I'm sure that the WHO has published guidance on air pollution and asthma, but once it pulls from that source, it starts introducing words like incidence, about the frequency of certain types of conditions. This question didn't even ask about that. It was asking about, if you already have asthma, how safe is it to exercise outdoors next to traffic? Incidence becomes irrelevant if you've already got it. It's missing that nuance, but it's really confidently putting forward seemingly very credible advice. Does that make a little bit of sense?
Drew: The other thing that the authors note is that the advice is politically skewed. They say that there's advice out there that the best thing you can do for clean air is actually to advocate for clean air, rather than just worrying about whether you can go out in the smog. Generally, government agencies don't include that, because it's like the government criticizing the government. It tends to be academic and non-government organizations that give advice that includes these structural changes to reduce the risk. I love the way we have to guess what ChatGPT is doing, because it doesn't cite its sources, particularly not when it's synthesizing, but we have to guess that ChatGPT here is generally prioritizing these simple government advisories rather than academic information.
David: Given that most of that academic information is behind paywalls, again, you can't really blame ChatGPT for not drawing on it. I assume ChatGPT has the same difficulty getting papers as some of our listeners might.
Drew: The good news there being that Reddit has now put its API behind a paywall too, so maybe we'll get a little bit less Reddit going into ChatGPT in the future.
David: The world balances itself out. Are we ready to move on, Drew? Case six?
Drew: Yeah, let's talk about suicide. ChatGPT accurately tells us that if someone is considering suicide, it is appropriate to engage with them and to ask them directly about whether they're considering harm. This is the first area where we know there is lots of misinformation out there. ChatGPT gets it right and right on something that's really quite important about advice if you're thinking about how to talk to someone who you think might be suicidal.
David: Yeah. What this case was about is asking ChatGPT whether it's safe to ask, and how to safely ask, a distressed colleague if they're having suicidal thoughts. Like you said, ChatGPT covered a range of considerations and steps for how to engage with a distressed individual, and the expert suggested that was consistent with advice from leading mental health and suicide prevention organizations. Like you said, the key strength there, when we kick this off by saying let's talk about suicide, was just affirming that it is safe to engage with a distressed person. The research suggests that this long-held belief, that you shouldn't raise these topics because it further motivates people to harm themselves, is not true. I guess we don't know if ChatGPT got lucky, got this intentionally right, or found a particular source that led to this answer, because based on everything we've talked about so far on the podcast, it might have just got lucky.
Drew: Yeah. That's a different experiment. You could take one of these nine topics and try to ask ChatGPT multiple times in multiple ways to see how well it retained consistency with the good advice. They're trying to do something which is ecologically valid, which is, what would a normal user do? A normal user isn't going to ask it ten times, a normal user is going to ask it once. Its first hit on pretty much all of these topics has been broadly correct, and the trouble's been in some of the details.
David: Also, the expert in this case said ChatGPT wasn't able to see the bigger picture. When asked about whether it's safe to ask the person, it really only thought about whether it's safe for the person who may be distressed. It didn't provide any consideration for the actual supportive person, the one who's going to be raising the questions and doing the asking, which is inconsistent with a lot of the research about that supportive person's own well-being and accessing psychological support for themselves. That's particularly important in a workplace situation, because we're talking about a colleague, where these people are going to have ongoing interactions after this conversation, so you need support for them both.
Also in the nuance of this answer, ChatGPT brought up contacting a US-based suicide hotline. The expert suggested, look, that's not applicable globally, not just because the US hotline isn't available globally, but because research suggests that in lower- and middle-income countries, helplines are either not available or not the best alternative in these types of situations. Again, it's got this very high-income, Western, US-centric position on advice to the world.
Drew: On the one hand, it's US-centric. On the other hand, it doesn't consider the nuance that if this is in a workplace, there probably is a workplace-specific counseling service or helpline you can access. It doesn't even think of that. Its advice is, on one hand, too specific, and on the other hand, too generic to be fully useful there.
David: Number seven.
Drew: We could make a whole podcast episode about this. Is it safe to work under high pressure? And how do you deal with high job demands?
David: Workload. Maybe we should find a paper on workload. I'm sure many of our listeners, and maybe our hosts, have the occasional workload pressure. This was really about burnout prevention. The first question asked was, is it safe to work under high pressure? The second question was, how can I deal safely with high job demands? I guess the initial generic response was correct again, that working under high pressure can be either motivating or harmful depending on the specific circumstances. However, ChatGPT didn't really detail much about all of those specific circumstances; it really just talked about an individual's own limits and the importance for them to take breaks when they need to. Getting it generally right at a topic-sentence level and then not really following through with anything detailed and consistent.
Drew: David, do you want to talk at all about the strategies it suggested there?
David: Yeah, let's do that. The strategies that were suggested by ChatGPT were aligned with the current literature. It talked about proactively managing job resources and job demands, getting support, managing time effectively, and the importance of recovery activities and experiences like breaks, relaxation, and work-life balance. The expert said the research shows that these strategies can effectively preserve health and avoid burnout.
One of the things that ChatGPT wasn't able to talk about was any of the nuances in the research and the interconnectedness of any of these factors. We know that with stress and burnout there are escalating aspects and there are interdependencies around some of these factors. ChatGPT really just saw this as a list of standalone issues, almost like do this, do this, do this, do this, but didn't really talk to the nuance of the relationships between them. Do you want to add anything to that, Drew?
Drew: No, except that that does seem to be a bit of a pattern here. ChatGPT likes to give lists as part of its answers, but it doesn't intelligently consider the connections between items in those lists, whether they turn out to be contradictory, not prioritized in the right way, or interlinked and really should be talked about together. It just spells them out as if they're totally separate items.
David: Yeah. We've got two more to talk about, Drew. The eighth case example was where ChatGPT was asked about the safety of sharing personal data on fitness apps. We've now moved away from the risk of safety accidents, through mental health and well-being, and now it's more broadly into privacy, data security, and individual safety. Again, a bit of a broken record here, but overall, the response was what might be expected.
If you share personal information on mobile apps or websites, here are the things to consider. I'll just get the exact questions there, Drew. The exact question was, is it safe to share personal data in fitness apps? The second question was, how can I share my data safely inside fitness apps? Drew, do you want to jump in?
Drew: I don't know about you, David, but I read this one as the issues that the expert identified were bigger than the issues for some of those other things. Some of those other things could be seen as nitpicking, or the answer is a bit too specific or a bit too generic. Here, it's really missing some of the really big ticket items that you'd want to see in a good answer.
For example, talking about the vagueness of the privacy policies that fitness apps have, the fact that they might sell their data onwards to third parties, and the way apps tend to link with other apps. You can't just consider the first app, you need to consider other apps it's linking to. How things interact with regulations and whether those regulations do or don't apply to fitness devices or the data on devices as compared to apps. This is one answer. It's not wrong. It's just falling short in what would be necessary to make an actually useful answer.
David: Yeah, and I think it's a good point. We're talking about a layperson saying, how can I be safe sharing my data? But if you are in a company and you're responsible for writing the data privacy policy or something, and you went and got a ChatGPT answer, some of these things are quite significant, like the fact that privacy policies aren't required to cover certain things. Like you said, companies have policies such that once they share your data with someone else and you've given your permission for them to share it, there are no restrictions over what that third party can then do with your data.
Particularly, like you said, Drew, this idea of regulation. If a doctor takes your heart rate, then patient-doctor confidentiality applies. But if a wearable device takes your heart rate, then there's no legislation that covers that as being private data, because you've given a generic "I consent" to your data going places. Like you mentioned, I saw quite some nuance here. If we're actually looking at an organizational application, like you talked about earlier with being responsible for designing a policy for mobile phone use while driving, these are bigger gaps in that scenario, I think.
Drew: Should we move on to our final one, David, and talk about fatigue?
David: Yeah, I like this one. I thought this was the most interesting one for me anyway, so let's talk about that.
Drew: Okay. You've got most into this one, David, so I'll let you lead the discussion.
David: There were a few bits that were interesting to me. This is about fatigue. We're talking about fatigue and operating heavy machinery. The first question is, is it safe to operate heavy machinery when fatigued? The second one was, how can I operate heavy machinery when I feel fatigued?
What the expert was basically interested in right off the bat was, oh, look, I'm curious, because the word fatigue means something to some people, but I'm really curious about what happens if I use the word tired and if I use the word sleepy. The questions were repeated by substituting: is it safe to operate heavy machinery when fatigued? Is it safe to operate heavy machinery when tired? Is it safe to operate heavy machinery when sleepy?
Each of the two questions was tested under the three different wordings. We may think of these things as interchangeable: I'm tired, sleepy, or fatigued, and it has the same implications for operating heavy machinery. But ChatGPT gave quite different responses depending on which word was used.
Drew: I was just going to jump in and say that I thought it was fascinating that if you're fatigued, you should take regular breaks. If you feel tired, you should take regular breaks. If you feel sleepy, stop the machinery immediately.
David: That was one of the interesting nuances. When the word sleepy was used, it provoked the strongest safety response from ChatGPT, with the greatest confidence. It basically said, stop your machine if you're feeling sleepy. But if you're fatigued, it just says take breaks. The expert was saying, well, no, taking a break doesn't actually do anything for the fatigue. You need to have a sleep. Interestingly, for fatigue and sleep, ChatGPT never mentioned caffeine.
One of the strongest research findings about combating fatigue, sleepiness, and tiredness in the short term is caffeine, for both its ease of use and the immediacy of its impact. It was interesting that ChatGPT really treated these things as different questions, whereas the expert could see through the different language and provide a very useful response.
Drew: Yes. Based on the structure here, it's not a hundred percent clear whether this is triggered by the different terms or just by asking ChatGPT multiple times. If you ask ChatGPT the same question three times, it will give you three different generated answers. You can't read too much into the specific terms, but I think we can draw a general conclusion that differences in language, and asking the same question at different times, can give markedly different advice, and that should cause you to question how confident ChatGPT is when it's giving that advice.
David: I guess it's interesting. One of the examples here is that the sleepy query returned advice about getting fresh air, physical activity, and things like that as a countermeasure for sleepiness. The expert said the driving research shows that opening a window has no effect on sleepiness, but when I'm tired driving a car, before I manage to safely stop, I'll typically wind the window down and try to get cold air on my face. It's interesting that that's not really consistent with the research, but both myself and ChatGPT think that it's a good thing to do.
Drew: Yes. This is one again where it's not giving bad advice, but if the general lay understanding is different from the research, it's more likely to give the general lay understanding than the latest research.
David: The biggest limitation, this expert said, compared with best practice in safety science, is that it never tells you to report this as a risk in the workplace setting. The only time it said to contact a supervisor or take any immediate action was when the word sleepy was used. More broadly, there was this really heavy emphasis on the need for the individual to manage their own fatigue.
The expert's saying fatigue is a complex issue. There's shared responsibility across the workplace system for effective fatigue management. If you were writing a policy and you just picked and chose from this list to write your company policy, it's going to be heavily focused on individual behavior and not really address any of the organizational causes of fatigue. Drew, should we talk about some conclusions then?
Drew: Yes. The authors love to make life easy for us. They've given us both a list of conclusions and a list of takeaways. Let's go through the conclusions and see if we, from our own reading of this, agree broadly with their criticisms.
David: Just for our listeners to get a sense of my world right now, Drew has done a great job of critiquing each of their conclusions, and I've got Drew's comments next to each of them. I'll let you lead off with what the authors concluded and how we might think about their conclusions.
Drew: Okay. These are basically a list of risks of using ChatGPT 3.5 for getting safety advice. Number one is the provision of oversimplified and erroneous advice on safety issues. David, my impression there was I think that's a little bit strong, particularly since their recommendation in response is it's important to seek out multiple evidence based sources of information and consult with experts in the field. Who the hell is going to do that if they need safety advice? I don't think that's a fair comparison.
If this is a task where you should be seeking out multiple evidence based sources and consulting with experts, i.e. if you're like a safety manager writing a policy or you're a government person writing a regulation, this is absolutely not fit for purpose. If you're a casual user wondering whether it's better to type it into ChatGPT or type it into Google, I don't think it's so bad.
David: Yeah. I think they're overstating the erroneous advice a little bit. We pulled out one or two examples in a few of the cases where the advice might be inconsistent with research, but really the criticism is about not having nuanced, context-specific, or systemic responses. Again, I'm not sure that that's a fair critique of what ChatGPT is designed to do. But in those situations that you mentioned, it does mean that it's not a great tool for workplace policy or anything like that.
Drew: Yeah. I was actually surprised that there was nothing in there that was clearly identified as a ChatGPT hallucination. I don't think they directly checked whether any references it gave were vague or hallucinated. It didn't come up with any advice where you just thought, whoa, ChatGPT's gone off the deep end there.
David: Yeah. Drew, the second issue that they raised was the lack of warnings where the evidence is still developing, disputed, or fabricated. ChatGPT never really says, well, in this situation, it could be this or it could be that, and the evidence is actually quite inconclusive about what the position is. ChatGPT lands on a position and confidently puts it out there.
Drew: Yeah, and that's a completely fair criticism. ChatGPT sometimes hedges its bets when it's talking about the benefits of exercise versus tripping over. It says it's important to balance these two things, but it never hedges its bets by saying, this is a question that's under dispute or that experts disagree about.
David: The third issue is about the lack of transparency to users, which is, where did this information come from? We don't really know.
Drew: No, and that's consistently been a problem with ChatGPT. It doesn't tell you where the information is coming from. Even if it does give you a particular source, often it's just making up that source. If you try to pin it down on sources, that's when it really starts hallucinating.
David: Yeah, it sees respiratory disease and a few other things and decides that the World Health Organization, the CDC, or something might be a good reference to throw into this sentence. Two more issues. I think we've got two more. No, we've got a few more than that.
The next one is response content variation based on keyword use and querying behavior. Drew, you mentioned that, whether a fresh session was being used, the idea of tired, sleepy, and fatigued, and what you mentioned around the pools about how can I do this versus how can this be avoided, or something like that. The response obviously varies based on word use and how the query is written.
Drew: Yeah, I think that's a fair criticism of ChatGPT. They didn't really explore the extent to which it's true in this study. You could imagine a totally different study, where you did actually ask the same question nine times and then did a map of how often did various things get included or left out. That would be an interesting one to do, but that wasn't what they were testing here.
David: I think there's enough in the transcripts here, and obviously enough for the authors to be confident, that that is an issue or a conclusion. If you ask a particular question or use a particular word, by design, by the way this is built, it's going to generate different types of responses. I guess you have to really be able to ask exactly what you want to know.
The next point here is this emphasis on individual responsibility, this idea that whether we're talking about workplace fatigue, mental health in the workplace, or other things like prevention in homes, it's this real focus on individual behaviors and individual responsibility.
Drew: Yeah. This is why I don't think it's a fair criticism because most of the time they asked, what can I do to keep myself safe? That is totally fair for ChatGPT to answer that question based on individual responsibility. It'd be very different if they asked, what could be done about pool safety? And all the answers were individual.
David: Again, this would be my hypothesis, but I think that most of the information on the internet in these spaces would actually emphasize individual responsibility. You can imagine how many websites publish the rules to follow when swimming in pools. Don't run, don't do this, don't do that. A lot of the content on the internet just broadly, probably emphasizes individual responsibility.
Drew: Yeah. This is probably a fair thing to raise if we're imagining the other use case of safety professionals coming up with policies or advice. Even if it was caused by the prompts, that's an easy trap to get into. It's generating a set of advice that is all focused on one stakeholder and one individual. Lack of applicability to minority groups or certain contexts. This almost understates it, because all of their examples had a first-world US bias, sometimes right down to the phone numbers that were provided.
Last one, potential to overwhelm users with recommendations. They actually did a count. They say that most of the responses tended to give you around seven things in a list, without prioritization or rationalization of the lists. It's probably a bit too much to give people good advice.
David: Like you, Drew, when I get responses from students in some of the industry-facing programs that I run, I can very easily spot the ChatGPT-generated response. One of the ways that I usually spot it is when things are included in the response that I know are not in the subject material or the module. Say I'll be doing something about contemporary safety theory and accident causation, and I'll get a response that talks about the importance of compliance, discipline, punishment, and things like that. I'm pretty confident that that wasn't in any of the material in the course.
Drew: Yeah, it's a bit harder at uni where we expect students to go outside the course material. We tend to do the opposite and say, you must cite the course material in your answer. ChatGPT isn't very good at doing that.
David: Okay. We'll move on to practical takeaways, where you're going to give away all of the ways that you can beat the system if you're writing university assignments. The authors give their takeaways, like you mentioned. Do you want to finish us off here?
Drew: Yup, okay. They give a fairly strong conclusion: because it's risky to follow safety advice that's not complete, they don't think it's a good idea for individual users to use ChatGPT for safety-related information and advice. They think the problem is serious enough that there should be safeguards to prevent that happening, presumably in the same way that ChatGPT now has filters to prevent you asking it to hack systems for you or asking it for terrorist advice. They seem to think there should be some regulations or policies around either ChatGPT itself or uses of it.
I personally don't think, in this case, it is at that level of seriousness. Yes, its advice is not complete, but given what it's compared to, I don't think we're in the realm of, this is a dangerous thing, and it's dangerous that ChatGPT is out there and that people might be using it for this purpose.
David: I guess for our listeners, and we'll ask again when we publish the link on LinkedIn, if people are interested in more AI-related research, because I have spoken to a few people researching in this area, the practical takeaway seems to be consistent. As a supplement to an expert, it's a good way of maybe finding things that you might not have considered. But as a sole source of advice, a sole source of hazard identification, or a sole position on safety, it's not where it needs to be to say, I can go to ChatGPT, I don't need a safety advisor anymore.
Drew: Yup. The second one, policymakers such as your risk managers should refrain from using ChatGPT as a source of expert safety information and advice. Hell, yes. Do not use ChatGPT to write a training course, to write a policy document, or something like that. The lack of traceability, the inability to synthesize knowledge, the inability to spot nuance when there's a debate or conflicting evidence, the inability to update it with recent evidence, not only are you going to be immediately generating stuff that is bad, you'll be creating an ecosystem of data which is going to continue to get worse over time if you were to use it for that. Please don't.
David: I think one of the earliest professions where we've seen that over the last 12 months has been the legal profession, with the examples in the media of legal professionals who have used ChatGPT to create their defense for individual cases, the challenges and issues that created, and therefore some of the safeguards that that profession is starting to apply to legal research. I think all professionals need to not use it that way. I'm trying to find a way to say: use it if you want to double-check, get a starting point, or think about things you might not have thought about, but really question how much credibility you put on anything that it generates.
Drew: Yeah, use it as a secondary source. Write your answer and then run ChatGPT to see if it spots anything that you missed. Don't use it to write your first answer and then try to correct it.
The use case that I'm really worried about, because I know that this is going to happen, and I'm not picking on anyone in particular, is RTOs, registered training organizations that run a whole heap of courses, where people don't have enough pay, don't have enough time, and are under the pump to generate a new course. I know those people hop on the internet to quickly find information, to quickly find slides. It's going to be so tempting to hop onto ChatGPT. Steal someone else's slides, please. Don't try to get ChatGPT to do it for you.
David: Yeah. Drew, the final takeaway here.
Drew: The final one was about overconfidence. ChatGPT seems like it knows what it's talking about. The fact that it's right most of the time is the big problem. If you're dealing with someone who seems pretty reliable and they say the obvious things correctly, then when they suddenly mention a detail, you're not going to stop and think, hold on, maybe it's wrong in this detail. We know that the advice is out of date. We know that ChatGPT hallucinates.
This is really a problem at the second level. If you're using ChatGPT, you know that stuff. But if you use ChatGPT to write something and then someone else reads it not knowing it's generated by ChatGPT, they're not going to expect the hallucinations, they're not going to expect the problems, and they're going to be very easily caught by an answer that seems really authoritative. Imagine all these people who are using ChatGPT to write LinkedIn posts, for example. If you can spot that it's written by ChatGPT, you're fine. But if you think it's actually that person, it sounds good, it sounds reputable, it's got a high degree of confidence, and then it suddenly slips in this stuff that's dangerous.
David: Drew, just before I ask you the final question, one thing that might be worth clarifying here is that we're talking about ChatGPT 3.5 and generative AI that's sourcing internet data. We're seeing a new emergence of generative AI. For example, here in our part of the world, the Australian Institute of Health and Safety has published an AI assistant which has had some walls put around it. I haven't been involved with the project, but they've given it the OHS Body of Knowledge, they've given it a bunch of OHS-related information, and they're limiting the reference pool of material that that assistant actually has.
They've put in all of the Australian-based safety legislation and things like that. Very quickly, if you jump on board from the US into that Australian assistant, it doesn't know the US rules and regulations; it knows the Australian legislation and so on. I think as we go forward, it's important to ask what the reference pool of information is that a generative AI has access to, because there may be some applications that spring up where you go, well, if that's the reference data, then maybe I can put some more credibility in what comes back.
Drew: Yeah, that's a good point, David, but I think here, we need to be really careful that we understand what we're doing. You don't put a wall around generative AI. You give it supplemental training data that it absorbs and prioritizes, but the language model itself has still been trained on the larger set of data. The way it works, it could still be using that larger set and hallucinating because it can't really tell the difference between what is content and what is language. There are some interesting ways that we can improve it, but this is an experimental space, not an industry ready space.
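Neither host goes into how an assistant like that is built, but the pattern Drew is describing is commonly retrieval augmented generation: the curated documents are searched first and pasted into the prompt, while the same general-purpose language model still writes the answer. Here is a rough, hypothetical sketch of that shape, with a naive keyword retriever standing in for a real search index, made-up corpus snippets, and an illustrative model name; it is not the AIHS assistant or any other real product.

# Hypothetical sketch of a "walled" assistant: the curated reference pool is
# retrieved and placed in the prompt, but the general-purpose model underneath
# still generates the text, so hallucination is reduced rather than eliminated.
from openai import OpenAI

client = OpenAI()

# Stand-in for a curated corpus (for example, chapters of a body of knowledge).
corpus = {
    "fatigue": "Fatigue management is a shared responsibility across the work system ...",
    "drowning": "Supervision and physical barriers are primary drowning controls ...",
}

def retrieve(question, k=1):
    # Naive keyword-overlap scoring; a real assistant would use a search index.
    words = set(question.lower().split())
    ranked = sorted(corpus.values(),
                    key=lambda text: len(words & set(text.lower().split())),
                    reverse=True)
    return ranked[:k]

def ask(question):
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name only
        messages=[
            {"role": "system",
             "content": "Answer only from the reference material below. "
                        "If it does not cover the question, say so.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("Is it safe to operate heavy machinery when fatigued?"))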
David: Yeah, great. That's helpful for my own understanding, Drew. It sounds like we might need to do a few more episodes if for nothing else than my own learning. Drew, the question we asked this week was, does ChatGPT provide good safety advice?
Drew: Yes. Generally the advice it gives is good advice, but it's advice that comes with some systemic problems. It's not complete advice, it's not current advice, it's not well-prioritized advice, and it's not well-localized advice. It's good advice, but that doesn't mean that ChatGPT is suitable as an advice-writing tool.
David: Great. That's it for this week. We hope you found this episode thought provoking and ultimately useful in shaping the safety of work in your own organization. Send us any comments, questions, or ideas for future episodes to feedback@safetyofwork.com.