When AI Fails: Errors You’ll Never See in a Scientific Journal

This is the fourth in a series on AI in Emergency Medicine. You can find the previous posts here.


AI is changing medicine, research, and just about everything else. But it’s not perfect, and when it messes up, it doesn’t make the same kinds of mistakes we see in traditional science. Here’s a look at six ways AI can fail (in ways human researchers usually don’t), why these things happen, how we can prevent them, and some real-world examples from medicine, especially emergency medicine.

1. Hallucination (a.k.a. Making Stuff Up)

What it is:

Sometimes AI flat-out invents facts, and does so with complete confidence. The answer sounds right, but it isn't.

Why it happens:

Large language models (LLMs) like ChatGPT are built to predict which words are likely to come next, not to fact-check. They have no way of knowing whether something is real unless you connect them to outside sources of information.

How to prevent it:

  • Set them up to retrieve from real, trusted databases before answering (this is called retrieval-augmented generation, or RAG); a minimal sketch follows this list.
  • Make them cite their sources.
  • Warn users when answers might not be verified.
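
For the technically curious, here is a minimal sketch of the RAG idea in Python. The guideline snippets and the keyword-overlap "retriever" are toy stand-ins of my own invention, not a real clinical database or vector index; the point is simply that the model is handed verified text and told to cite it.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# GUIDELINE_SNIPPETS and the keyword-overlap retriever are toy stand-ins
# for a curated database and a proper vector index.
import re

GUIDELINE_SNIPPETS = {
    "sepsis-2021": "Begin broad-spectrum antibiotics within 1 hour of sepsis recognition.",
    "stroke-2019": "Obtain non-contrast head CT before thrombolysis in suspected stroke.",
    "chestpain-2021": "Use a validated risk score to stratify chest pain patients.",
}

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank snippets by crude keyword overlap with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(
        GUIDELINE_SNIPPETS.items(),
        key=lambda item: len(q_tokens & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Ground the model in retrieved text and require citations."""
    context = "\n".join(f"[{sid}] {text}" for sid, text in retrieve(question))
    return (
        "Answer using ONLY the sources below, and cite the source ID for every claim. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How fast should antibiotics be started in sepsis?"))
```

In a real deployment the retriever would query a maintained, versioned source library, and the answers would still need human verification.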

Medical example:

A 2023 study in Cureus found that ChatGPT fabricated references with incorrect titles, authors, and journal names, highlighting the risks of relying on AI-generated citations without verification.¹


2. Bias Baked Right In

What it is:

AI can pick up society’s ugly biases from its training data and carry them into its answers.

Why it happens:

These models read everything — good, bad, biased, offensive — from the internet. They learn patterns without understanding if they’re fair or right.

How to prevent it:

  • Test the AI for bias across different patient groups (one way to audit this is sketched after this list).
  • Make sure the training data is more balanced.
  • Use bias-correction techniques after training.
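
As a sketch of the first bullet, here is what a simple subgroup bias audit can look like, assuming you have a labeled test set that records group membership. The records and group names below are invented toy data, not results from any real model.

```python
# Sketch of a simple subgroup bias audit: compare false-negative rates
# (missed diagnoses) across patient groups. The records are made-up toy
# data; a real audit would use a held-out clinical test set and report
# confidence intervals, not point estimates.
from collections import defaultdict

records = [
    # (group, true_label, model_prediction); 1 = disease present
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 1, 0), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 1, 0), ("group_b", 1, 1), ("group_b", 0, 0),
]

missed = defaultdict(int)     # true positives the model called negative
positives = defaultdict(int)  # all true positives per group

for group, truth, prediction in records:
    if truth == 1:
        positives[group] += 1
        if prediction == 0:
            missed[group] += 1

for group in sorted(positives):
    fnr = missed[group] / positives[group]
    print(f"{group}: false-negative rate = {fnr:.2f}")

# A large gap between groups (here 0.33 vs 0.67) is a red flag that the
# model underdiagnoses one population and needs rebalanced training data
# or post-hoc correction before it goes anywhere near patients.
```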

Medical example:

A 2021 study found that chest X-ray AI models trained on large public datasets systematically underdiagnosed disease in under-served patient groups, including Black patients and those on Medicaid.² In emergency medicine, that could mean missed cases.


3. Missing the Point (Context Failures)

What it is:

AI often misses subtle details, the things human experts naturally pick up on.

Why it happens:

AI is pattern-matching words, not truly understanding them. It can miss the big picture when multiple pieces of information need to be connected.

How to prevent it:

  • Use chain-of-thought prompting, which basically means asking it to reason through the problem step by step (see the sketch after this list).
  • Be explicit in prompts about what matters.
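
Here is a rough sketch of what a chain-of-thought style prompt might look like for a triage question. The wording and the clinical vignette are my own illustrative assumptions, not a validated prompt.

```python
# Sketch of a chain-of-thought style prompt for a triage question.
# The wording and vignette are illustrative assumptions; the point is to
# force step-by-step reasoning and name the details that matter instead of
# asking for a one-word answer.

def triage_prompt(vignette: str) -> str:
    return (
        "You are assisting with emergency department triage.\n"
        f"Case: {vignette}\n\n"
        "Work through this step by step before answering:\n"
        "1. List the abnormal vital signs and red-flag symptoms.\n"
        "2. Explain which findings, taken together, change the acuity.\n"
        "3. Name the worst-case diagnosis that must not be missed.\n"
        "4. Only then assign a triage level, and show how steps 1-3 led to it.\n"
    )

print(triage_prompt(
    "62-year-old with chest pressure, HR 110, BP 92/60, diaphoretic."
))
```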

Medical example:

A 2024 study evaluating ChatGPT’s triage skills using the Korean Triage and Acuity Scale (KTAS) found that ChatGPT’s triage decisions agreed with expert assessments less closely than those of emergency medical personnel did.³


4. Echo Chambers from Human Feedback

What it is:

AI can end up sounding like a small group of trainers, even if that group isn’t representative.

Why it happens:

In reinforcement learning from human feedback (RLHF), AI gets rewarded with higher scores when it gives answers humans like. If your human reviewers are all similar, the model just copies their biases.

How to prevent it:

  • Get a more diverse group of human reviewers.
  • Let users tweak the AI’s style, tone, or assumptions.

Medical example:

A 2024 study showed that an AI model developed in the United Kingdom for COVID-19 triage performed significantly worse when applied to hospitals in Vietnam, highlighting how models trained in one region often struggle to generalize to different healthcare environments.⁴
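
One practical safeguard this example points to is external validation: before trusting a model at a new site, measure its performance there separately. The sketch below uses invented labels and scores (and scikit-learn's roc_auc_score); it is a generic sanity check, not the method used in the cited study, and "site" could just as easily be a region, hospital type, or patient group.

```python
# Sketch of a per-site external validation check with invented toy values.
from sklearn.metrics import roc_auc_score

evaluations = {
    "development_site": {
        "y_true":  [1, 0, 1, 0, 1, 0],
        "y_score": [0.9, 0.2, 0.8, 0.3, 0.7, 0.1],
    },
    "new_external_site": {
        "y_true":  [1, 0, 1, 0, 1, 0],
        "y_score": [0.6, 0.5, 0.4, 0.7, 0.5, 0.3],
    },
}

for site, data in evaluations.items():
    auc = roc_auc_score(data["y_true"], data["y_score"])
    print(f"{site}: AUROC = {auc:.2f}")

# A sharp drop at the external site (here 1.00 vs 0.50) means the model
# should be recalibrated or retrained locally before anyone relies on it.
```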


5. Multimodal Misalignment (a.k.a. Getting Mixed Up)

What it is:

When AI handles images and text together (multimodal data), it sometimes makes strange connections.

Why it happens:

Sometimes the images and captions in training don’t match very well. The model learns the wrong lessons about what’s connected to what.

How to prevent it:

  • Train on clean, properly matched image-text pairs.
  • Apply penalties during training when the model pairs an image with the wrong text, so it gets negative feedback and learns which associations actually belong together (a simplified sketch follows this list).
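
For readers who want the mechanics, here is a simplified sketch of a CLIP-style contrastive objective in PyTorch, with random tensors standing in for real image and text encoders. Matched image-text pairs sit on the diagonal of a similarity matrix and are rewarded; mismatched pairs are, in effect, penalized. This illustrates the general idea, not the training code of any specific medical model.

```python
# Simplified sketch of a CLIP-style contrastive objective. Random tensors
# stand in for real image and text encoders.
import torch
import torch.nn.functional as F

batch_size, dim = 8, 64
image_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)

temperature = 0.07
logits = image_emb @ text_emb.T / temperature  # pairwise similarity scores
targets = torch.arange(batch_size)             # the correct caption shares the index

# Cross-entropy pulls each image toward its own caption (the diagonal) and
# pushes it away from every other caption in the batch, and vice versa.
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
print(f"contrastive loss: {loss.item():.3f}")
```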

Medical example:

A 2017 study on abdominal ultrasound classification showed that deep learning models’ performance heavily depended on the quality of labeled training data, and that poor labeling could lead to misclassifications.⁵


6. Falling Behind (Temporal Staleness)

What it is:

AI doesn’t stay up-to-date unless you specifically retrain it.

Why it happens:

Most models are trained once on a giant dataset, and they don’t automatically keep learning unless someone builds a system to update them.

How to prevent it:

  • Connect AI to real-time databases.
  • Plan regular updates or retraining, ideally triggered by monitoring for performance drift (see the sketch after this list).
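
One way to make "regular updates" concrete is drift monitoring: track a performance metric over time and flag when it falls below an agreed floor. The monthly accuracy values and the threshold below are invented purely for illustration.

```python
# Sketch of drift monitoring: track a performance metric over time and flag
# when it falls below a floor that should trigger review or retraining.
# The monthly accuracies and the threshold are invented.

monthly_accuracy = {
    "2024-01": 0.91, "2024-02": 0.90, "2024-03": 0.89,
    "2024-04": 0.86, "2024-05": 0.82, "2024-06": 0.78,
}

RETRAIN_THRESHOLD = 0.85  # assumed acceptable floor; set this per use case

for month, accuracy in monthly_accuracy.items():
    flag = "  <- drift: schedule review/retraining" if accuracy < RETRAIN_THRESHOLD else ""
    print(f"{month}: accuracy {accuracy:.2f}{flag}")
```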

Medical example:

A 2020 review in BMJ found that most COVID-19 prediction models were poorly validated and became outdated quickly as medical understanding evolved.⁶


Summary

AI is powerful, but it stumbles in ways traditional scientific research usually doesn’t. Instead of the familiar kinds of errors that peer review is designed to catch, it fabricates facts, picks up hidden biases, misreads important context, and falls out of date. If we want to use AI safely in medicine (and anywhere else), we have to know where it tends to trip up and actively design around those weak spots. When we do, AI can be a game-changing partner, not a silent liability.


References

  1. Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus. 2023;15(2):e35179. https://www.cureus.com/articles/138667-artificial-hallucinations-in-chatgpt-implications-in-scientific-writing
  2. Seyyed-Kalantari L, Zhang H, McDermott M, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med. 2021;27(12):2176-2182. https://www.nature.com/articles/s41591-021-01595-0
  3. Kim DW, Kim SY, Cho SM, Lee S, Park JH, Kim YJ. Performance of ChatGPT on the Korean Triage and Acuity Scale: A Comparative Study With Emergency Medical Personnel. Digit Health. 2024;10:20552076241224518. https://pubmed.ncbi.nlm.nih.gov/38250148/
  4. Yang J, Thanh Dung N, Ngoc Thach P, et al. Generalizability assessment of AI models across hospitals in a low-middle and high income country. Nat Commun. 2024;15(1):8270. https://www.nature.com/articles/s41467-024-52618-6
  5. Cheng PM, Malhi HS. Transfer Learning with Convolutional Neural Networks for Classification of Abdominal Ultrasound Images. J Digit Imaging. 2017;30(2):234-243. https://link.springer.com/article/10.1007/s10278-016-9929-2
  6. Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. BMJ. 2020;369:m1328. https://www.bmj.com/content/369/bmj.m1328
