AI in EM: Understanding Bias and Errors Through the Lens of Scientific Studies

This is the third in a series on AI in Emergency Medicine. You can find the previous posts here.


You wouldn’t trust a study with terrible methodology—so why trust an AI tool built on the same flaws?  

AI in emergency medicine is often marketed as precise, fast, and objective. But look closer, and you’ll see familiar problems: selection bias, overfitting, black-box decisions. Sound like bad research? That’s because it is, just in a different form. If you know how to spot flaws in a clinical trial, you already have what it takes to evaluate AI tools in the ED. In this post, we’ll draw parallels between common research pitfalls and how they show up in AI, so you can use that critical eye on the latest tech in your department.  

What Happens When AI Models Memorize Instead of Learn?

Algorithmic Errors and Statistical Model Errors

AI models are just math, and math can go wrong. Poor training data, bad assumptions, or over-engineered complexity can cause misleading results.  

  • Overfitting: Overfitting happens when a model learns its training data too well and loses its ability to generalize. You’ve seen this before: the prediction model that looks perfect in its derivation cohort and falls apart in the real world.1 (There’s a toy code sketch of this right after this list.)
    • Example: A sepsis prediction tool flags almost every ICU patient with sepsis, but when used in triage, it over-alerts on stable patients with minor diagnoses. That’s like using an ICU-only study to guide floor-level care; it just doesn’t work.
  • Black Box Models: Black box models are AI tools that make accurate predictions without revealing how they arrived at them. That’s no different from a convoluted regression model with 15 variables and 3 interactions. You get a result, but no clinical insight.2
    • Example: A deep learning algorithm recommends CT angiography for a patient with vague symptoms. You ask why and get no answer. Would you accept that from a research study?  
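
If you’re curious what overfitting looks like in practice, here is a minimal toy sketch in Python. Everything in it is made up (a fake derivation cohort, a fake external cohort, an arbitrary noisy relationship); the point is only that the overly flexible model fits its own training data nearly perfectly and then does worse than the simple one on new data.

```python
# Toy illustration of overfitting: synthetic data, not a real clinical model.
import numpy as np

rng = np.random.default_rng(0)

# "Derivation" cohort: 15 patients, one noisy predictor of an outcome score.
x_train = np.linspace(0, 10, 15)
y_train = 2 * x_train + rng.normal(0, 3, 15)

# "External validation" cohort drawn from the same underlying relationship.
x_test = np.linspace(0, 10, 200)
y_test = 2 * x_test + rng.normal(0, 3, 200)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on a dataset."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple = np.polyfit(x_train, y_train, deg=1)     # parsimonious model
flexible = np.polyfit(x_train, y_train, deg=10)  # memorizes the training noise

print("derivation error:", round(mse(simple, x_train, y_train), 1), "(simple) vs",
      round(mse(flexible, x_train, y_train), 1), "(flexible)")
print("validation error:", round(mse(simple, x_test, y_test), 1), "(simple) vs",
      round(mse(flexible, x_test, y_test), 1), "(flexible)")
```

That gap between derivation and validation performance is exactly what external validation studies are designed to expose.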

Can You Really Trust That AI Alert?

Human-AI Interaction and Clinical Bias

AI tools don’t make decisions in isolation. Humans interpret them, and we’re not immune to bias.  

  • Automation Bias: When a tool tells us something, we tend to believe it, even if it’s wrong. Same as a researcher who sees what they expect to see.3
    • Example: A resident sees “NSTEMI unlikely” on the AI dashboard and downgrades their concern despite a rising troponin and a worrisome story. The tool, not the data, drove the resident’s interpretation.  

Is Your AI Tool Trained on the Right Patients?

Data Bias and Selection Bias

When training data doesn’t reflect your patient population, results don’t apply. We already reject clinical trials that exclude key demographics; AI should be no different.4 A quick back-of-envelope calculation after the example below shows how much the mismatch can matter.

  • Example: A thrombolytic study with only young, healthy participants? Not helpful in the real-world ED. Same goes for an AI model trained solely on tertiary care center data.  
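
To put rough numbers on it, here is a hypothetical back-of-envelope calculation in Python. The sensitivity, specificity, and prevalence figures are invented, but they show how the same nominally accurate model delivers a very different positive predictive value when your ED’s disease prevalence doesn’t match the cohort it was built on.

```python
# Hypothetical test characteristics: the numbers are illustrative, not from any study.
sens, spec = 0.90, 0.90

def ppv(prevalence):
    """Positive predictive value for a given disease prevalence."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print("PPV at tertiary-center prevalence (30%):", round(ppv(0.30), 2))  # ~0.79
print("PPV at community ED prevalence (3%):    ", round(ppv(0.03), 2))  # ~0.22
```

Same model, same test characteristics; the only thing that changed was the population.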

What If Your Labels Are Just Wrong?

Label Bias and Measurement Bias

If you train AI on subjective human decisions, it learns human flaws, not objective truth.5 The small simulation after the example below makes this concrete.

  • Example: Teaching AI to diagnose pneumonia based on what clinicians documented means it might just learn to mimic their biases, not find actual disease. It’s the same issue as using a flawed survey instrument in a research study.  
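
Here is a small hypothetical simulation (Python, entirely synthetic data) of that failure mode: true disease rates are identical in two groups, but the disease is documented less often in one of them, and the model faithfully reproduces the documentation gap rather than the disease.

```python
# Synthetic illustration of label bias: all rates and group effects are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20000
group = rng.binomial(1, 0.5, n)      # two patient groups with equal true disease rates
true_dz = rng.binomial(1, 0.15, n)   # 15% true prevalence in both groups

# The label the model trains on is what clinicians documented, and documentation
# misses disease more often in group 1 (90% charted in group 0 vs 60% in group 1).
documented = true_dz * rng.binomial(1, np.where(group == 0, 0.9, 0.6), n)

features = np.column_stack([group, rng.normal(size=n)])  # group plus an uninformative feature
model = LogisticRegression().fit(features, documented)

for g in (0, 1):
    predicted = model.predict_proba(features[group == g])[:, 1].mean()
    print(f"group {g}: true disease 15%, model's predicted rate {predicted:.1%}")
```

The model isn’t learning pneumonia; it’s learning who gets a pneumonia diagnosis written down.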

Is Your AI Optimizing for the Wrong Outcome?

Outcome Bias and Publication Bias

Some AI tools aim to maximize outcomes that don’t matter clinically, or worse, reinforce existing disparities.6

  • Example: A model trained to predict admissions starts flagging patients with frequent prior visits. But what it’s really picking up is socioeconomic status, not severity. Just like a journal that only publishes “positive” drug trials, this AI skews toward pre-determined success. And the consequences can be serious: one commercial algorithm used to guide care for millions of patients assigned Black patients the same risk scores as substantially healthier White patients, cutting the number of Black patients flagged for extra care by more than half.5

Is There a Hidden Bias in the Data?

Algorithmic Bias and Confounding

AI can latch onto non-causal correlations just like a confounded study might.7 The toy simulation after the example below shows how easily that happens.

  • Example: A model finds that insured patients survive more often and builds this into predictions. But it’s picking up on access, not biology. That’s no different from a statin study that forgets to control for smoking.  
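
A toy simulation (Python, synthetic data, invented effect sizes) shows how this plays out: survival is driven by severity and by access to follow-up care, not by insurance itself, yet a naive model that only sees insurance status hands it a sizable positive weight.

```python
# Synthetic illustration of confounding: no real patient data or effect sizes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000

insured = rng.binomial(1, 0.6, n)                      # what the model gets to see
access = 0.3 + 0.5 * insured + rng.normal(0, 0.1, n)   # unmeasured confounder: access to care
severity = rng.normal(0, 1, n)                         # true clinical driver

# Survival depends on access and severity, not on insurance directly.
p_survive = 1 / (1 + np.exp(-(1.5 * access - 1.0 * severity)))
survived = rng.binomial(1, p_survive)

# Naive model trained only on what gets charted: insurance and severity.
model = LogisticRegression().fit(np.column_stack([insured, severity]), survived)
print("coefficient on 'insured':", round(model.coef_[0][0], 2))  # clearly positive
```

Nothing in the output warns you that the insurance coefficient reflects access rather than biology; you catch it the same way you would in an observational study, by asking what else could explain the association.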

What Happens When the Charting Is a Mess?

Data Quality and Research Integrity

Poor data ruins both research and AI.  

  • Garbage In, Garbage Out: Train on bad charting, get bad outputs.
    • Example: A triage tool trained on messy ICD-10 codes ends up mistaking GERD for ACS.
  • Missing Data: AI can’t guess what it hasn’t seen.
    • Example: A trauma mortality model overlooks out-of-hospital deaths, so suddenly your survival rates look great, but only because the deaths never made it into the dataset. And the issue is widespread: more than 20% of key variables in many EHR datasets are missing, often in non-random ways.8,9
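
A quick hypothetical sketch in pandas (made-up numbers) shows how non-random missingness flatters the results:

```python
# Made-up cohort of 100 trauma patients: 40 died, but 25 of those deaths happened
# out of hospital and never generated a chart, so they are absent from the registry.
import pandas as pd

patients = pd.DataFrame({
    "died":        [1] * 40 + [0] * 60,
    "in_registry": [False] * 25 + [True] * 15 + [True] * 60,
})

observed = patients[patients["in_registry"]]
print("true mortality:    ", patients["died"].mean())            # 0.40
print("registry mortality:", round(observed["died"].mean(), 2))  # 0.20, twice as good on paper
```

A model trained on that registry never sees the sickest patients, so its optimism is baked in before the first line of code.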

Why This Matters in the ED

As emergency physicians, we already question new studies before we change practice. We need to do the same with AI tools.  

  • Patient Safety: Faulty predictions can misdiagnose, misguide, and delay care.  
  • Equity: AI can bake in bias if it’s not critically evaluated.  
  • Responsibility: You’re still the clinician. If an AI tool gives bad advice, it’s still your name on the chart.  

How to Keep AI in Check—Lessons from Evidence-Based Medicine

  • Diverse, Representative Datasets: Like multicenter trials, broader datasets improve generalizability.  
  • Transparency and Interpretability: If the model can’t explain itself, be skeptical.  
  • Humans in the Loop: Let AI assist, not dictate. You bring context AI can’t.  
  • Oversight: Like IRBs for trials, we need systems to review AI for safety and fairness.  

AI tools aren’t magic. They’re just the latest in a long line of decision aids. Like any tool, they’re only as good as their design and oversight. The good news? If you can critically appraise a study, you can critically appraise an AI model. Same biases, new packaging. Keep asking the hard questions.  

References

  1. Harrell, Frank E., Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed., Springer, 2015. https://link.springer.com/book/10.1007/978-3-319-19425-7  
  2. Doshi-Velez, Finale, and Been Kim. “Towards A Rigorous Science of Interpretable Machine Learning.” arXiv preprint arXiv:1702.08608, 2017. https://arxiv.org/abs/1702.08608  
  3. Goddard, K., Roudsari, A., and Wyatt, J. C. “Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators.” Journal of the American Medical Informatics Association, vol. 19, no. 1, 2012, pp. 121-27. https://pmc.ncbi.nlm.nih.gov/articles/PMC3240751/
  4. Rothman, Kenneth J., Sander Greenland, and Timothy L. Lash. Modern Epidemiology. 3rd ed., Lippincott Williams & Wilkins, 2008.    
  5. Obermeyer, Ziad, et al. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science, vol. 366, no. 6464, 2019, pp. 447-53. https://www.science.org/doi/10.1126/science.aax2342
  6. Hopewell, Sally, et al. “Publication Bias in Clinical Trials Due to Statistical Significance or Direction of Trial Results.” Cochrane Database of Systematic Reviews, no. 1, 2009. https://www.cochranelibrary.com/cdsr/doi/10.1002/14651858.MR000010.pub3/full
  7. Greenland, Sander, Judea Pearl, and James M. Robins. “Causal Diagrams for Epidemiologic Research.” Epidemiology, vol. 10, no. 1, 1999, pp. 37-48. https://journals.lww.com/epidem/Abstract/1999/01000/Causal_Diagrams_for_Epidemiologic_Research.8.aspx
  8. Sterne, Jonathan A. C., et al. “Risk of Bias in Randomised Clinical Trials: A Proposed Tool for Systematic Reviews.” Journal of Clinical Epidemiology, vol. 56, no. 5, 2003, pp. 455-63. https://pubmed.ncbi.nlm.nih.gov/31462531/
  9. Weiskopf, N. G., and Weng, C. “Methods and Dimensions of Electronic Health Record Data Quality Assessment: Enabling Reuse for Clinical Research.” Journal of the American Medical Informatics Association, vol. 20, no. 1, 2013, pp. 144-151. https://pmc.ncbi.nlm.nih.gov/articles/PMC3555312/

New EB Medicine AI Tool

Sign up now to become a beta tester of our new AI tool designed to answer your clinical questions based on Emergency Medicine Practice educational content.
 
As our thank you, you will receive 3 FREE months of Emergency Medicine Practice
(existing subscribers will get an extension on their current subscription; non-subscribers will get a new 3-month subscription). 
 
Contact us today and we’ll reach out with the next steps.
