How AI Development Mirrors Scientific Research

This is the second in a series on AI in Emergency Medicine. You can find the first post here.

Introduction 

Artificial intelligence (AI) is changing emergency medicine, making diagnosis faster and improving patient care. But for many clinicians, AI feels like a black box. To make it easier to understand, let’s compare AI development to something you’re already familiar with: a research study. Just like a clinical trial follows a structured process—from hypothesis generation to peer review—AI development follows a similar path.

1. Data Collection: The AI Version of Study Recruitment 

In research, the first step is defining a study population and collecting data—whether through patient charts, clinical trials, or observational studies. AI follows the same principle, but on a larger scale. Data sources include electronic health records (EHRs), imaging, lab results, and real-time patient monitoring. Just as a study with a biased sample (eg, only including young, healthy patients) would produce unreliable results, an AI trained on limited or non-representative data won’t generalize well. Ethical considerations, like informed consent in research, parallel AI’s need for patient privacy compliance (eg, HIPAA). A poor dataset leads to flawed conclusions in both AI and research. 
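As a concrete (and deliberately simple) illustration, here is a sketch of the kind of representativeness check a team might run before training. Everything here is hypothetical: the file name, the column names, and the 65-year threshold are stand-ins, not a real pipeline.

```python
import pandas as pd

# Hypothetical cohort extract; the file and column names are illustrative only.
cohort = pd.read_csv("ed_visits.csv")

# Summarize who is actually in the training data.
print(cohort["age"].describe())
print(cohort["sex"].value_counts(normalize=True))

# A simple red flag: very few older adults in an ED dataset suggests the
# model may not generalize to the population that uses the ED most.
share_over_65 = (cohort["age"] > 65).mean()
if share_over_65 < 0.20:
    print(f"Warning: only {share_over_65:.0%} of the cohort is over 65")
```

This is the AI analog of checking a study's baseline characteristics table before trusting its conclusions.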

2. Data Preprocessing: The AI Equivalent of Data Cleaning in Research 

Once you’ve gathered data for a study, the next step is cleaning it—removing outliers, handling missing values, and standardizing measurements. AI follows the same process. Raw medical data is messy, with missing patient vitals, inconsistent units, and even typos. Before an AI model can learn from the data, it must be cleaned and formatted correctly, just like researchers clean datasets before running statistical analyses. If errors aren’t caught here, they can lead to incorrect conclusions, much like how a poorly prepared dataset can invalidate a clinical study. Over-cleaning, however, can remove meaningful variations—just as excessive exclusion criteria in a study can limit its real-world applicability. 
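To make this concrete, here is a minimal preprocessing sketch in Python with pandas. The column names, unit heuristic, and cutoffs are all hypothetical; the point is the pattern: standardize units, drop impossible values, and handle missingness explicitly without over-cleaning.

```python
import pandas as pd

# Hypothetical raw vitals extract; column names are illustrative.
vitals = pd.read_csv("raw_vitals.csv")

# Standardize units: suppose temperature arrives as a mix of Fahrenheit and
# Celsius. The >50 cutoff is a crude heuristic for this sketch.
fahrenheit = vitals["temp"] > 50
vitals.loc[fahrenheit, "temp"] = (vitals.loc[fahrenheit, "temp"] - 32) * 5 / 9

# Remove physiologically impossible values (likely typos), but keep extreme
# real presentations so we don't "over-clean" away meaningful variation.
vitals = vitals[vitals["heart_rate"].between(20, 300)]

# Handle missing values explicitly: impute the median, and keep a flag so
# the model can learn that the value was missing in the first place.
vitals["sbp_missing"] = vitals["sbp"].isna()
vitals["sbp"] = vitals["sbp"].fillna(vitals["sbp"].median())
```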

3. Model Training: Hypothesis Testing in AI 

In research, you develop a hypothesis and test it using statistical methods. AI does something similar during model training. The model is “hypothesizing” relationships in the data—for example, that certain vital sign patterns predict sepsis—and adjusting its parameters to minimize prediction errors. Just like different statistical models (eg, logistic regression vs. survival analysis) are chosen based on study design, different AI architectures (eg, deep learning vs. decision trees) are chosen based on the problem.

Overfitting can also occur, leading to a model that performs well on training data but fails in the real world, much like a study that finds statistical significance by chance but lacks reproducibility. It is the AI equivalent of p-hacking, the misuse of data analysis to hunt for patterns that appear to support a researcher’s hypothesis.
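A toy example makes the overfitting point tangible. The sketch below uses scikit-learn and synthetic data (a stand-in for real vitals, not a sepsis model) to train two candidate architectures; the unconstrained decision tree effectively memorizes the training set and scores noticeably worse on unseen cases.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for "vital sign patterns predict sepsis."
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Two candidate "architectures" for the same question.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit

# The unconstrained tree memorizes the training data: near-perfect training
# accuracy but a lower score on unseen cases. That gap is overfitting.
print("logreg train/test:", logreg.score(X_train, y_train), logreg.score(X_test, y_test))
print("tree   train/test:", tree.score(X_train, y_train), tree.score(X_test, y_test))
```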

4. Model Validation: AI’s Version of Peer Review 

In research, before publishing, you validate findings by testing them on new populations or performing sensitivity analyses. AI undergoes a similar validation process—once trained, it’s tested on an unseen dataset to ensure it generalizes beyond the training data. If an AI model performs well on training data but poorly on new cases, that’s akin to a study that looks promising in a small cohort but fails in a larger clinical trial. Biases often emerge here. Just as studies with poor external validity fail to generalize, an AI trained on biased data may underperform in diverse populations. 

When evaluating AI models, three important performance metrics help us understand how well the model is doing: accuracy, precision, and recall. Each one tells us something different about how the model makes predictions. 

  • Accuracy (ie, how often the model gets it right) is simply the percentage of correct predictions out of all predictions made. It’s a good overall measure when you have a balanced dataset. But if your dataset is skewed (eg, 95% of patients don’t have a disease), a model that always predicts “no disease” will still seem highly accurate—while actually being pretty useless. 
  • Precision is similar to positive predictive value (ie, how many of the “yes” predictions were actually right). It tells us, “Out of all the cases the model labeled as positive, how many were actually positive?” This is important when false positives are a big deal—like diagnosing someone with a disease they don’t have and subjecting them to unnecessary treatments.  
  • Recall is similar to sensitivity (ie, how many of the actual positives the model caught). It measures how well the model finds all the positive cases. If recall is low, that means the model missed a lot of real positive cases. This is a big problem in areas like cancer detection, where missing a diagnosis (false negatives) can be life-threatening. 

Understanding accuracy, precision, and recall helps us see the strengths and weaknesses of an AI model—whether it’s better at avoiding false alarms or at catching every possible case. The key is finding the right balance for the situation, ensuring the model performs well where it matters most in real-world decision-making. 
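Here is a small worked example of why accuracy alone can mislead on an imbalanced dataset. The numbers are invented: 20 patients, 2 with disease. The lazy model and the more useful model have the same accuracy, and only precision and recall reveal the difference.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true  = [0]*18 + [1]*2       # 20 patients, only 2 with the disease
y_lazy  = [0]*20               # always predicts "no disease"
y_model = [0]*17 + [1, 1, 0]   # one false alarm, catches 1 of the 2 cases

print("lazy accuracy:  ", accuracy_score(y_true, y_lazy))    # 0.90, yet useless
print("model accuracy: ", accuracy_score(y_true, y_model))   # also 0.90
print("model precision:", precision_score(y_true, y_model))  # 0.50: of flagged, how many real?
print("model recall:   ", recall_score(y_true, y_model))     # 0.50: of real, how many caught?
```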

5. Deployment: AI’s Version of Clinical Implementation 

In research, once a study passes peer review, it moves into clinical guidelines and real-world practice. AI follows a similar path—once validated, it’s deployed into emergency settings, where it assists in diagnosing conditions, triaging patients, and predicting deterioration. However, just like new clinical protocols require monitoring to assess their real-world effectiveness, AI models need ongoing surveillance. If errors arise (eg, an AI misclassifying stroke severity in certain populations), adjustments are needed. Just as clinicians use guidelines as tools rather than rigid rules, AI should be viewed as an assistive system, not a replacement for medical judgment.
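What might that surveillance look like in code? Below is a minimal sketch of one possible drift check: compare recall on recently confirmed outcomes against the baseline measured during validation, and raise an alert when it degrades. The baseline, margin, and example data are all hypothetical.

```python
from sklearn.metrics import recall_score

BASELINE_RECALL = 0.85   # hypothetical figure from the validation study
ALERT_MARGIN = 0.10

def check_for_drift(recent_labels, recent_predictions):
    """Flag the model for review if recall on recent cases has degraded."""
    recent_recall = recall_score(recent_labels, recent_predictions)
    if recent_recall < BASELINE_RECALL - ALERT_MARGIN:
        print(f"ALERT: recall dropped to {recent_recall:.2f}; review the model")
    return recent_recall

# Example: confirmed outcomes and model predictions from recent cases.
check_for_drift([1, 1, 1, 0, 0, 1, 0, 1], [1, 0, 0, 0, 0, 1, 0, 1])
```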

6. Continuous Learning: AI’s Version of Ongoing Medical Research 

Medicine evolves, and so do AI models. In research, new findings continuously update clinical guidelines—AI requires similar ongoing learning. AI models can be retrained with new patient data, ensuring they stay relevant as medical knowledge advances. This is similar to a systematic review incorporating new studies. However, just as incorporating low-quality studies can weaken a meta-analysis, blindly updating AI models without proper validation can introduce errors. Some AI systems use real-time learning (online learning), akin to adaptive clinical trials that adjust protocols based on emerging data. 
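For the curious, scikit-learn’s partial_fit is one simple way to express this incremental updating. The sketch below uses synthetic data and is only meant to show the shape of online learning, not a production retraining pipeline.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)

# Initial training on historical data (synthetic stand-ins here).
X_hist = rng.normal(size=(500, 5))
y_hist = rng.integers(0, 2, size=500)
model.partial_fit(X_hist, y_hist, classes=[0, 1])

# Later, a new batch of cases arrives. In practice the updated model should
# be validated on held-out data before deployment, per the caution above.
X_new = rng.normal(size=(50, 5))
y_new = rng.integers(0, 2, size=50)
model.partial_fit(X_new, y_new)   # incremental update, no full retraining
```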

7. Multi-Agent Systems: AI as a Multicenter Study 

Single-center studies have limitations, which is why multicenter trials provide broader, more reliable insights. Similarly, AI in medicine often uses multiple models working together. One model might analyze x-rays, another predict sepsis risk, and a third monitor vitals—just like how different subspecialists collaborate on complex cases. However, coordinating multiple AI agents requires careful design, just as multicenter studies need standardized protocols to ensure consistency across sites. A poorly integrated AI system is like a multicenter study where different locations use different diagnostic criteria, leading to inconsistent outcomes. 
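A toy sketch of the coordination problem: three “agents” (stand-ins for specialized models; the findings and confidence numbers are invented) report to a coordinator that applies one shared rule, the way a multicenter study applies one shared protocol.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    source: str
    finding: str
    confidence: float

# Each "agent" stands in for a specialized model; the outputs are invented.
def imaging_agent(chest_xray):
    return Assessment("imaging", "possible infiltrate", 0.72)

def sepsis_agent(vitals, labs):
    return Assessment("sepsis-risk", "elevated risk", 0.64)

def monitoring_agent(vitals_stream):
    return Assessment("monitoring", "trending hypotension", 0.81)

def coordinator(assessments, threshold=0.6):
    """One shared rule for every agent, like a standardized study protocol."""
    return [a for a in assessments if a.confidence >= threshold]

for a in coordinator([imaging_agent(None), sepsis_agent(None, None),
                      monitoring_agent(None)]):
    print(f"{a.source}: {a.finding} ({a.confidence:.0%})")
```

Without that shared threshold and a common output format, the agents would be like study sites applying different diagnostic criteria.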

Conclusion: AI and Research Follow the Same Scientific Principles 

Understanding AI through the lens of medical research makes it clear: AI isn’t magic; it’s a structured scientific process. Just like a well-conducted study, AI requires rigorous data collection, validation, and ongoing refinement. And just as no single study is definitive, AI is not infallible—it must be used alongside clinical expertise. AI’s future in emergency medicine depends on responsible development, continual learning, and thoughtful integration into clinical workflows.
