The rapid evolution of artificial intelligence (AI) systems offers tremendous potential for pharmacovigilance, particularly in recognizing rare but clinically significant events. However, the decreasing cost and effort required to develop AI models create a risk that organizations may invest insufficient time in understanding their limitations and sources of error. This comprehensive review summarizes a landmark paper from the Uppsala Monitoring Centre (UMC) that outlines a framework for critically appraising AI models for rare-event recognition.
The paper introduces key dimensions for evaluation, including problem framing, test set design, prevalence-aware statistical metrics, robustness assessment, and integration into human workflows. It also proposes a novel approach called Structured Case-Level Examination (SCLE) to complement statistical performance evaluation and provides practical considerations for procuring or developing AI models in this space. Drawing on three pharmacovigilance case studies—rule-based retrieval of pregnancy-related reports, machine learning-based duplicate detection, and automated redaction of person names using large language models—the paper demonstrates how these principles can be applied in practice.
1. Introduction: The Promise and Peril of AI in Pharmacovigilance
Artificial intelligence (AI) is transforming pharmacovigilance. From automated case processing to signal detection and duplicate identification, AI systems promise to enhance efficiency, improve quality, and enable entirely new capabilities. However, the rapid proliferation of AI tools—coupled with decreasing development costs—creates a critical challenge: how do we know which AI systems actually work?
As the authors of this landmark paper from the Uppsala Monitoring Centre (UMC) note, "[a]s the effort and expertise required to develop modern AI decrease, there is a risk that organizations devote too little time to understanding their limitations and sources of error." This is particularly problematic in pharmacovigilance, where errors can have serious consequences for patient safety.
The paper focuses on rare-event recognition—a common challenge in pharmacovigilance where the events of interest (e.g., specific adverse drug reactions, duplicates, pregnancy exposures) occur infrequently. In such settings, traditional performance metrics can be highly misleading, and careful evaluation is essential.
2. The Unique Challenges of Rare-Event Recognition
2.1 Why Rare Events Are Different
In pharmacovigilance, many tasks involve recognizing rare events:
| Task | Event of Interest | Prevalence |
|---|---|---|
| Signal detection | New or emerging safety signals | Very low |
| Duplicate detection | True duplicate reports | Low |
| Pregnancy identification | Reports involving pregnancy exposure | Low |
| Name redaction | Person names in narratives | Very low |
When events are rare, traditional performance metrics can be deceptive:
- Apparent accuracy can be high even if the model performs poorly on the events that matter
- Test sets may not contain enough positive examples for reliable evaluation
- Precision estimates are highly sensitive to the prevalence of positives in the test set
- Specificity must be extremely high to achieve acceptable precision
2.2 The Asymmetric Cost of Errors
In rare-event recognition, the cost of different types of errors is often asymmetric:
| Error Type | Cost in Pharmacovigilance |
|---|---|
| False positive (flagging a non-event) | Wasted reviewer time, alert fatigue, potential bias in downstream analyses |
| False negative (missing a real event) | Missed safety signals, delayed action, patient harm |
Understanding these costs is essential for selecting appropriate decision thresholds and performance metrics.
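The trade-off can be made concrete by comparing operating points under explicit error costs. A minimal sketch, where all rates and per-error costs are illustrative assumptions rather than figures from the paper:

```python
def expected_error_cost(recall: float, specificity: float, prevalence: float,
                        cost_fn: float, cost_fp: float,
                        n_cases: int = 10_000) -> float:
    """Expected total cost of errors over n_cases, given asymmetric
    per-error costs (illustrative values, not from the paper)."""
    false_negatives = (1 - recall) * prevalence * n_cases
    false_positives = (1 - specificity) * (1 - prevalence) * n_cases
    return cost_fn * false_negatives + cost_fp * false_positives

# When missing an event costs far more than reviewing a false alarm,
# a high-recall operating point can be cheaper overall despite more flags:
lenient = expected_error_cost(0.95, 0.980, 0.01, cost_fn=100.0, cost_fp=1.0)
strict = expected_error_cost(0.70, 0.999, 0.01, cost_fn=100.0, cost_fp=1.0)
```

Under these assumed costs, the lenient threshold wins even though it produces roughly twenty times more false positives, because false negatives dominate the total.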
3. Key Dimensions for Appraising AI Models
The paper outlines several key dimensions for critical appraisal of AI models for rare-event recognition.
3.1 Problem Framing and Test Set Design
Key Questions:
- What is the intended use case? (Efficiency, quality, or capability?)
- Does the test set reflect the deployment domain?
- Are positive and negative controls clearly defined?
- Are ambiguous cases handled appropriately?
Key Insight: The nature of the test set should align with the desired deployment domain. If an AI model is intended for broad use across collections of adverse event reports, the test set should include reports from diverse sources, drugs, and events.
3.2 Prevalence-Aware Statistical Evaluation
The paper emphasizes that standard performance metrics must be interpreted with care in rare-event settings.
3.2.1 Recall (Sensitivity)
Recall = True Positives / (True Positives + False Negatives)
Recall measures how many of the events of interest are correctly identified. It requires a representative set of positive controls. If test sets are enriched with positive controls (to increase their number), recall may be overestimated if hard-to-recognize positives are also less likely to be included.
3.2.2 Precision (Positive Predictive Value)
Precision = True Positives / (True Positives + False Positives)
Precision measures how many of the predicted positives are actually correct. It is highly dependent on the prevalence of positives in the test set. If test sets are enriched with positives, precision estimates will be optimistic.
Critical Warning: The expected precision of random guessing equals the prevalence of positive controls. For a balanced test set (50% positives), random guessing gives 50% precision—which may appear impressive but is worthless in practice.
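This pitfall is easy to demonstrate with a small simulation (a sketch; the coin-flip "model" and sample size are illustrative):

```python
import random

random.seed(0)

def random_guess_precision(prevalence: float, n: int = 100_000) -> float:
    """Simulate a classifier that flags cases at random and return its
    precision, which converges to the prevalence of positives."""
    labels = [random.random() < prevalence for _ in range(n)]
    flagged = [random.random() < 0.5 for _ in range(n)]  # coin-flip "model"
    tp = sum(1 for y, f in zip(labels, flagged) if y and f)
    fp = sum(1 for y, f in zip(labels, flagged) if not y and f)
    return tp / (tp + fp)

# On a balanced test set, random guessing already "achieves" ~50% precision;
# at 1% prevalence the very same guessing yields only ~1% precision.
print(random_guess_precision(0.5))
print(random_guess_precision(0.01))
```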
3.2.3 Specificity (True Negative Rate)
Specificity = True Negatives / (True Negatives + False Positives)
Specificity measures how many negatives are correctly classified. It is independent of prevalence, making it useful for comparing models. However, for rare events, specificity must be extremely close to 1 to achieve acceptable precision, requiring very large test sets.
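The interplay between specificity and prevalence follows from the standard identity precision = (recall × prevalence) / (recall × prevalence + (1 − specificity) × (1 − prevalence)). A sketch with illustrative numbers (not drawn from the paper):

```python
def expected_precision(recall: float, specificity: float,
                       prevalence: float) -> float:
    """Expected precision implied by recall, specificity, and the
    prevalence of positives in the deployment population."""
    tp_rate = recall * prevalence
    fp_rate = (1 - specificity) * (1 - prevalence)
    return tp_rate / (tp_rate + fp_rate)

# With 90% recall and 99% specificity, precision collapses as events
# become rarer, dropping below 10% at a prevalence of 0.1%:
for p in (0.5, 0.05, 0.001):
    print(p, expected_precision(0.90, 0.99, p))
```

This is why, for rare events, specificity must sit extremely close to 1 before precision becomes acceptable.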
3.2.4 Composite Metrics
Metrics like F1-score (harmonic mean of precision and recall) inherit the limitations of their components. Receiver Operating Characteristic (ROC) curves and area under the curve (AUC) are dominated by portions of the decision curve that are irrelevant for rare events.
Recommendation: Focus on recall and precision, using model-specific precision test sets drawn from the actual deployment population.
3.3 Robustness Assessment
Robustness refers to the stability and reliability of AI model performance under varying conditions.
| Aspect | Considerations |
|---|---|
| Stability | For non-deterministic models (e.g., generative LLMs), assess consistency across repeated executions |
| Data drift | Monitor performance over time as input data changes |
| Subgroup performance | Evaluate whether performance varies across different data subsets (e.g., by country, source, demographics) |
| Fairness | Ensure the model does not under-serve or bias against specific groups |
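For the stability aspect, repeated executions can be summarized as a simple agreement rate. A minimal sketch, where `classify` stands in for any (possibly non-deterministic) model call and is an assumption of this example:

```python
def stability_rate(classify, cases, n_runs: int = 5) -> float:
    """Fraction of cases that receive the same label on every one of
    n_runs repeated executions of a possibly non-deterministic model."""
    stable = 0
    for case in cases:
        labels = {classify(case) for _ in range(n_runs)}
        if len(labels) == 1:  # identical label on every run
            stable += 1
    return stable / len(cases)
```

A rate well below 1.0 signals that single-run performance estimates for the model are themselves unstable and should be averaged over repeated executions.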
3.4 Benchmarks and Comparisons
Comparison to relevant benchmark methods provides essential context. When possible, use publicly available benchmark test sets to enable standardized comparisons across studies.
3.5 Integration into Human Workflows
Many AI systems are designed for intelligence augmentation—supporting human decision-making rather than replacing it. Evaluation should consider:
- Team-level outcomes (efficiency, accuracy)
- Decision concordance (alignment between AI and human judgment)
- Override rates (how often humans disagree with AI)
- Usability and trust (user experience, cognitive ergonomics)
- Workflow integration (how outputs are presented and acted upon)
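Several of these quantities reduce to simple counts once AI outputs and human decisions are logged side by side. A minimal sketch (the pair format is a hypothetical logging convention, not from the paper):

```python
def workflow_metrics(decisions):
    """Decision concordance and override rate from logged
    (ai_label, human_label) pairs in a human-in-the-loop workflow."""
    agree = sum(1 for ai, human in decisions if ai == human)
    concordance = agree / len(decisions)
    return {"concordance": concordance, "override_rate": 1 - concordance}
```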
4. Structured Case-Level Examination (SCLE)
A key contribution of the paper is the proposal of Structured Case-Level Examination (SCLE) as a complement to statistical performance evaluation.
4.1 What is SCLE?
SCLE involves systematic review of representative examples of an AI model’s classifications to understand its strengths and limitations. It extends and formalizes a practice that the authors have applied informally in their previous work.
4.2 What to Examine
The paper recommends examining a stratified random sample of:
| Category | Purpose |
|---|---|
| False positives | Understand what causes the model to make errors; assess whether errors are acceptable |
| False negatives | Understand what events are being missed; identify patterns of failure |
| True positives | Assess whether correctly identified events are non-trivial and meaningful |
| Disagreement cases (when a benchmark exists) | Understand the nature of differences between models |
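The stratified sample above can be drawn programmatically. A sketch under assumptions: cases arrive as `(case_id, truth, prediction)` tuples, and the per-stratum sample size is arbitrary (disagreement cases, which require a benchmark, are omitted):

```python
import random

def scle_sample(cases, per_stratum: int = 25, seed: int = 7):
    """Stratified random sample of false positives, false negatives,
    and true positives for structured case-level examination."""
    rng = random.Random(seed)
    strata = {"FP": [], "FN": [], "TP": []}
    for case_id, truth, pred in cases:
        if pred and not truth:
            strata["FP"].append(case_id)
        elif truth and not pred:
            strata["FN"].append(case_id)
        elif truth and pred:
            strata["TP"].append(case_id)
    return {name: rng.sample(ids, min(per_stratum, len(ids)))
            for name, ids in strata.items()}
```

When a stratum is small (as false negatives often are for rare events), the whole stratum is reviewed rather than a sample, which matches the exhaustive false-negative review in case study 3.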
4.3 Diagnostic Tags
Reviewers can assign diagnostic tags to each case to identify patterns:
| Tag | Description |
|---|---|
| Never event | Errors that would be unacceptable or severely undermine trust |
| Unexpected error | Points to opportunities for improvement |
| Input data issue | Data quality issues or information not available to the AI |
| Test set issue | Incorrect or ambiguous labels in the reference standard |
| Triviality | Whether the correct classification was trivial or genuinely challenging |
4.4 Benefits of SCLE
- Contextualizes statistical metrics by showing real examples
- Identifies patterns that aggregate metrics may hide
- Reveals issues with test sets or reference standards
- Builds trust by demonstrating both strengths and limitations
- Guides improvements through targeted interventions
5. Three Pharmacovigilance Case Studies
The paper illustrates its framework using three real-world pharmacovigilance applications.
5.1 Case Study 1: Rule-Based Retrieval of Pregnancy Reports
Method: An expert-defined rule-based algorithm to identify adverse event reports involving pregnancy exposure in VigiBase, the WHO global database of adverse event reports.
Test Set Approach:
- Annotated a random sample of 7,874 reports (restricted by age and sex to reduce negatives)
- Downsampled negatives but did not actively enrich for positives
- Demonstrated that age/sex restriction did not exclude hard positives
Performance Evaluation:
- Recall: Evaluated on known positive controls
- Precision: Evaluated by annotating all predicted positives from a random sample
- Structured Case-Level Examination: Revealed that preprocessing errors (incomplete mapping to MedDRA) were the most frequent source of false negatives
Key Insight: The algorithm could not process pregnancy information confined to free-text fields—a limitation that statistical metrics alone would not reveal.
5.2 Case Study 2: Duplicate Detection with SVM and Statistical Record Linkage
Method: A machine learning model combining a support vector machine (SVM) with statistical record linkage to identify duplicate adverse event reports.
Test Set Approach:
- Used known duplicates identified by national regulators for recall evaluation
- Used model-specific precision tests with ~100 predicted positives for each model
- Ensured overlap between test sets for different models to reduce annotation burden
Performance Evaluation:
- Recall: Evaluated on known duplicates (acknowledging that some true duplicates may be missed by human operators)
- Precision: Estimated using model-specific test sets
- Robustness: Assessed performance for individual countries to identify geographic variation
- Structured Case-Level Examination: Revealed that during early development, models had unacceptably low precision in real-world settings due to low prevalence of negatives in training data
Key Insight: Careful reuse of random report pair sequences ensured maximal overlap between model-specific precision tests, reducing annotation burden while enabling fair comparison.
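The reuse described above can be sketched as follows: every model scans the same fixed random ordering of report pairs, and each model's precision test set is the first ~100 pairs it flags, so the sets overlap wherever the models agree. Pair generation and the `predict_duplicate` callable here are illustrative assumptions, not the paper's implementation:

```python
import itertools
import random

def shared_pair_sequence(report_ids, seed: int = 2024):
    """One fixed random ordering of report pairs, reused across models."""
    pairs = list(itertools.combinations(sorted(report_ids), 2))
    random.Random(seed).shuffle(pairs)
    return pairs

def precision_test_set(pairs, predict_duplicate, n_target: int = 100):
    """First n_target pairs (in the shared order) that a model flags as
    duplicates; these are then annotated manually to estimate precision."""
    flagged = (pair for pair in pairs if predict_duplicate(pair))
    return list(itertools.islice(flagged, n_target))
```

Because the ordering is fixed, two models that agree on most predictions share most of their annotated pairs, so each annotation serves multiple evaluations.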
5.3 Case Study 3: Redaction of Person Names with Fine-Tuned LLM
Method: A BERT-based model fine-tuned to identify person names in case narratives for automated redaction.
Test Set Approach:
- Annotated a random sample of 5,042 case narratives
- Identified 179 name tokens in 71 narratives (no enrichment)
- Edge cases were classified conservatively (labeled as names when in doubt) so that recall estimates err on the low side
Performance Evaluation:
- Recall: 71% (conservative estimate)
- Precision: 55% (due to low prevalence; specificity was 99.95%)
- Structured Case-Level Examination: Systematic review of all false negatives, with each affected narrative classified as directly identifiable, indirectly identifiable, or non-identifiable to determine re-identification risk
- Fairness Analysis: The only missed full name was of Indian origin, prompting a follow-up analysis to ensure the model was not biased against names of particular origins
- Data manipulation experiments: Showed the model could redact the same name when inserted in other narratives, indicating the failure was due to specific narrative-name interaction
Key Insight: Despite 99.95% specificity, precision was only 55% due to low prevalence—but the structured case-level examination revealed that many false positives were not clinically meaningful (e.g., initials “AF” that could be atrial fibrillation), and no true full names were missed except one with a specific narrative interaction.
6. Critical Appraisal Considerations: A Practical Checklist
The paper provides a set of considerations to guide procurement or development of AI models for rare-event recognition. These are framed as descriptive prompts rather than binary criteria.
6.1 Test Sets
- Describe the nature, scope, size, and composition of test sets
- Explain how characteristics align with intended use
- Describe selection, number, and characteristics of positive and negative controls
6.2 Bias
- Describe known or suspected sources of bias
- Describe how performance was assessed across varying conditions and data subsets
- Identify possibly under-served groups
6.3 Annotation Process
- Describe criteria for defining positive and negative controls
- Describe measures to ensure annotation quality and consistency
- Explain how edge cases and ambiguous instances are handled
6.4 Choice of Metrics
- Describe performance metrics and their relevance to intended use
- Explain which aspects of performance they capture (false positives, false negatives, stability)
- Acknowledge which aspects may not be fully captured
6.5 Decision Thresholds
- Describe thresholds considered and their relationship to intended use
- Explain assumptions about relative costs of different error types
6.6 Evaluation of Recall
- Describe how test sets reflect the spectrum of positive controls
- Describe any enrichment and how it is accounted for
6.7 Evaluation of Precision
- Describe prevalence of positives in test sets and how it relates to expected prevalence
- Describe any enrichment and how it is accounted for
6.8 Benchmarks
- Describe comparisons to relevant benchmark methods
- Describe implementation and optimization of benchmarks
- Describe availability of benchmark test sets
6.9 Performance Drift
- Describe measures to identify, monitor, and respond to data, model, or performance drift
- For models with third-party components, describe sensitivity to updates or version changes
6.10 Types of Errors
- Describe nature and patterns of false positives and false negatives
- Explain how error types relate to intended use
- Identify concerns regarding validity, fairness, or downstream consequences
6.11 Human-AI Interaction
- Describe intended human-AI interaction (how outputs are presented, processed, and acted upon)
- Explain how this interaction was accounted for in performance evaluation
7. How This Framework Benefits Pharmacovigilance
7.1 Smarter Resource Allocation
By understanding the true performance of AI models, organizations can make informed decisions about where to invest. Models that appear impressive on aggregate metrics may fail in practice, wasting resources on implementation and review.
7.2 Better Signal Detection
For applications like duplicate detection and pregnancy identification, understanding recall and precision helps organizations assess:
- Recall: Are we missing important events?
- Precision: Is the time spent reviewing false positives worth the benefit?
7.3 Fair and Equitable AI
Subgroup analyses and fairness assessments ensure that AI models do not systematically underperform for specific populations, which is critical in global pharmacovigilance where data comes from diverse sources.
7.4 Trust and Adoption
Structured case-level examination builds trust by showing real examples of both successes and failures. When end users understand what the model does well and where it struggles, they can use it more effectively.
7.5 Regulatory Alignment
The framework aligns with emerging regulatory expectations for AI in healthcare, including requirements for transparency, robustness, and ongoing monitoring.
8. Future Directions
The paper highlights two developments that deserve special attention:
8.1 Using AI to Evaluate AI
There is growing interest in using generative LLMs as “judges” to annotate test sets or review output from simpler AI models. This could reduce the resource burden of evaluation but requires careful validation and human calibration.
8.2 Evolving Human-AI Interaction
As AI systems become more sophisticated, workflows may evolve from simple flagging to fluid back-and-forth exchange between humans and AI. This will require new evaluation paradigms that assess real-world decision-making by human-AI teams, accounting for user experience, cognitive ergonomics, and decision efficiency.
9. Conclusion
The UMC’s framework for critical appraisal of AI in rare-event recognition is an essential resource for pharmacovigilance professionals navigating the rapidly evolving AI landscape.
Key Takeaways:
- Rare events are different. Standard performance metrics can be highly misleading when event prevalence is low.
- Test set design matters. The nature of the test set must align with the intended use case, and enrichment strategies must be accounted for in interpretation.
- Precision is prevalence-dependent. Precision estimates from enriched test sets will be optimistic and not reflect real-world performance.
- Structured case-level examination is essential. Statistical metrics alone cannot capture the nuances of AI performance; reviewing real examples is critical.
- Robustness and fairness must be assessed. Performance can vary across subgroups, and this variation must be understood and addressed.
- Human-AI interaction matters. AI systems are often designed to augment human decision-making, and evaluation must account for team-level outcomes.
- Transparency is key. Clear documentation of test sets, annotation processes, and performance metrics supports trust and enables meaningful comparison.
As the authors conclude: “What remains constant is the need for appraisal that balances efficiency with rigor, enabling organizations to harness the benefits of AI for rare-event recognition while safeguarding validity, robustness, and fairness.”
For pharmacovigilance professionals, this framework provides a practical guide for making informed decisions about AI adoption—ensuring that new technologies truly enhance patient safety rather than creating new risks.
References
- Norén GN, Meldau EL, Ellenius J. Critical Appraisal of Artificial Intelligence for Rare-Event Recognition: Principles and Pharmacovigilance Case Studies. Drug Saf. 2026. https://doi.org/10.1007/s40264-026-01649-7
- Sandberg L, Vidlin SH, K-Pápai L, et al. Uncovering pregnancy exposures in pharmacovigilance case report databases: a comprehensive evaluation of the VigiBase pregnancy algorithm. Drug Saf. 2025;48(10):1103-1118.
- Barrett JW, Erlanson N, China JF, Norén GN. A scalable predictive modelling approach to identifying duplicate adverse event reports for drugs and vaccines. arXiv preprint arXiv:2504.03729; 2025.
- Meldau EL, Bista S, Melgarejo-González C, Norén GN. Automated redaction of names in adverse event reports using transformer-based neural networks. BMC Med Inform Decis Mak. 2024;24(1):401.
- Council for International Organizations of Medical Sciences (CIOMS). Artificial intelligence in pharmacovigilance – Report of the CIOMS Working Group XIV. Geneva: CIOMS; 2025.