The rapid evolution of artificial intelligence (AI) systems offers tremendous potential for pharmacovigilance, particularly in recognizing rare but clinically significant events. However, the decreasing cost and effort required to develop AI models create a risk that organizations may invest insufficient time in understanding their limitations and sources of error. This comprehensive review summarizes a landmark paper from the Uppsala Monitoring Centre (UMC) that outlines a framework for critically appraising AI models for rare-event recognition.
The paper introduces key dimensions for evaluation, including problem framing, test set design, prevalence-aware statistical metrics, robustness assessment, and integration into human workflows. It also proposes a novel approach called Structured Case-Level Examination (SCLE) to complement statistical performance evaluation and provides practical considerations for procuring or developing AI models in this space. Drawing on three pharmacovigilance case studies—rule-based retrieval of pregnancy-related reports, machine learning-based duplicate detection, and automated redaction of person names using large language models—the paper demonstrates how these principles can be applied in practice.
1. Introduction: The Promise and Peril of AI in Pharmacovigilance
Artificial intelligence (AI) is transforming pharmacovigilance. From automated case processing to signal detection and duplicate identification, AI systems promise to enhance efficiency, improve quality, and enable entirely new capabilities. However, the rapid proliferation of AI tools—coupled with decreasing development costs—creates a critical challenge: how do we know which AI systems actually work?
As the authors of this landmark paper from the Uppsala Monitoring Centre (UMC) note, "[a]s the effort and expertise required to develop modern AI decrease, there is a risk that organizations devote too little time to understanding their limitations and sources of error." This is particularly problematic in pharmacovigilance, where errors can have serious consequences for patient safety.
The paper focuses on rare-event recognition—a common challenge in pharmacovigilance where the events of interest (e.g., specific adverse drug reactions, duplicates, pregnancy exposures) occur infrequently. In such settings, traditional performance metrics can be highly misleading, and careful evaluation is essential.
2. The Unique Challenges of Rare-Event Recognition
2.1 Why Rare Events Are Different
In pharmacovigilance, many tasks involve recognizing rare events:
| Task | Event of Interest | Prevalence |
|---|---|---|
| Signal detection | New or emerging safety signals | Very low |
| Duplicate detection | True duplicate reports | Low |
| Pregnancy identification | Reports involving pregnancy exposure | Low |
| Name redaction | Person names in narratives | Very low |
When events are rare, traditional performance metrics can be deceptive:
- Apparent accuracy can be high even if the model performs poorly on the events that matter
- Test sets may not contain enough positive examples for reliable evaluation
- Precision estimates are highly sensitive to the prevalence of positives in the test set
- Specificity must be extremely high to achieve acceptable precision
2.2 The Asymmetric Cost of Errors
In rare-event recognition, the cost of different types of errors is often asymmetric:
| Error Type | Cost in Pharmacovigilance |
|---|---|
| False positive (flagging a non-event) | Wasted reviewer time, alert fatigue, potential bias in downstream analyses |
| False negative (missing a real event) | Missed safety signals, delayed action, patient harm |
Understanding these costs is essential for selecting appropriate decision thresholds and performance metrics.
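The trade-off can be made concrete by comparing operating points under explicit error costs. A minimal sketch, where all rates and per-error costs are illustrative assumptions rather than figures from the paper:

```python
def expected_error_cost(recall: float, specificity: float, prevalence: float,
                        cost_fn: float, cost_fp: float,
                        n_cases: int = 10_000) -> float:
    """Expected total cost of errors over n_cases, given asymmetric
    per-error costs (illustrative values, not from the paper)."""
    false_negatives = (1 - recall) * prevalence * n_cases
    false_positives = (1 - specificity) * (1 - prevalence) * n_cases
    return cost_fn * false_negatives + cost_fp * false_positives

# When missing an event costs far more than reviewing a false alarm,
# a high-recall operating point can be cheaper overall despite more flags:
lenient = expected_error_cost(0.95, 0.980, 0.01, cost_fn=100.0, cost_fp=1.0)
strict = expected_error_cost(0.70, 0.999, 0.01, cost_fn=100.0, cost_fp=1.0)
```

Under these assumed costs, the lenient threshold wins even though it produces roughly twenty times more false positives, because false negatives dominate the total.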
3. Key Dimensions for Appraising AI Models
The paper outlines several key dimensions for critical appraisal of AI models for rare-event recognition.
3.1 Problem Framing and Test Set Design
Key Questions:
- What is the intended use case? (Efficiency, quality, or capability?)
- Does the test set reflect the deployment domain?
- Are positive and negative controls clearly defined?
- Are ambiguous cases handled appropriately?
Key Insight: The nature of the test set should align with the desired deployment domain. If an AI model is intended for broad use across collections of adverse event reports, the test set should include reports from diverse sources, drugs, and events.
3.2 Prevalence-Aware Statistical Evaluation
The paper emphasizes that standard performance metrics must be interpreted with care in rare-event settings.
3.2.1 Recall (Sensitivity)
Recall = True Positives / (True Positives + False Negatives)
Recall measures how many of the events of interest are correctly identified. It requires a representative set of positive controls. If test sets are enriched with positive controls (to increase their number), recall may be overestimated if hard-to-recognize positives are also less likely to be included.
3.2.2 Precision (Positive Predictive Value)
Precision = True Positives / (True Positives + False Positives)
Precision measures how many of the predicted positives are actually correct. It is highly dependent on the prevalence of positives in the test set. If test sets are enriched with positives, precision estimates will be optimistic.
Critical Warning: The expected precision of random guessing equals the prevalence of positive controls. For a balanced test set (50% positives), random guessing gives 50% precision—which may appear impressive but is worthless in practice.
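This pitfall is easy to demonstrate with a small simulation (a sketch; the coin-flip "model" and sample size are illustrative):

```python
import random

random.seed(0)

def random_guess_precision(prevalence: float, n: int = 100_000) -> float:
    """Simulate a classifier that flags cases at random and return its
    precision, which converges to the prevalence of positives."""
    labels = [random.random() < prevalence for _ in range(n)]
    flagged = [random.random() < 0.5 for _ in range(n)]  # coin-flip "model"
    tp = sum(1 for y, f in zip(labels, flagged) if y and f)
    fp = sum(1 for y, f in zip(labels, flagged) if not y and f)
    return tp / (tp + fp)

# On a balanced test set, random guessing already "achieves" ~50% precision;
# at 1% prevalence the very same guessing yields only ~1% precision.
print(random_guess_precision(0.5))
print(random_guess_precision(0.01))
```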
3.2.3 Specificity (True Negative Rate)
Specificity = True Negatives / (True Negatives + False Positives)
Specificity measures how many negatives are correctly classified. It is independent of prevalence, making it useful for comparing models. However, for rare events, specificity must be extremely close to 1 to achieve acceptable precision, requiring very large test sets.
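The interplay between specificity and prevalence follows from the standard identity precision = (recall × prevalence) / (recall × prevalence + (1 − specificity) × (1 − prevalence)). A sketch with illustrative numbers (not drawn from the paper):

```python
def expected_precision(recall: float, specificity: float,
                       prevalence: float) -> float:
    """Expected precision implied by recall, specificity, and the
    prevalence of positives in the deployment population."""
    tp_rate = recall * prevalence
    fp_rate = (1 - specificity) * (1 - prevalence)
    return tp_rate / (tp_rate + fp_rate)

# With 90% recall and 99% specificity, precision collapses as events
# become rarer, dropping below 10% at a prevalence of 0.1%:
for p in (0.5, 0.05, 0.001):
    print(p, expected_precision(0.90, 0.99, p))
```

This is why, for rare events, specificity must sit extremely close to 1 before precision becomes acceptable.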
3.2.4 Composite Metrics
Metrics like F1-score (harmonic mean of precision and recall) inherit the limitations of their components. Receiver Operating Characteristic (ROC) curves and area under the curve (AUC) are dominated by portions of the decision curve that are irrelevant for rare events.
Recommendation: Focus on recall and precision, using model-specific precision test sets drawn from the actual deployment population.
3.3 Robustness Assessment
Robustness refers to the stability and reliability of AI model performance under varying conditions.
| Aspect | Considerations |
|---|---|
| Stability | For non-deterministic models (e.g., generative LLMs), assess consistency across repeated executions |
| Data drift | Monitor performance over time as input data changes |
| Subgroup performance | Evaluate whether performance varies across different data subsets (e.g., by country, source, demographics) |
| Fairness | Ensure the model does not under-serve or bias against specific groups |
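For the stability aspect, repeated executions can be summarized as a simple agreement rate. A minimal sketch, where `classify` stands in for any (possibly non-deterministic) model call and is an assumption of this example:

```python
def stability_rate(classify, cases, n_runs: int = 5) -> float:
    """Fraction of cases that receive the same label on every one of
    n_runs repeated executions of a possibly non-deterministic model."""
    stable = 0
    for case in cases:
        labels = {classify(case) for _ in range(n_runs)}
        if len(labels) == 1:  # identical label on every run
            stable += 1
    return stable / len(cases)
```

A rate well below 1.0 signals that single-run performance estimates for the model are themselves unstable and should be averaged over repeated executions.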
3.4 Benchmarks and Comparisons
Comparison to relevant benchmark methods provides essential context. When possible, use publicly available benchmark test sets to enable standardized comparisons across studies.
3.5 Integration into Human Workflows
Many AI systems are designed for intelligence augmentation—supporting human decision-making rather than replacing it. Evaluation should consider:
- Team-level outcomes (efficiency, accuracy)
- Decision concordance (alignment between AI and human judgment)
- Override rates (how often humans disagree with AI)
- Usability and trust (user experience, cognitive ergonomics)
- Workflow integration (how outputs are presented and acted upon)
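Several of these quantities reduce to simple counts once AI outputs and human decisions are logged side by side. A minimal sketch (the pair format is a hypothetical logging convention, not from the paper):

```python
def workflow_metrics(decisions):
    """Decision concordance and override rate from logged
    (ai_label, human_label) pairs in a human-in-the-loop workflow."""
    agree = sum(1 for ai, human in decisions if ai == human)
    concordance = agree / len(decisions)
    return {"concordance": concordance, "override_rate": 1 - concordance}
```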
4. Structured Case-Level Examination (SCLE)
A key contribution of the paper is the proposal of Structured Case-Level Examination (SCLE) as a complement to statistical performance evaluation.
4.1 What is SCLE?
SCLE involves systematic review of representative examples of an AI model’s classifications to understand its strengths and limitations. It extends and formalizes a practice that the authors have applied informally in their previous work.
4.2 What to Examine
The paper recommends examining a stratified random sample of:
| Category | Purpose |
|---|---|
| False positives | Understand what causes the model to make errors; assess whether errors are acceptable |
| False negatives | Understand what events are being missed; identify patterns of failure |
| True positives | Assess whether correctly identified events are non-trivial and meaningful |
| Disagreement cases (when a benchmark exists) | Understand the nature of differences between models |
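The stratified sample above can be drawn programmatically. A sketch under assumptions: cases arrive as `(case_id, truth, prediction)` tuples, and the per-stratum sample size is arbitrary (disagreement cases, which require a benchmark, are omitted):

```python
import random

def scle_sample(cases, per_stratum: int = 25, seed: int = 7):
    """Stratified random sample of false positives, false negatives,
    and true positives for structured case-level examination."""
    rng = random.Random(seed)
    strata = {"FP": [], "FN": [], "TP": []}
    for case_id, truth, pred in cases:
        if pred and not truth:
            strata["FP"].append(case_id)
        elif truth and not pred:
            strata["FN"].append(case_id)
        elif truth and pred:
            strata["TP"].append(case_id)
    return {name: rng.sample(ids, min(per_stratum, len(ids)))
            for name, ids in strata.items()}
```

When a stratum is small (as false negatives often are for rare events), the whole stratum is reviewed rather than a sample, which matches the exhaustive false-negative review in case study 3.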
4.3 Diagnostic Tags
Reviewers can assign diagnostic tags to each case to identify patterns:
| Tag | Description |
|---|---|
| Never event | Errors that would be unacceptable or severely undermine trust |
| Unexpected error | Points to opportunities for improvement |
| Input data issue | Data quality issues or information not available to the AI |
| Test set issue | Incorrect or ambiguous labels in the reference standard |
| Triviality | Whether the correct classification was trivial or genuinely challenging |
4.4 Benefits of SCLE
- Contextualizes statistical metrics by showing real examples
- Identifies patterns that aggregate metrics may hide
- Reveals issues with test sets or reference standards
- Builds trust by demonstrating both strengths and limitations
- Guides improvements through targeted interventions
5. Three Pharmacovigilance Case Studies
The paper illustrates its framework using three real-world pharmacovigilance applications.
5.1 Case Study 1: Rule-Based Retrieval of Pregnancy Reports
Method: An expert-defined rule-based algorithm to identify adverse event reports involving pregnancy exposure in VigiBase, the WHO global database of adverse event reports.
Test Set Approach:
- Annotated a random sample of 7,874 reports (restricted by age and sex to reduce negatives)
- Downsampled negatives but did not actively enrich for positives
- Demonstrated that age/sex restriction did not exclude hard positives
Performance Evaluation:
- Recall: Evaluated on known positive controls
- Precision: Evaluated by annotating all predicted positives from a random sample
- Structured Case-Level Examination: Revealed that preprocessing errors (incomplete mapping to MedDRA) were the most frequent source of false negatives
Key Insight: The algorithm could not process pregnancy information confined to free-text fields—a limitation that statistical metrics alone would not reveal.
5.2 Case Study 2: Duplicate Detection with SVM and Statistical Record Linkage
Method: A machine learning model combining a support vector machine (SVM) with statistical record linkage to identify duplicate adverse event reports.
Test Set Approach:
- Used known duplicates identified by national regulators for recall evaluation
- Used model-specific precision tests with ~100 predicted positives for each model
- Ensured overlap between test sets for different models to reduce annotation burden
Performance Evaluation:
- Recall: Evaluated on known duplicates (acknowledging that some true duplicates may be missed by human operators)
- Precision: Estimated using model-specific test sets
- Robustness: Assessed performance for individual countries to identify geographic variation
- Structured Case-Level Examination: Revealed that during early development, models had unacceptably low precision in real-world settings due to low prevalence of negatives in training data
Key Insight: Careful reuse of random report pair sequences ensured maximal overlap between model-specific precision tests, reducing annotation burden while enabling fair comparison.
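The reuse described above can be sketched as follows: every model scans the same fixed random ordering of report pairs, and each model's precision test set is the first ~100 pairs it flags, so the sets overlap wherever the models agree. Pair generation and the `predict_duplicate` callable here are illustrative assumptions, not the paper's implementation:

```python
import itertools
import random

def shared_pair_sequence(report_ids, seed: int = 2024):
    """One fixed random ordering of report pairs, reused across models."""
    pairs = list(itertools.combinations(sorted(report_ids), 2))
    random.Random(seed).shuffle(pairs)
    return pairs

def precision_test_set(pairs, predict_duplicate, n_target: int = 100):
    """First n_target pairs (in the shared order) that a model flags as
    duplicates; these are then annotated manually to estimate precision."""
    flagged = (pair for pair in pairs if predict_duplicate(pair))
    return list(itertools.islice(flagged, n_target))
```

Because the ordering is fixed, two models that agree on most predictions share most of their annotated pairs, so each annotation serves multiple evaluations.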
5.3 Case Study 3: Redaction of Person Names with Fine-Tuned LLM
Method: A BERT-based model fine-tuned to identify person names in case narratives for automated redaction.
Test Set Approach:
- Annotated a random sample of 5,042 case narratives
- Identified 179 name tokens in 71 narratives (no enrichment)
- Edge cases were classified conservatively (labeled as names when in doubt) so that recall estimates err on the low side
Performance Evaluation:
- Recall: 71% (conservative estimate)
- Precision: 55% (due to low prevalence; specificity was 99.95%)
- Structured Case-Level Examination: Systematic review of all false negatives, with each affected narrative classified as directly identifiable, indirectly identifiable, or non-identifiable to determine re-identification risk
- Fairness Analysis: The only missed full name was of Indian origin, prompting a follow-up analysis to ensure the model was not biased against names of particular origins
- Data manipulation experiments: Showed the model could redact the same name when inserted in other narratives, indicating the failure was due to specific narrative-name interaction
Key Insight: Despite 99.95% specificity, precision was only 55% due to low prevalence—but the structured case-level examination revealed that many false positives were not clinically meaningful (e.g., initials “AF” that could be atrial fibrillation), and no true full names were missed except one with a specific narrative interaction.
6. Critical Appraisal Considerations: A Practical Checklist
The paper provides a set of considerations to guide procurement or development of AI models for rare-event recognition. These are framed as descriptive prompts rather than binary criteria.
6.1 Test Sets
- Describe the nature, scope, size, and composition of test sets
- Explain how characteristics align with intended use
- Describe selection, number, and characteristics of positive and negative controls
6.2 Bias
- Describe known or suspected sources of bias
- Describe how performance was assessed across varying conditions and data subsets
- Identify possibly under-served groups
6.3 Annotation Process
- Describe criteria for defining positive and negative controls
- Describe measures to ensure annotation quality and consistency
- Explain how edge cases and ambiguous instances are handled
6.4 Choice of Metrics
- Describe performance metrics and their relevance to intended use
- Explain which aspects of performance they capture (false positives, false negatives, stability)
- Acknowledge which aspects may not be fully captured
6.5 Decision Thresholds
- Describe thresholds considered and their relationship to intended use
- Explain assumptions about relative costs of different error types
6.6 Evaluation of Recall
- Describe how test sets reflect the spectrum of positive controls
- Describe any enrichment and how it is accounted for
6.7 Evaluation of Precision
- Describe prevalence of positives in test sets and how it relates to expected prevalence
- Describe any enrichment and how it is accounted for
6.8 Benchmarks
- Describe comparisons to relevant benchmark methods
- Describe implementation and optimization of benchmarks
- Describe availability of benchmark test sets
6.9 Performance Drift
- Describe measures to identify, monitor, and respond to data, model, or performance drift
- For models with third-party components, describe sensitivity to updates or version changes
6.10 Types of Errors
- Describe nature and patterns of false positives and false negatives
- Explain how error types relate to intended use
- Identify concerns regarding validity, fairness, or downstream consequences
6.11 Human-AI Interaction
- Describe intended human-AI interaction (how outputs are presented, processed, and acted upon)
- Explain how this interaction was accounted for in performance evaluation
7. How This Framework Benefits Pharmacovigilance
7.1 Smarter Resource Allocation
By understanding the true performance of AI models, organizations can make informed decisions about where to invest. Models that appear impressive on aggregate metrics may fail in practice, wasting resources on implementation and review.
7.2 Better Signal Detection
For applications like duplicate detection and pregnancy identification, understanding recall and precision helps organizations assess:
- Recall: Are we missing important events?
- Precision: Is the time spent reviewing false positives worth the benefit?
7.3 Fair and Equitable AI
Subgroup analyses and fairness assessments ensure that AI models do not systematically underperform for specific populations, which is critical in global pharmacovigilance where data comes from diverse sources.
7.4 Trust and Adoption
Structured case-level examination builds trust by showing real examples of both successes and failures. When end users understand what the model does well and where it struggles, they can use it more effectively.
7.5 Regulatory Alignment
The framework aligns with emerging regulatory expectations for AI in healthcare, including requirements for transparency, robustness, and ongoing monitoring.
8. Future Directions
The paper highlights two developments that deserve special attention:
8.1 Using AI to Evaluate AI
There is growing interest in using generative LLMs as “judges” to annotate test sets or review output from simpler AI models. This could reduce the resource burden of evaluation but requires careful validation and human calibration.
8.2 Evolving Human-AI Interaction
As AI systems become more sophisticated, workflows may evolve from simple flagging to fluid back-and-forth exchange between humans and AI. This will require new evaluation paradigms that assess real-world decision-making by human-AI teams, accounting for user experience, cognitive ergonomics, and decision efficiency.
9. Conclusion
The UMC’s framework for critical appraisal of AI in rare-event recognition is an essential resource for pharmacovigilance professionals navigating the rapidly evolving AI landscape.
Key Takeaways:
- Rare events are different. Standard performance metrics can be highly misleading when event prevalence is low.
- Test set design matters. The nature of the test set must align with the intended use case, and enrichment strategies must be accounted for in interpretation.
- Precision is prevalence-dependent. Precision estimates from enriched test sets will be optimistic and not reflect real-world performance.
- Structured case-level examination is essential. Statistical metrics alone cannot capture the nuances of AI performance; reviewing real examples is critical.
- Robustness and fairness must be assessed. Performance can vary across subgroups, and this variation must be understood and addressed.
- Human-AI interaction matters. AI systems are often designed to augment human decision-making, and evaluation must account for team-level outcomes.
- Transparency is key. Clear documentation of test sets, annotation processes, and performance metrics supports trust and enables meaningful comparison.
As the authors conclude: “What remains constant is the need for appraisal that balances efficiency with rigor, enabling organizations to harness the benefits of AI for rare-event recognition while safeguarding validity, robustness, and fairness.”
For pharmacovigilance professionals, this framework provides a practical guide for making informed decisions about AI adoption—ensuring that new technologies truly enhance patient safety rather than creating new risks.
References
- Norén GN, Meldau EL, Ellenius J. Critical Appraisal of Artificial Intelligence for Rare-Event Recognition: Principles and Pharmacovigilance Case Studies. Drug Saf. 2026. https://doi.org/10.1007/s40264-026-01649-7
- Sandberg L, Vidlin SH, K-Pápai L, et al. Uncovering pregnancy exposures in pharmacovigilance case report databases: a comprehensive evaluation of the VigiBase pregnancy algorithm. Drug Saf. 2025;48(10):1103-1118.
- Barrett JW, Erlanson N, China JF, Norén GN. A scalable predictive modelling approach to identifying duplicate adverse event reports for drugs and vaccines. arXiv preprint arXiv:2504.03729; 2025.
- Meldau EL, Bista S, Melgarejo-González C, Norén GN. Automated redaction of names in adverse event reports using transformer-based neural networks. BMC Med Inform Decis Mak. 2024;24(1):401.
- Council for International Organizations of Medical Sciences (CIOMS). Artificial intelligence in pharmacovigilance – Report of the CIOMS Working Group XIV. Geneva: CIOMS; 2025.