Clinical Relevance
Where this fits in a pathology lab workflow
Today, pathologists review every region of every whole-slide image manually. This system functions as a first-pass triage filter, not a replacement for human judgment.
Digital pathology labs are now scanning whole-slide images at scale — a single sentinel lymph node biopsy produces thousands of 96×96 tissue patches requiring review. At busy labs processing 500+ slides per day, this creates a throughput bottleneck. Manual patch-by-patch review is time-intensive, subject to fatigue-related variability, and creates delays in cancer workup.
This system operates as a Computer-Aided Detection (CADe) first-pass screener: it assigns a cancer probability to every patch in seconds, automatically deprioritizes high-confidence benign patches, and surfaces high-confidence malignant patches for immediate pathologist attention. The pathologist reviews all flagged cases — the AI handles triage, not diagnosis.
Validated Performance
Test set results — 32,998 held-out patches
Trained on 153,988 patches. Validation used for early stopping only. Test set never seen during training.
In clinical terms: AUC-ROC 0.921 means the model correctly ranks a cancerous patch above a non-cancerous patch 92.1% of the time when given one of each. Sensitivity of 82.2% means the system detects 822 of every 1,000 truly cancerous patches. For a screening system deployed as a triage layer — where missed cases still reach pathologist review through the gray zone — this discrimination performance is operationally meaningful.
Why AUC matters more than accuracy here: The test set is 59% non-cancerous / 41% cancerous. A model predicting "non-cancerous" for all patches would achieve 59% accuracy — meaningless. AUC-ROC measures the model's ability to rank cancerous above non-cancerous regardless of threshold, which is what matters for a triage system where the operating threshold is a clinical policy decision.
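As a concrete illustration, the sketch below computes AUC-ROC as a pairwise ranking probability and contrasts it with the majority-class accuracy baseline. The `y_true`/`y_prob` arrays here are synthetic stand-ins for real test predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins: y_true in {0, 1}, y_prob = sigmoid output per patch.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0.0, 1.0)

# AUC-ROC as a ranking probability: P(score of a cancerous patch >
# score of a non-cancerous patch), with ties counted as half.
pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
pairs = pos[:, None] - neg[None, :]
auc_by_ranking = np.mean(pairs > 0) + 0.5 * np.mean(pairs == 0)

print(auc_by_ranking, roc_auc_score(y_true, y_prob))  # the two values agree

# The trivial "all non-cancerous" model scores the majority-class rate.
print("baseline accuracy:", np.mean(y_true == 0))
```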
Subgroup analysis — performance by confidence quartile
Test set stratified by model confidence (output probability p). The quartile breakdown reveals how performance scales with model certainty — a critical signal for setting auto-clear thresholds.
| Confidence Quartile | Mean p | N Patches | Accuracy | F1 | AUC-ROC | Miss Rate |
|---|---|---|---|---|---|---|
| Q1 — Most uncertain | 0.245 | 8,250 | 64.8% | 0.545 | 0.672 | 42.5% |
| Q2 | 0.633 | 8,249 | 84.6% | 0.775 | 0.848 | 23.2% |
| Q3 | 0.849 | 8,249 | 93.3% | 0.895 | 0.929 | 13.1% |
| Q4 — Most confident | 0.966 | 8,250 | 98.3% | 0.986 | 0.989 | 1.8% |
Q4 accuracy of 98.3% at miss rate 1.8% validates the auto-clear concept at the high-confidence end. Q1 (most uncertain) shows 42.5% miss rate — confirming that mandatory human review is necessary for borderline cases. The confidence score is not just a performance metric; it is a genuine signal of model uncertainty that directly informs clinical routing decisions.
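A minimal sketch of the stratification, assuming confidence is measured as distance of p from the 0.5 decision boundary; the report's exact confidence definition may differ, so per-quartile numbers from this sketch will not necessarily match the table. `y_true` and `y_prob` are the held-out labels and sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def quartile_report(y_true, y_prob):
    # Assumed confidence definition: distance from the 0.5 decision boundary.
    confidence = np.abs(y_prob - 0.5)
    order = np.argsort(confidence)        # Q1 = most uncertain ... Q4 = most confident
    for q, idx in enumerate(np.array_split(order, 4), start=1):
        yt, yp = y_true[idx], y_prob[idx]
        pred = (yp >= 0.5).astype(int)
        # Miss rate: fraction of truly cancerous patches predicted non-cancerous.
        miss = ((pred == 0) & (yt == 1)).sum() / max((yt == 1).sum(), 1)
        auc = roc_auc_score(yt, yp) if len(np.unique(yt)) > 1 else float("nan")
        print(f"Q{q}: mean_p={yp.mean():.3f}  n={len(idx)}  "
              f"acc={accuracy_score(yt, pred):.3f}  f1={f1_score(yt, pred):.3f}  "
              f"auc={auc:.3f}  miss={miss:.1%}")
```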
Clinical Decision Zones
Three operating zones — thresholds as clinical policy
The threshold setting is not a technical decision. It is a clinical and institutional risk decision that should be made collaboratively with pathologists and risk committees.
AI auto-clear — no mandatory review
At p≤0.10, the model clears 24.5% of all patches (8,088 of 32,998 in the test set) with 95.6% accuracy. Only 352 of 13,367 truly cancerous patches fall below this threshold, a 2.6% miss rate. This zone operates like a low-risk screening lane: high-confidence benign patches are deprioritised, freeing pathologist time for cases that warrant scrutiny.
Mandatory pathologist review
20.8% of patches (6,879) fall in the gray zone, where model confidence is insufficient for automated decisions. Accuracy in this zone is only 63.0%: the model genuinely cannot classify these patches reliably. The gray zone is analogous to a borderline insurance claim that requires an experienced adjuster: the AI identifies the cases that need human judgment, then steps back.
Immediate escalation — flag for priority review
At p≥0.90, the model flags cases with 97.2% precision — only 2.8% of flagged patches are false positives. These patches are escalated for immediate pathologist attention and possible biopsy planning. At the p≥0.99 threshold, precision reaches 99.7% — near-certainty — though at the cost of recall (only 16.3% of all cancers are caught at this threshold).
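The three zones reduce to a small routing function. The sketch below uses the thresholds discussed above as illustrative defaults; the function and label names are hypothetical:

```python
def route_patch(p: float, clear_at: float = 0.10, escalate_at: float = 0.90) -> str:
    """Map a patch-level cancer probability to a clinical routing decision.

    Thresholds default to the operating points analysed above; in deployment
    they are a clinical policy decision, not a code constant.
    """
    if p <= clear_at:
        return "auto_clear"        # high-confidence benign: deprioritised, still auditable
    if p >= escalate_at:
        return "escalate"          # high-confidence malignant: priority review
    return "mandatory_review"      # gray zone: human judgment required
```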
Auto-clear threshold sensitivity — choosing the operating point
| Auto-clear threshold | Patches cleared | % of test set | Cancer missed | Miss rate |
|---|---|---|---|---|
| p ≤ 0.05 | 4,353 | 13.2% | 115 | 0.86% |
| p ≤ 0.10 | 8,088 | 24.5% | 352 | 2.63% |
| p ≤ 0.15 | 10,821 | 32.8% | 596 | 4.46% |
| p ≤ 0.20 | 12,720 | 38.5% | 798 | 5.97% |
| p ≤ 0.25 | 14,224 | 43.1% | 1,020 | 7.63% |
| p ≤ 0.30 | 15,538 | 47.1% | 1,269 | 9.49% |
The auto-clear threshold should not be set by the model developer. At p≤0.10, the system clears 24.5% of patches with a 2.6% cancer miss rate — directionally consistent with published PathAI productivity gains (20%, 2023) and the Dutch lab case study showing 84% case handling time reduction (Histopathology, 2018). However, the acceptable miss rate is a clinical, ethical, and institutional decision. The system makes this decision transparent and auditable; it does not make it unilaterally.
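The sensitivity table can be regenerated from held-out predictions with a short sweep; a sketch, assuming `y_true`/`y_prob` hold the test labels and sigmoid outputs:

```python
import numpy as np

def auto_clear_sweep(y_true, y_prob,
                     thresholds=(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)):
    """Reproduce the auto-clear sensitivity table columns from predictions."""
    n_pos = int((y_true == 1).sum())
    for t in thresholds:
        cleared = y_prob <= t
        missed = int((cleared & (y_true == 1)).sum())
        print(f"p<={t:.2f}  cleared={cleared.sum():6d} ({cleared.mean():5.1%})  "
              f"cancer_missed={missed:5d}  miss_rate={missed / n_pos:5.2%}")
```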
Explainability
GradCAM spatial explanation — post-course addition
Added after course completion to address a key gap: a classifier without spatial explanation is a black box that pathologists cannot interrogate or trust.
The course project produced accurate predictions with no indication of where in the tissue patch the model was attending. This is insufficient for clinical use: a pathologist cannot act on a probability alone — they need to know whether the model is attending to genuine cellular abnormalities (nuclear enlargement, irregular chromatin, increased mitotic figures) or artefactual signals. GradCAM was implemented as a post-course addition specifically to close this gap.
GradCAM Implementation — conv5 Target
Gradient-weighted Class Activation Mapping applied to the final convolutional layer (conv5, 512 filters, 1×1 spatial). Backpropagates the cancer probability gradient to generate a spatial heatmap showing which regions of the 32×32 centre crop drove the prediction. Implemented directly from the Selvaraju et al. (ICCV 2017) algorithm, with no external CAM library. Produces overlaid heatmaps for all six prediction categories, covering correct and incorrect predictions across both the cancerous and non-cancerous classes.
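A minimal from-scratch GradCAM sketch in PyTorch, following the Selvaraju et al. recipe described above. Hook-based capture is one common implementation choice; the project's exact code may differ, and `model`, `x`, and `target_layer` are generic placeholders:

```python
import torch
import torch.nn.functional as F

def gradcam(model, x, target_layer):
    """Minimal GradCAM (Selvaraju et al., ICCV 2017) for a sigmoid classifier.

    x: (1, 3, H, W) patch tensor; target_layer: a conv module inside `model`.
    Returns an (H, W) heatmap normalised to [0, 1].
    """
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    try:
        model.eval()
        p = model(x)              # cancer probability from the sigmoid head
        model.zero_grad()
        p.sum().backward()        # gradient of the cancer score w.r.t. activations
    finally:
        h1.remove(); h2.remove()

    a, g = acts["a"], grads["g"]                  # both (1, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)    # GAP of gradients per filter
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    # NOTE: at conv5 the spatial grid is 1x1, so the upsampled map is coarse;
    # hooking an earlier block yields finer localisation.
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = cam - cam.min()
    return (cam / cam.max().clamp(min=1e-8)).squeeze().detach()
```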
Clinical Interpretation of Heatmaps
For true positive predictions (correctly identified cancer), GradCAM heatmaps show concentrated attention on the central patch region — consistent with the PCam labelling convention that cancer presence is defined by the central 32×32 pixel area. For false positive predictions, heatmaps reveal model attention to tissue artefacts or staining heterogeneity at patch boundaries — exactly the failure mode a pathologist needs to see to calibrate trust. The GradCAM output is the primary audit mechanism supporting CADe regulatory compliance: every prediction is spatially attributable.
Why GradCAM is necessary for regulatory compliance: Under FDA CADe and CAP/AMA guidance, AI tools in pathology must be auditable and reversible. A prediction without spatial attribution is not auditable — a pathologist cannot verify whether the model's reasoning aligns with histological ground truth. GradCAM provides the spatial bridge between the model's probability output and the tissue features that drove it, making every auto-clear and escalation decision reviewable.
Technical Pipeline
From raw WSI patch to clinical triage decision
Post-course rebuild: full pipeline from noise filtering through GradCAM to confidence-tiered routing.
Noise filtering — blank patch removal
All 220,025 raw patches were scanned for blank or artifact content, flagging any patch in which ≥95% of central pixels are identical. 41 corrupted patches were identified and removed before any model sees them. Final clean dataset: 219,984 patches. Production data hygiene at clinical scale, not an afterthought.
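One plausible implementation of the blank-patch check, assuming 96×96 RGB patches and the PCam central 32×32 region; the exact pixel test used in the pipeline is an assumption:

```python
import numpy as np

def is_blank(patch: np.ndarray, frac: float = 0.95) -> bool:
    """Flag blank/artifact patches where >=95% of central pixels share one value.

    patch: (96, 96, 3) uint8 RGB array. The centre-crop bounds follow the
    PCam 32x32 labelling region.
    """
    centre = patch[32:64, 32:64]
    flat = centre.reshape(-1, centre.shape[-1])        # one row per pixel
    _, counts = np.unique(flat, axis=0, return_counts=True)
    return counts.max() / len(flat) >= frac

# Usage: clean = [p for p in raw_patches if not is_blank(p)]
```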
Stratified train/val/test split
Train: 153,988 (62,382 cancerous / 91,606 non-cancerous). Val: 32,998. Test: 32,998 — held out, never seen during training or hyperparameter selection. Class ratio maintained across all splits. This split structure mirrors regulatory evidence generation requirements.
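A sketch of how the split sizes above can be produced with stratification preserved; the placeholder `X`, `y`, and random seed are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the real class ratio (~40.5% cancerous).
rng = np.random.default_rng(42)
y = (rng.random(219_984) < 0.405).astype(int)
X = np.arange(219_984)                        # stands in for the patch array

# 153,988 / 32,998 / 32,998 split, class ratio preserved in every split.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=65_996, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=32_998, stratify=y_tmp, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 153988 32998 32998
```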
5-layer custom CNN — 1.83M parameters
Five convolutional blocks (32→64→128→256→512 filters), each with Batch Normalisation, Leaky ReLU, MaxPool, and Dropout2d. Two fully-connected layers (512→512→1) with sigmoid output. Trained from scratch on 32×32 centre-cropped patches. Adam optimiser (lr=1e-3), ReduceLROnPlateau scheduler, early stopping at epoch 21 (best val F1=0.8164).
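A sketch of the architecture as described; kernel sizes, dropout rates, and padding are assumptions not stated above, chosen so the parameter count lands near the stated 1.83M:

```python
import torch
import torch.nn as nn

def block(c_in, c_out, p=0.1):
    # Conv -> BatchNorm -> LeakyReLU -> MaxPool -> Dropout2d, as described above.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Dropout2d(p),
    )

class PatchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            block(3, 32), block(32, 64), block(64, 128),
            block(128, 256), block(256, 512),   # 32x32 -> 1x1 after five pools
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 512), nn.LeakyReLU(inplace=True),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.classifier(self.features(x)).squeeze(1)

model = PatchCNN()
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~1.83M

# Training setup as described; scheduler steps on validation F1 (mode="max").
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max")
```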
Confidence tier assignment
Every patch receives a probability p∈[0,1] from the sigmoid output. Six confidence tiers are assigned: p≤0.10 (very confident non-cancer, auto-clear candidate), 0.10–0.30, 0.30–0.50, 0.50–0.70, 0.70–0.90, and p≥0.90 (very confident cancer, immediate escalation). At p≤0.10, 8,088 patches are cleared with 95.6% accuracy; at p≥0.90, 6,863 patches are flagged with 97.2% precision.
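Tier assignment is a simple bucketing of the sigmoid output; the tier names in this sketch are illustrative:

```python
import numpy as np

TIERS = ["auto_clear_candidate", "p_10_30", "p_30_50",
         "p_50_70", "p_70_90", "immediate_escalation"]

def assign_tier(p: float) -> str:
    """Bucket a sigmoid probability into one of the six confidence tiers.

    Boundaries follow the text above: p<=0.10 clears, p>=0.90 escalates,
    with four interior review tiers in between.
    """
    if p <= 0.10:
        return TIERS[0]
    if p >= 0.90:
        return TIERS[5]
    return TIERS[1 + int(np.digitize(p, [0.30, 0.50, 0.70]))]
```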
GradCAM spatial explanation
For every reviewed patch (gray zone and escalated), GradCAM backpropagates the cancer probability gradient to conv5 and produces a spatial heatmap overlaid on the original H&E stain. The pathologist sees the patch, the probability, and the spatial attribution in one view. GradCAM is generated for all six prediction outcome categories to characterise failure modes.
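A minimal one-view display sketch using matplotlib, pairing the patch, the probability, and the heatmap returned by the GradCAM sketch above; function and argument names are hypothetical:

```python
import matplotlib.pyplot as plt

def show_review_view(patch_rgb, p, cam):
    """Pathologist review view: H&E patch beside its GradCAM overlay."""
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(6, 3))
    ax0.imshow(patch_rgb); ax0.set_title("H&E patch"); ax0.axis("off")
    ax1.imshow(patch_rgb)
    ax1.imshow(cam, cmap="jet", alpha=0.4)       # spatial attribution overlay
    ax1.set_title(f"p(cancer) = {p:.2f}"); ax1.axis("off")
    plt.tight_layout(); plt.show()
```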
Data scaling study (course component)
Independent four-subset experiment (20/40/60/100% of training data) showing non-monotonic AUC scaling: Subset 1 (20%) = 0.81, Subset 2 (40%) = 0.79, Subset 3 (60%) = 0.81, Full (100%) = 0.82. Key finding: AUC at 40% was lower than at 20% before recovering, demonstrating that data volume and model behaviour interact non-linearly, a result that directly informs how AI-assisted screening pipelines should be validated before deployment.
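A sketch of the subset experiment loop; `train_and_eval` is a hypothetical helper that trains a fresh model on the subset and returns test AUC, and in practice each subset would preserve the class ratio:

```python
import numpy as np

# Four-subset scaling experiment over the training pool.
results = {}
rng = np.random.default_rng(42)
for frac in (0.20, 0.40, 0.60, 1.00):
    n = int(frac * len(X_train))
    idx = rng.choice(len(X_train), size=n, replace=False)
    results[frac] = train_and_eval(X_train[idx], y_train[idx], X_test, y_test)

print(results)  # reported: {0.2: 0.81, 0.4: 0.79, 0.6: 0.81, 1.0: 0.82}
```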
Governance & Model Card
Model Card — formal documentation
Model Card structured to meet regulatory documentation requirements for CADe clinical AI.
Intended use and limitations
Intended use: Research-grade first-pass patch-level screener for lymph node metastasis detection from H&E-stained slides. Operates as a CADe (detection) aid — every output reviewed by a qualified pathologist before clinical action.
Out of scope: Autonomous diagnosis, other tissue types (only validated on lymph node), other staining protocols, slide-level diagnosis (patch-level only — no WSI aggregation implemented), other cancer types.
Training data limitations: Camelyon16 derives from 2 Dutch medical centres. Performance on scanners, staining protocols, and patient populations outside these centres is unvalidated. Multi-site calibration required before production deployment.
Known failure modes: Q1 confidence quartile (most uncertain) shows 42.5% miss rate — these cases require mandatory human review. Model can attend to tissue artefacts and staining heterogeneity rather than genuine malignant features (visible in GradCAM heatmaps for FP cases). No slide-level aggregation — patch-level predictions cannot substitute for whole-slide diagnosis.
Regulatory framing
This system falls into the FDA CADe (Computer-Aided Detection) category for digital pathology — software that identifies regions of interest on a whole-slide image for subsequent review by a qualified pathologist. Under FDA guidance (SW/AI Guidance, 2021) and CAP/AMA statements on AI in pathology, CADe tools must be:
- Auditable — every decision traceable to model inputs (GradCAM satisfies this)
- Reversible — pathologist can override any AI triage decision
- Not autonomous — no final diagnostic report without pathologist review
Current status: research-grade only. Prerequisites before clinical submission: prospective multi-centre validation, CLIA compliance for laboratory implementation, IRB approval for patient data use, and a clinical evidence package. CPT code 88342 (immunohistochemistry interpretation) provides the billing framework under which AI-assisted pathology interpretation currently operates.
The business case for this system is strongest when deployed not as a standalone classifier but as a triage layer — routing high-confidence cases away from manual review while concentrating pathologist attention on the uncertain and high-risk cases where human judgment adds the most value.
What the course built vs what was added post-course
| Capability | Course Version | Post-Course Addition |
|---|---|---|
| Full dataset training | ✓ All 4 subsets | ✓ Full 153,988 patches, epoch 21 best |
| Explainability | None | GradCAM on all 6 prediction categories |
| Confidence tiers | None | 6-tier system with clinical decision zones |
| Threshold sensitivity analysis | None | Full auto-clear table with miss rates |
| Subgroup analysis | Confidence quartile only | Q1–Q4 with per-quartile metrics |
| Pathologist workflow integration | None | Auto-clear thresholds, gray zone analysis, cost model |
| Model Card | Partial inline markdown | Full formal Model Card with regulatory framing |
| Deployment | FastAPI code (not hosted) | GitHub — no live deployment (see note below) |
On deployment: The course included FastAPI and Dockerfile code for local deployment. After evaluation, the decision was made to invest the remaining time in extending the clinical analysis (GradCAM, confidence tiers, Model Card) rather than hosting a cloud demo. A patch-level prediction without slide-level aggregation would not represent a genuine clinical workflow — the clinical analysis additions are higher signal than a hosted demo of an incomplete pipeline.