Clinical Relevance
Where this fits in a pathology lab workflow
Today, pathologists review every region of every whole-slide image manually. This system functions as a first-pass triage filter, not a replacement for human judgment.
Digital pathology labs are now scanning whole-slide images at scale — a single sentinel lymph node biopsy produces thousands of 96×96 tissue patches requiring review. At busy labs processing 500+ slides per day, this creates a throughput bottleneck. Manual patch-by-patch review is time-intensive, subject to fatigue-related variability, and creates delays in cancer workup.
This system operates as a Computer-Aided Detection (CADe) first-pass screener: it assigns a cancer probability to every patch in seconds, automatically deprioritizes high-confidence benign patches, and surfaces high-confidence malignant patches for immediate pathologist attention. The pathologist reviews all flagged cases — the AI handles triage, not diagnosis.
Validated Performance
Test set results — 32,998 held-out patches
Trained on 153,988 patches. Validation used for early stopping only. Test set never seen during training.
In clinical terms: AUC-ROC 0.921 means the model correctly ranks a cancerous patch above a non-cancerous patch 92.1% of the time when given one of each. Sensitivity of 82.2% means the system detects 822 of every 1,000 truly cancerous patches. For a screening system deployed as a triage layer — where missed cases still reach pathologist review through the gray zone — this discrimination performance is operationally meaningful.
Why AUC matters more than accuracy here: The test set is 59% non-cancerous / 41% cancerous. A model predicting "non-cancerous" for all patches would achieve 59% accuracy — meaningless. AUC-ROC measures the model's ability to rank cancerous above non-cancerous regardless of threshold, which is what matters for a triage system where the operating threshold is a clinical policy decision.
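As a concrete illustration, the sketch below computes AUC-ROC as a pairwise ranking probability and contrasts it with the majority-class accuracy baseline. The `y_true`/`y_prob` arrays here are synthetic stand-ins for real test predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins: y_true in {0, 1}, y_prob = sigmoid output per patch.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0.0, 1.0)

# AUC-ROC as a ranking probability: P(score of a cancerous patch >
# score of a non-cancerous patch), with ties counted as half.
pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
pairs = pos[:, None] - neg[None, :]
auc_by_ranking = np.mean(pairs > 0) + 0.5 * np.mean(pairs == 0)

print(auc_by_ranking, roc_auc_score(y_true, y_prob))  # the two values agree

# The trivial "all non-cancerous" model scores the majority-class rate.
print("baseline accuracy:", np.mean(y_true == 0))
```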
Subgroup analysis — performance by confidence quartile
Test set stratified by model confidence (output probability p). The quartile breakdown reveals how performance scales with model certainty — a critical signal for setting auto-clear thresholds.
| Confidence Quartile | Mean p | N Patches | Accuracy | F1 | AUC-ROC | Miss Rate |
|---|---|---|---|---|---|---|
| Q1 — Most uncertain | 0.245 | 8,250 | 64.8% | 0.545 | 0.672 | 42.5% |
| Q2 | 0.633 | 8,249 | 84.6% | 0.775 | 0.848 | 23.2% |
| Q3 | 0.849 | 8,249 | 93.3% | 0.895 | 0.929 | 13.1% |
| Q4 — Most confident | 0.966 | 8,250 | 98.3% | 0.986 | 0.989 | 1.8% |
Q4 accuracy of 98.3% at miss rate 1.8% validates the auto-clear concept at the high-confidence end. Q1 (most uncertain) shows 42.5% miss rate — confirming that mandatory human review is necessary for borderline cases. The confidence score is not just a performance metric; it is a genuine signal of model uncertainty that directly informs clinical routing decisions.
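A minimal sketch of the stratification, assuming confidence is measured as distance of p from the 0.5 decision boundary; the report's exact confidence definition may differ, so per-quartile numbers from this sketch will not necessarily match the table. `y_true` and `y_prob` are the held-out labels and sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def quartile_report(y_true, y_prob):
    # Assumed confidence definition: distance from the 0.5 decision boundary.
    confidence = np.abs(y_prob - 0.5)
    order = np.argsort(confidence)        # Q1 = most uncertain ... Q4 = most confident
    for q, idx in enumerate(np.array_split(order, 4), start=1):
        yt, yp = y_true[idx], y_prob[idx]
        pred = (yp >= 0.5).astype(int)
        # Miss rate: fraction of truly cancerous patches predicted non-cancerous.
        miss = ((pred == 0) & (yt == 1)).sum() / max((yt == 1).sum(), 1)
        auc = roc_auc_score(yt, yp) if len(np.unique(yt)) > 1 else float("nan")
        print(f"Q{q}: mean_p={yp.mean():.3f}  n={len(idx)}  "
              f"acc={accuracy_score(yt, pred):.3f}  f1={f1_score(yt, pred):.3f}  "
              f"auc={auc:.3f}  miss={miss:.1%}")
```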
Clinical Decision Zones
Three operating zones — thresholds as clinical policy
The threshold setting is not a technical decision. It is a clinical and institutional risk decision that should be made collaboratively with pathologists and risk committees.
AI auto-clear — no mandatory review
At p≤0.10, the model clears 24.5% of all patches (8,088 of 32,998 in the test set) with 95.6% accuracy. Only 352 of 13,367 truly cancerous patches fall below this threshold, a 2.6% miss rate. This zone operates like a low-risk screening lane: high-confidence benign patches are deprioritised, freeing pathologist time for cases that warrant scrutiny.
Mandatory pathologist review
20.8% of patches (6,879) fall in the gray zone, where model confidence is insufficient for automated decisions. Accuracy in this zone is only 63.0%: the model genuinely cannot classify these patches reliably. The gray zone is analogous to a borderline insurance claim that requires an experienced adjuster: the AI identifies the cases that need human judgment, then steps back.
Immediate escalation — flag for priority review
At p≥0.90, the model flags cases with 97.2% precision — only 2.8% of flagged patches are false positives. These patches are escalated for immediate pathologist attention and possible biopsy planning. At the p≥0.99 threshold, precision reaches 99.7% — near-certainty — though at the cost of recall (only 16.3% of all cancers are caught at this threshold).
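The three zones reduce to a small routing function. The sketch below uses the thresholds discussed above as illustrative defaults; the function and label names are hypothetical:

```python
def route_patch(p: float, clear_at: float = 0.10, escalate_at: float = 0.90) -> str:
    """Map a patch-level cancer probability to a clinical routing decision.

    Thresholds default to the operating points analysed above; in deployment
    they are a clinical policy decision, not a code constant.
    """
    if p <= clear_at:
        return "auto_clear"        # high-confidence benign: deprioritised, still auditable
    if p >= escalate_at:
        return "escalate"          # high-confidence malignant: priority review
    return "mandatory_review"      # gray zone: human judgment required
```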
Auto-clear threshold sensitivity — choosing the operating point
| Auto-clear threshold | Patches cleared | % of test set | Cancer missed | Miss rate |
|---|---|---|---|---|
| p ≤ 0.05 | 4,353 | 13.2% | 115 | 0.86% |
| p ≤ 0.10 | 8,088 | 24.5% | 352 | 2.63% |
| p ≤ 0.15 | 10,821 | 32.8% | 596 | 4.46% |
| p ≤ 0.20 | 12,720 | 38.5% | 798 | 5.97% |
| p ≤ 0.25 | 14,224 | 43.1% | 1,020 | 7.63% |
| p ≤ 0.30 | 15,538 | 47.1% | 1,269 | 9.49% |
The auto-clear threshold should not be set by the model developer. At p≤0.10, the system clears 24.5% of patches with a 2.6% cancer miss rate — directionally consistent with published PathAI productivity gains (20%, 2023) and the Dutch lab case study showing 84% case handling time reduction (Histopathology, 2018). However, the acceptable miss rate is a clinical, ethical, and institutional decision. The system makes this decision transparent and auditable; it does not make it unilaterally.
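The sensitivity table can be regenerated from held-out predictions with a short sweep; a sketch, assuming `y_true`/`y_prob` hold the test labels and sigmoid outputs:

```python
import numpy as np

def auto_clear_sweep(y_true, y_prob,
                     thresholds=(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)):
    """Reproduce the auto-clear sensitivity table columns from predictions."""
    n_pos = int((y_true == 1).sum())
    for t in thresholds:
        cleared = y_prob <= t
        missed = int((cleared & (y_true == 1)).sum())
        print(f"p<={t:.2f}  cleared={cleared.sum():6d} ({cleared.mean():5.1%})  "
              f"cancer_missed={missed:5d}  miss_rate={missed / n_pos:5.2%}")
```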
Explainability
GradCAM spatial explanation — post-course addition
Added after course completion to address a key gap: a classifier without spatial explanation is a black box that pathologists cannot interrogate or trust.
The course project produced accurate predictions with no indication of where in the tissue patch the model was attending. This is insufficient for clinical use: a pathologist cannot act on a probability alone — they need to know whether the model is attending to genuine cellular abnormalities (nuclear enlargement, irregular chromatin, increased mitotic figures) or artefactual signals. GradCAM was implemented as a post-course addition specifically to close this gap.
GradCAM Implementation — conv5 Target
Gradient-weighted Class Activation Mapping applied to the final convolutional layer (conv5, 512 filters, 1×1 spatial). Backpropagates the cancer probability gradient to generate a spatial heatmap showing which regions of the 32×32 centre crop drove the prediction. Implemented directly from the Selvaraju et al. (ICCV 2017) algorithm, with no external CAM library. Produces overlaid heatmaps for all six prediction categories, covering correct and incorrect predictions across both the cancerous and non-cancerous classes.
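A minimal from-scratch GradCAM sketch in PyTorch, following the Selvaraju et al. recipe described above. Hook-based capture is one common implementation choice; the project's exact code may differ, and `model`, `x`, and `target_layer` are generic placeholders:

```python
import torch
import torch.nn.functional as F

def gradcam(model, x, target_layer):
    """Minimal GradCAM (Selvaraju et al., ICCV 2017) for a sigmoid classifier.

    x: (1, 3, H, W) patch tensor; target_layer: a conv module inside `model`.
    Returns an (H, W) heatmap normalised to [0, 1].
    """
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    try:
        model.eval()
        p = model(x)              # cancer probability from the sigmoid head
        model.zero_grad()
        p.sum().backward()        # gradient of the cancer score w.r.t. activations
    finally:
        h1.remove(); h2.remove()

    a, g = acts["a"], grads["g"]                  # both (1, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)    # GAP of gradients per filter
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    # NOTE: at conv5 the spatial grid is 1x1, so the upsampled map is coarse;
    # hooking an earlier block yields finer localisation.
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = cam - cam.min()
    return (cam / cam.max().clamp(min=1e-8)).squeeze().detach()
```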
Clinical Interpretation of Heatmaps
For true positive predictions (correctly identified cancer), GradCAM heatmaps show concentrated attention on the central patch region — consistent with the PCam labelling convention that cancer presence is defined by the central 32×32 pixel area. For false positive predictions, heatmaps reveal model attention to tissue artefacts or staining heterogeneity at patch boundaries — exactly the failure mode a pathologist needs to see to calibrate trust. The GradCAM output is the primary audit mechanism supporting CADe regulatory compliance: every prediction is spatially attributable.
Why GradCAM is necessary for regulatory compliance: Under FDA CADe and CAP/AMA guidance, AI tools in pathology must be auditable and reversible. A prediction without spatial attribution is not auditable — a pathologist cannot verify whether the model's reasoning aligns with histological ground truth. GradCAM provides the spatial bridge between the model's probability output and the tissue features that drove it, making every auto-clear and escalation decision reviewable.
Technical Pipeline
From raw WSI patch to clinical triage decision
Post-course rebuild: full pipeline from noise filtering through GradCAM to confidence-tiered routing.
Noise filtering — blank patch removal
All 220,025 raw patches were scanned for blank or artifact content, flagging any patch in which ≥95% of central pixels are identical. 41 corrupted patches were identified and removed before any model sees them. Final clean dataset: 219,984 patches. Production data hygiene at clinical scale, not an afterthought.
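One plausible implementation of the blank-patch check, assuming 96×96 RGB patches and the PCam central 32×32 region; the exact pixel test used in the pipeline is an assumption:

```python
import numpy as np

def is_blank(patch: np.ndarray, frac: float = 0.95) -> bool:
    """Flag blank/artifact patches where >=95% of central pixels share one value.

    patch: (96, 96, 3) uint8 RGB array. The centre-crop bounds follow the
    PCam 32x32 labelling region.
    """
    centre = patch[32:64, 32:64]
    flat = centre.reshape(-1, centre.shape[-1])        # one row per pixel
    _, counts = np.unique(flat, axis=0, return_counts=True)
    return counts.max() / len(flat) >= frac

# Usage: clean = [p for p in raw_patches if not is_blank(p)]
```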
Stratified train/val/test split
Train: 153,988 (62,382 cancerous / 91,606 non-cancerous). Val: 32,998. Test: 32,998 — held out, never seen during training or hyperparameter selection. Class ratio maintained across all splits. This split structure mirrors regulatory evidence generation requirements.
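A sketch of how the split sizes above can be produced with stratification preserved; the placeholder `X`, `y`, and random seed are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the real class ratio (~40.5% cancerous).
rng = np.random.default_rng(42)
y = (rng.random(219_984) < 0.405).astype(int)
X = np.arange(219_984)                        # stands in for the patch array

# 153,988 / 32,998 / 32,998 split, class ratio preserved in every split.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=65_996, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=32_998, stratify=y_tmp, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 153988 32998 32998
```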
5-layer custom CNN — 1.83M parameters
Five convolutional blocks (32→64→128→256→512 filters), each with Batch Normalisation, Leaky ReLU, MaxPool, and Dropout2d. Two fully-connected layers (512→512→1) with sigmoid output. Trained from scratch on 32×32 centre-cropped patches. Adam optimiser (lr=1e-3), ReduceLROnPlateau scheduler, early stopping at epoch 21 (best val F1=0.8164).
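A sketch of the architecture as described; kernel sizes, dropout rates, and padding are assumptions not stated above, chosen so the parameter count lands near the stated 1.83M:

```python
import torch
import torch.nn as nn

def block(c_in, c_out, p=0.1):
    # Conv -> BatchNorm -> LeakyReLU -> MaxPool -> Dropout2d, as described above.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Dropout2d(p),
    )

class PatchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            block(3, 32), block(32, 64), block(64, 128),
            block(128, 256), block(256, 512),   # 32x32 -> 1x1 after five pools
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 512), nn.LeakyReLU(inplace=True),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.classifier(self.features(x)).squeeze(1)

model = PatchCNN()
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~1.83M

# Training setup as described; scheduler steps on validation F1 (mode="max").
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max")
```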
Confidence tier assignment
Every patch receives a probability p∈[0,1] from the sigmoid output. Six confidence tiers are assigned: p≤0.10 (very confident non-cancer, auto-clear candidate), 0.10–0.30, 0.30–0.50, 0.50–0.70, 0.70–0.90, and p≥0.90 (very confident cancer, immediate escalation). At p≤0.10, 8,088 patches are cleared with 95.6% accuracy; at p≥0.90, 6,863 patches are flagged with 97.2% precision.
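Tier assignment is a simple bucketing of the sigmoid output; the tier names in this sketch are illustrative:

```python
import numpy as np

TIERS = ["auto_clear_candidate", "p_10_30", "p_30_50",
         "p_50_70", "p_70_90", "immediate_escalation"]

def assign_tier(p: float) -> str:
    """Bucket a sigmoid probability into one of the six confidence tiers.

    Boundaries follow the text above: p<=0.10 clears, p>=0.90 escalates,
    with four interior review tiers in between.
    """
    if p <= 0.10:
        return TIERS[0]
    if p >= 0.90:
        return TIERS[5]
    return TIERS[1 + int(np.digitize(p, [0.30, 0.50, 0.70]))]
```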
GradCAM spatial explanation
For every reviewed patch (gray zone and escalated), GradCAM backpropagates the cancer probability gradient to conv5 and produces a spatial heatmap overlaid on the original H&E stain. The pathologist sees the patch, the probability, and the spatial attribution in one view. GradCAM is generated for all six prediction outcome categories to characterise failure modes.
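A minimal one-view display sketch using matplotlib, pairing the patch, the probability, and the heatmap returned by the GradCAM sketch above; function and argument names are hypothetical:

```python
import matplotlib.pyplot as plt

def show_review_view(patch_rgb, p, cam):
    """Pathologist review view: H&E patch beside its GradCAM overlay."""
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(6, 3))
    ax0.imshow(patch_rgb); ax0.set_title("H&E patch"); ax0.axis("off")
    ax1.imshow(patch_rgb)
    ax1.imshow(cam, cmap="jet", alpha=0.4)       # spatial attribution overlay
    ax1.set_title(f"p(cancer) = {p:.2f}"); ax1.axis("off")
    plt.tight_layout(); plt.show()
```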
Data scaling study (course component)
Independent four-subset experiment (20/40/60/100% of training data) showing non-monotonic AUC scaling: Subset 1 (20%) = 0.81, Subset 2 (40%) = 0.79, Subset 3 (60%) = 0.81, Full (100%) = 0.82. Key finding: AUC at 40% was lower than at 20% before recovering, demonstrating that data volume and model behaviour interact non-linearly, a result that directly informs how AI-assisted screening pipelines should be validated before deployment.
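A sketch of the subset experiment loop; `train_and_eval` is a hypothetical helper that trains a fresh model on the subset and returns test AUC, and in practice each subset would preserve the class ratio:

```python
import numpy as np

# Four-subset scaling experiment over the training pool.
results = {}
rng = np.random.default_rng(42)
for frac in (0.20, 0.40, 0.60, 1.00):
    n = int(frac * len(X_train))
    idx = rng.choice(len(X_train), size=n, replace=False)
    results[frac] = train_and_eval(X_train[idx], y_train[idx], X_test, y_test)

print(results)  # reported: {0.2: 0.81, 0.4: 0.79, 0.6: 0.81, 1.0: 0.82}
```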
Governance & Model Card
Model Card — formal documentation
Model Card structured to meet regulatory documentation requirements for CADe clinical AI.
Intended use and limitations
Intended use: Research-grade first-pass patch-level screener for lymph node metastasis detection from H&E-stained slides. Operates as a CADe (detection) aid — every output reviewed by a qualified pathologist before clinical action.
Out of scope: Autonomous diagnosis, other tissue types (only validated on lymph node), other staining protocols, slide-level diagnosis (patch-level only — no WSI aggregation implemented), other cancer types.
Training data limitations: Camelyon16 derives from 2 Dutch medical centres. Performance on scanners, staining protocols, and patient populations outside these centres is unvalidated. Multi-site calibration required before production deployment.
Known failure modes: Q1 confidence quartile (most uncertain) shows 42.5% miss rate — these cases require mandatory human review. Model can attend to tissue artefacts and staining heterogeneity rather than genuine malignant features (visible in GradCAM heatmaps for FP cases). No slide-level aggregation — patch-level predictions cannot substitute for whole-slide diagnosis.
Regulatory framing
This system falls into the FDA CADe (Computer-Aided Detection) category for digital pathology — software that identifies regions of interest on a whole-slide image for subsequent review by a qualified pathologist. Under FDA guidance (SW/AI Guidance, 2021) and CAP/AMA statements on AI in pathology, CADe tools must be:
- Auditable — every decision traceable to model inputs (GradCAM satisfies this)
- Reversible — pathologist can override any AI triage decision
- Not autonomous — no final diagnostic report without pathologist review
Current status: research-grade only. Prerequisites before clinical submission: prospective multi-centre validation, CLIA compliance for laboratory implementation, IRB approval for patient data use, and a clinical evidence package. CPT code 88342 (immunohistochemistry interpretation) provides the billing framework under which AI-assisted pathology interpretation currently operates.
The business case for this system is strongest when deployed not as a standalone classifier but as a triage layer — routing high-confidence cases away from manual review while concentrating pathologist attention on the uncertain and high-risk cases where human judgment adds the most value.
What the course built vs what was added post-course
| Capability | Course Version | Post-Course Addition |
|---|---|---|
| Full dataset training | ✓ All 4 subsets | ✓ Full 153,988 patches, epoch 21 best |
| Explainability | None | GradCAM on all 6 prediction categories |
| Confidence tiers | None | 6-tier system with clinical decision zones |
| Threshold sensitivity analysis | None | Full auto-clear table with miss rates |
| Subgroup analysis | Confidence quartile only | Q1–Q4 with per-quartile metrics |
| Pathologist workflow integration | None | Auto-clear thresholds, gray zone analysis, cost model |
| Model Card | Partial inline markdown | Full formal Model Card with regulatory framing |
| Deployment | FastAPI code (not hosted) | GitHub — no live deployment (see note below) |
On deployment: The course included FastAPI and Dockerfile code for local deployment. After evaluation, the decision was made to invest the remaining time in extending the clinical analysis (GradCAM, confidence tiers, Model Card) rather than hosting a cloud demo. A patch-level prediction without slide-level aggregation would not represent a genuine clinical workflow — the clinical analysis additions are higher signal than a hosted demo of an incomplete pipeline.