Clinical Fit
Where this fits in an active diagnostic workflow
ASD diagnosis today is slow, subjective, and resource-intensive. This system is designed to address a specific gap in the pathway.
Current ASD diagnostic pathways rely heavily on behavioural assessments — ADOS-2, ADI-R — which require specialist time, are subjective, and create diagnostic delays of 18–24 months in many health systems. Structural MRI is routinely acquired in paediatric neurodevelopment referrals. The structural data already exists; it is just not being used algorithmically.
This system is designed as a pre-assessment triage layer: it processes the existing MRI acquisition, assigns a probability of ASD with uncertainty bounds, highlights the neuroanatomical regions driving the inference, and flags cases for expedited specialist review. The system does not replace the clinician — it ensures that the 30% of referrals with the clearest neuroimaging signatures are seen first, and that every specialist is presented with spatial evidence rather than a raw number.
The operational proposition: The system reuses MRI data clinicians already have. No additional acquisition time, no new infrastructure. The PDF report integrates into an existing referral workflow without requiring a new interface — a consultant opens the same patient file, sees the AI brief alongside standard radiological notes, and proceeds. This is designed from the workflow backward, not from the model outward.
Technical Pipeline
Eight-stage clinical inference pipeline
Every stage mirrors what a production medical imaging AI system requires — not a research notebook.
NIfTI ingestion + axial slice extraction
Accepts .nii and .nii.gz volumes — the standard clinical neuroimaging format. Handles 3D/4D volumes, variable slice counts, and non-standard orientations without manual preprocessing.
5-metric quality gate
Mean intensity (blank rejection) · Brain coverage fraction · Pixel standard deviation (uniform slice rejection) · Laplacian variance (blur/motion rejection) · Contour circularity ≥ 0.45 (non-brain shape rejection). Removes ~6.5% of slices — consistent across train/val/test. No manual slice curation required at inference time.
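Four of the five metrics can be computed with numpy alone; the sketch below uses illustrative thresholds (the deployed cutoffs are not published here) and omits the contour-circularity check, which requires a contour extractor such as OpenCV.

```python
import numpy as np

def laplacian_variance(img: np.ndarray) -> float:
    """Variance of a discrete Laplacian; low values flag blur/motion."""
    lap = (-4.0 * img[1:-1, 1:-1]
           + img[:-2, 1:-1] + img[2:, 1:-1]
           + img[1:-1, :-2] + img[1:-1, 2:])
    return float(lap.var())

def passes_quality_gate(slice2d: np.ndarray,
                        min_mean: float = 5.0,
                        min_coverage: float = 0.05,
                        min_std: float = 1.0,
                        min_lap_var: float = 10.0) -> bool:
    """Illustrative thresholds only; circularity check omitted."""
    if slice2d.mean() < min_mean:                     # blank-slice rejection
        return False
    peak = slice2d.max()
    coverage = float((slice2d > 0.1 * peak).mean()) if peak > 0 else 0.0
    if coverage < min_coverage:                       # brain coverage fraction
        return False
    if slice2d.std() < min_std:                       # uniform-slice rejection
        return False
    if laplacian_variance(slice2d) < min_lap_var:     # blur/motion rejection
        return False
    return True
```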
Brain ROI crop + normalisation
Contour-based bounding box extraction removes black background before XAI — eliminating the background superpixel artefact that commonly corrupts LIME explanations in neuroimaging pipelines. Resize to 224×224 with ImageNet-style normalisation.
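A minimal sketch of the crop-and-normalise step. The deployed pipeline uses contour detection and interpolated resizing (e.g. OpenCV); here a threshold mask and nearest-neighbour indexing keep the sketch dependency-free, and the single-channel mean/std (0.449, 0.226) are the ImageNet per-channel statistics averaged, an assumption rather than the system's published values.

```python
import numpy as np

def crop_and_normalise(slice2d: np.ndarray) -> np.ndarray:
    """Crop to the brain bounding box, resize to 224x224, standardise.

    Assumes a valid slice with non-zero foreground.
    """
    mask = slice2d > 0.1 * slice2d.max()              # crude foreground mask
    ys, xs = np.nonzero(mask)
    crop = slice2d[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # nearest-neighbour resize to 224x224
    ri = (np.arange(224) * crop.shape[0] // 224).astype(int)
    ci = (np.arange(224) * crop.shape[1] // 224).astype(int)
    resized = crop[np.ix_(ri, ci)].astype(float)
    resized /= resized.max()                          # scale to [0, 1]
    # ImageNet-style standardisation, channel stats averaged to one value
    return (resized - 0.449) / 0.226
```

Removing the black background before resizing is what keeps background superpixels out of the LIME segmentation downstream.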
CNN inference — all valid slices
5-layer skip-connected CNN (1.6M parameters, NAdam-trained) processes every quality-filtered axial slice independently. P(ASD) and quality score computed per slice. No slice-level labelling required at deployment.
Subject-level aggregation via quality-weighted voting
Weighted average of P(ASD) across all valid slices, where weight = quality_score × confidence. Subject-level probability derived from consensus — not from any single slice — matching how a radiologist integrates information across a volume.
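The aggregation rule above reduces to a weighted mean; a minimal sketch, assuming per-slice arrays of P(ASD), quality score, and confidence:

```python
import numpy as np

def subject_probability(p_asd, quality, confidence) -> float:
    """Quality-weighted consensus P(ASD) across a subject's valid slices.

    weight = quality_score * confidence, per the aggregation rule above.
    """
    weights = np.asarray(quality, float) * np.asarray(confidence, float)
    return float(np.average(np.asarray(p_asd, float), weights=weights))
```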
XAI — GradCAM + LIME on top-K slices
Highest-quality slices selected for dual explanation. GradCAM gradient-weighted class activation maps identify spatially which regions drove the classification. LIME superpixel perturbation (500 samples) provides complementary causal evidence. Spatial centre-of-mass mapped to anatomical region for clinical narration.
MC-Dropout uncertainty quantification
30 stochastic forward passes with Dropout2d active at inference. σ < 0.02 = low uncertainty. σ > 0.08 = review flag displayed inline. Critically: false negatives show σ = 0.150 (correctly flagged); false positives show σ = 0.005 — confident-wrong mode documented in Model Card.
LLM-generated clinical narrative + PDF export
Claude Haiku synthesises all structured outputs — confidence, vote distribution, GradCAM energy, anatomical region, LIME status, uncertainty band, site reliability, demographics — into clinical language. Exported as formatted PDF. LLM output includes explicit model limitations framing per regulatory guidance.
Validated Performance
Validated across 18,814 quality-filtered test slices
Slice-level validation metrics. Subject-level clinical deployment uses aggregation pipeline above.
Well-calibrated probabilities are clinically essential. A Brier score of 0.027 means the model's stated confidence closely tracks actual outcomes: among cases assigned roughly 87% probability, roughly 87% are truly positive, so the figure is not optimistic overconfidence. The calibration plot confirms near-diagonal reliability across the full probability range. This is what makes the uncertainty flags meaningful: the model's stated confidence is trustworthy.
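The Brier score is simply the mean squared error between predicted probabilities and the 0/1 outcomes:

```python
import numpy as np

def brier_score(p, y) -> float:
    """Mean squared error between predicted probabilities and binary labels.
    0.0 is perfect; 0.25 is the score of always predicting 0.5."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))
```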
Subgroup performance — sex and acquisition site
ABIDE-I is 85% male. Sex imbalance is a known limitation documented in the Model Card. Site variance is clinically significant and drives the most important operational constraint: site-specific calibration would be required before production deployment.
Site variance is the most operationally significant finding. PITT sensitivity of 88.5% vs UM_1 at 98.5% is a 10 percentage point gap driven by scanner heterogeneity — different field strengths, voxel sizes, and acquisition protocols. The deployed system displays a site-specific reliability badge inline on every output. In a production deployment, site-specific calibration layers or harmonisation preprocessing (ComBat or equivalent) would be required before clinical use.
Explainability
Dual-method spatial explanation — not just a number
GradCAM and LIME answer independent questions about the same prediction and are then cross-validated for consistency.
A clinical AI system that produces a probability without spatial justification is not a decision support system — it is a black box that clinicians cannot interrogate or trust. Every output from this system includes at minimum two independent explanation methods and an explicit uncertainty quantification that distinguishes genuine confidence from model overconfidence.
Gradient-Weighted Class Activation Mapping (GradCAM)
Backpropagates classification score gradients to the final convolutional layer (conv5). Produces a spatial heatmap showing which axial-slice regions contributed most to the ASD classification. Computationally fast (<1s per slice). Documented limitation: gradient saturation at extreme confidence levels produces flat heatmaps — at 100% confidence, backpropagated gradients approach zero. The system flags this condition and falls back to LIME as the primary explanation source in high-confidence cases.
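Given the activations and gradients captured at the final convolutional layer (the hooking machinery is omitted here), the standard Grad-CAM computation is a few lines of numpy. Note how vanishing gradients yield an all-zero, flat map, which is exactly the saturation limitation documented above.

```python
import numpy as np

def gradcam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Standard Grad-CAM from final-conv activations/gradients, shape (K, H, W).

    The channel weights alpha_k are the spatially averaged gradients; the map
    is the ReLU of the weighted sum of activation channels, max-normalised.
    """
    weights = gradients.mean(axis=(1, 2))                 # alpha_k
    cam = np.einsum('k,khw->hw', weights, activations)
    cam = np.maximum(cam, 0.0)                            # ReLU
    return cam / cam.max() if cam.max() > 0 else cam      # flat if saturated
```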
LIME — Local Interpretable Model-Agnostic Explanations
Segments the MRI slice into anatomical superpixels. Trains a local linear surrogate over 500 perturbed samples to identify which regions causally drive the prediction. Slower (~20s/slice on CPU). Independent of gradient flow — provides complementary causal evidence. Agreement between GradCAM and LIME across the top-25% activated regions (IoU) is computed and reported. Divergence between the two methods is itself a clinically meaningful signal flagged in the report.
MC-Dropout Uncertainty Quantification
Dropout2d layers remain active at inference. 30 stochastic forward passes produce a distribution over P(ASD), reported as mean ± σ. σ < 0.02 = low uncertainty (reliable prediction). σ > 0.08 = review flag. Critical finding: false negatives show σ = 0.150 (uncertainty correctly flags the miss). False positives show σ = 0.005 — the dangerous mode where the model is confidently wrong. Documented in the Model Card as a known failure mode requiring human override protocol.
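A minimal PyTorch sketch of MC-Dropout at inference, assuming a single-logit classification head (the real model's head is not shown here). Only the dropout modules are switched back to train mode, so any normalisation layers keep their inference statistics.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, passes: int = 30):
    """Mean and std of P(ASD) over `passes` stochastic forward passes."""
    model.eval()
    # re-enable only the dropout layers; everything else stays in eval mode
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(passes)])
    return probs.mean(dim=0), probs.std(dim=0)
```

The returned std is the σ reported on every output; thresholding it against 0.02 and 0.08 gives the low-uncertainty and review-flag bands.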
Anatomical Region Labelling + Agreement Analysis
GradCAM centre-of-mass coordinates mapped to approximate anatomical region via heuristic z-position lookup (cerebellum/brainstem → temporal/basal ganglia → parietal/temporal → frontal). Labelled as heuristic in the clinical report. GradCAM-LIME cross-validation: Pearson r and IoU of top-25% activated regions computed across the test set. Correctly classified typical-control (TC) predictions: r = 0.56, IoU = 0.44. False-negative case: r = −0.29 (anti-correlated explanation — high-priority clinical flag).
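The two agreement statistics are straightforward to compute once both saliency maps are on the same grid; a minimal sketch, assuming dense 2D maps:

```python
import numpy as np

def explanation_agreement(cam: np.ndarray, lime_map: np.ndarray,
                          top_frac: float = 0.25):
    """Pearson r and IoU of the top-`top_frac` activated regions of two maps."""
    a, b = cam.ravel(), lime_map.ravel()
    r = float(np.corrcoef(a, b)[0, 1])
    top_a = a >= np.quantile(a, 1.0 - top_frac)       # top-25% mask, map A
    top_b = b >= np.quantile(b, 1.0 - top_frac)       # top-25% mask, map B
    iou = float((top_a & top_b).sum() / (top_a | top_b).sum())
    return r, iou
```

A strongly negative r, as in the false-negative case above, means the two methods highlight opposite regions, which is what triggers the high-priority flag.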
Governance & Regulatory Framing
Built for how clinical AI actually gets cleared
Model Card, regulatory pathway, and failure mode documentation built in — not added as an afterthought.
Regulatory pathway (current status)
This system falls into the SaMD Class II category under FDA Digital Health Center of Excellence guidance — software that analyses medical images to inform clinical decisions regarding a neurodevelopmental condition. De Novo pathway would likely be required given the novel indications-for-use in ASD neuroimaging.
Current status: research-grade only. Prerequisites before clinical submission: prospective multi-site validation on data independent of ABIDE-I, subject-level performance validation (not slice-level), neuroradiologist review of XAI saliency map anatomical claims, and a complete clinical evidence package. The system is designed with this pathway in mind — the documentation, subgroup analysis, and failure mode characterisation are structured to feed a regulatory submission, not just a conference paper.
What the Model Card documents
Intended use: research decision support only. Out-of-scope uses: standalone diagnostic decisions, non-structural MRI modalities, paediatric populations under 5. Training data limitations: ABIDE-I is retrospective, 85% male, and restricted to 17 academic sites. Known failure modes: confident false positives (σ near zero), GradCAM saturation at high confidence, site-specific performance gaps of up to 10 percentage points. Sex performance gap: 2.4 percentage points, driven by training imbalance. Site reliability warnings are displayed inline on all outputs. Regulatory status is included verbatim in every exported PDF report.
The system is designed so that any clinician who downloads the report understands exactly what they are and are not holding. The PDF includes the regulatory framing, the model's sensitivity/specificity at the acquisition site, the uncertainty estimate, and an explicit statement that the output is decision support — not diagnosis. This is not a disclaimer added at the end; it is the primary purpose of the output layer.
Origins
The 2023 B.Tech foundation — and why this became something more
The original project established the direction. The 2026 rebuild asked what it would take for that work to matter clinically.
| Capability | 2023 B.Tech Baseline | 2026 Clinical System |
|---|---|---|
| Input format | PNG slices (pre-extracted) | NIfTI volumes (.nii, .nii.gz) |
| Data quality control | Raw slices including blank frames | 5-metric automated quality gate · ~6.5% removed |
| Evaluation metrics | Accuracy + AUC only | + AUPRC, Brier score, calibration plot, threshold analysis |
| Explainability | LIME attempt (incomplete) | GradCAM + LIME + spatial agreement cross-validation |
| Uncertainty quantification | None | MC-Dropout, 30 stochastic passes, σ reported on every output |
| Subgroup analysis | None | Sex-stratified + site-stratified · documented in Model Card |
| Clinical output layer | None | LLM clinical narrative, site reliability badge, PDF report |
| Deployment | Kaggle notebook | Streamlit app · HuggingFace Spaces · always-on |
| Governance | None | Model Card, regulatory framing, failure mode analysis |
Technical stack
The gap between a model that achieves 99.4% AUC in a notebook and a system a clinician would trust is not a performance gap. It is an engineering gap — explainability, uncertainty quantification, calibration verification, site reliability documentation, and a clinical report that states exactly what the result is and what it is not.