Live System — Hugging Face Spaces ↗

// Neuroimaging AI · Clinical Decision Support · Structural MRI

Autism Spectrum Disorder
Detection from Structural MRI

A subject-level clinical AI system that takes raw NIfTI brain scans, runs a full clinical-grade inference pipeline, and delivers structured diagnostic support — GradCAM + LIME spatial explanations, MC-Dropout uncertainty quantification, site-reliability indicators, and an LLM-generated clinical report. Built on ABIDE-I: 1,067 subjects across 17 acquisition sites.

0.994
AUC-ROC
95.6%
Sensitivity
97.2%
Specificity
0.027
Brier Score

Clinical Fit

Where this fits in an active diagnostic workflow

ASD diagnosis today is slow, subjective, and resource-intensive. This system is designed to address a specific gap in the pathway.

Current ASD diagnostic pathways rely heavily on behavioural assessments — ADOS-2, ADI-R — which require specialist time, are subjective, and create diagnostic delays of 18–24 months in many health systems. Structural MRI is routinely acquired in paediatric neurodevelopment referrals. The structural data already exists; it is just not being used algorithmically.

This system is designed as a pre-assessment triage layer: it processes the existing MRI acquisition, assigns a probability of ASD with uncertainty bounds, highlights the neuroanatomical regions driving the inference, and flags cases for expedited specialist review. The system does not replace the clinician — it ensures that the 30% of referrals with the clearest neuroimaging signatures are seen first, and that every specialist is presented with spatial evidence rather than a raw number.

📤
MRI Acquisition
Routine structural T1 scan — no new imaging protocol required
🔬
Quality Gating
4-metric automated quality filter; ~6.5% of suboptimal slices removed
🧠
AI Inference
Subject-level ASD probability with confidence interval across all valid slices
🗺️
Spatial XAI
GradCAM + LIME heatmaps identifying anatomical regions of interest
📋
Clinical Report
PDF with prediction, uncertainty, spatial findings, and site reliability flag
👩‍⚕️
Specialist Review
Prioritised queue: high-confidence cases reviewed first with AI-generated brief
30%
High-confidence triage
Estimated share of referrals with clear neuroimaging signatures, surfaced first
18–24 mo
Typical diagnostic delay
Current average from referral to confirmed ASD diagnosis in many health systems
17 sites
Multi-site validated
System validated across scanner heterogeneity — not just a single-site lab result

The operational proposition: The system reuses MRI data clinicians already have. No additional acquisition time, no new infrastructure. The PDF report integrates into an existing referral workflow without requiring a new interface — a consultant opens the same patient file, sees the AI brief alongside standard radiological notes, and proceeds. This is designed from the workflow backward, not from the model outward.


Technical Pipeline

Eight-stage clinical inference pipeline

Every stage mirrors what a production medical imaging AI system requires — not a research notebook.

01

NIfTI ingestion + axial slice extraction

Accepts .nii and .nii.gz volumes — the standard clinical neuroimaging format. Handles 3D/4D volumes, variable slice counts, and non-standard orientations without manual preprocessing.
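The slice-extraction step can be sketched as follows. This is a minimal illustration, not the deployed code: volume loading via nibabel is shown only in a comment, and the assumption that the third array axis is axial is mine (the real pipeline also handles non-standard orientations).

```python
import numpy as np

def extract_axial_slices(vol):
    """Return 2D axial slices from a 3D/4D volume array.

    In the real pipeline the array would come from nibabel, e.g.:
        vol = nib.load("scan.nii.gz").get_fdata()
    Assumes the third axis is the axial (inferior-superior) direction.
    """
    vol = np.asarray(vol)
    if vol.ndim == 4:            # 4D volume: keep the first timepoint
        vol = vol[..., 0]
    if vol.ndim != 3:
        raise ValueError(f"expected a 3D/4D volume, got {vol.ndim}D")
    return [vol[:, :, z] for z in range(vol.shape[2])]
```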

02

4-metric quality gate

Four automated metrics — mean intensity (blank rejection) · brain coverage fraction · pixel standard deviation (uniform-slice rejection) · Laplacian variance (blur/motion rejection) — plus a contour circularity ≥ 0.45 shape check (non-brain rejection). Removes ~6.5% of slices, consistent across train/val/test splits. No manual slice curation required at inference time.

03

Brain ROI crop + normalisation

Contour-based bounding box extraction removes black background before XAI — eliminating the background superpixel artefact that commonly corrupts LIME explanations in neuroimaging pipelines. Resize to 224×224 with ImageNet-style normalisation.
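A simplified stand-in for the crop step, assuming a threshold-mask bounding box in place of the full contour extraction — the point is that the black background is removed before any explainer runs:

```python
import numpy as np

def crop_brain_roi(img, thresh=10):
    # Bounding box of above-threshold pixels; removes the black background
    # that would otherwise dominate LIME's superpixel segmentation.
    mask = np.asarray(img) > thresh
    if not mask.any():
        return img                       # nothing brain-like: pass through
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

The cropped array would then be resized to 224×224 and normalised before inference.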

04

CNN inference — all valid slices

5-layer skip-connected CNN (1.6M parameters, NAdam-trained) processes every quality-filtered axial slice independently. P(ASD) and quality score computed per slice. No slice-level labelling required at deployment.
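The skip-connection idea can be illustrated with a single residual-style conv block; the channel counts, normalisation, and exact wiring of the actual 1.6M-parameter network are not reproduced here.

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """Illustrative skip-connected conv block (details are assumptions)."""

    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Identity skip: gradients flow past the conv stack unimpeded.
        return self.act(x + self.conv(x))
```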

05

Subject-level aggregation via quality-weighted voting

Weighted average of P(ASD) across all valid slices, where weight = quality_score × confidence. Subject-level probability derived from consensus — not from any single slice — matching how a radiologist integrates information across a volume.
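The aggregation rule stated above (weight = quality_score × confidence) reduces to a short weighted average; this sketch assumes all three quantities arrive as per-slice arrays.

```python
import numpy as np

def aggregate_subject(p_asd, quality, confidence):
    """Subject-level P(ASD) as a quality-weighted vote across slices.

    weight_i = quality_i * confidence_i, per the pipeline description.
    """
    w = np.asarray(quality, float) * np.asarray(confidence, float)
    return float(np.average(np.asarray(p_asd, float), weights=w))
```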

06

XAI — GradCAM + LIME on top-K slices

Highest-quality slices selected for dual explanation. GradCAM gradient-weighted class activation maps identify spatially which regions drove the classification. LIME superpixel perturbation (500 samples) provides complementary causal evidence. Spatial centre-of-mass mapped to anatomical region for clinical narration.

07

MC-Dropout uncertainty quantification

30 stochastic forward passes with Dropout2d active at inference. σ < 0.02 = low uncertainty. σ > 0.08 = review flag displayed inline. Critically: false negatives show σ = 0.150 (correctly flagged); false positives show σ = 0.005 — confident-wrong mode documented in Model Card.

08

LLM-generated clinical narrative + PDF export

Claude Haiku synthesises all structured outputs — confidence, vote distribution, GradCAM energy, anatomical region, LIME status, uncertainty band, site reliability, demographics — into clinical language. Exported as formatted PDF. LLM output includes explicit model limitations framing per regulatory guidance.


Validated Performance

Validated across 18,814 quality-filtered test slices

Slice-level validation metrics. Subject-level clinical deployment uses aggregation pipeline above.

0.994
AUC-ROC
Near-perfect discrimination
95.6%
Sensitivity
ASD detection rate
97.2%
Specificity
True negative rate
0.027
Brier Score
Calibrated (<0.10 threshold)

Well-calibrated probabilities are clinically essential. A Brier score of 0.027 means the model's stated confidence closely tracks actual outcomes — among cases where it reports 87% probability, roughly 87% are in fact positive, rather than the figure reflecting optimistic overconfidence. The calibration plot confirms near-diagonal reliability across the full probability range. This is what makes the uncertainty flags meaningful: the model's stated confidence is trustworthy.
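The Brier score itself is just the mean squared error between predicted probabilities and binary outcomes, which is why a value near zero implies calibrated confidence:

```python
import numpy as np

def brier_score(y_true, p_pred):
    # Mean squared difference between predicted probability and the
    # 0/1 outcome; 0 = perfect, 0.25 = uninformative 0.5 everywhere.
    y = np.asarray(y_true, float)
    p = np.asarray(p_pred, float)
    return float(np.mean((p - y) ** 2))
```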

Subgroup performance — sex and acquisition site

ABIDE-I is 85% male. Sex imbalance is a known limitation documented in the Model Card. Site variance is clinically significant and drives the most important operational constraint: site-specific calibration would be required before production deployment.

Sensitivity by Sex
Male
95.8%
Female
93.4%
Sensitivity by Site (selected)
UM_1
98.5%
OLIN
97.9%
NYU
95.2%
PITT
88.5%

Site variance is the most operationally significant finding. PITT sensitivity of 88.5% vs UM_1 at 98.5% is a 10 percentage point gap driven by scanner heterogeneity — different field strengths, voxel sizes, and acquisition protocols. The deployed system displays a site-specific reliability badge inline on every output. In a production deployment, site-specific calibration layers or harmonisation preprocessing (ComBat or equivalent) would be required before clinical use.


Explainability

Dual-method spatial explanation — not just a number

GradCAM and LIME answer independent questions about the same prediction; their outputs are then cross-validated for consistency.

A clinical AI system that produces a probability without spatial justification is not a decision support system — it is a black box that clinicians cannot interrogate or trust. Every output from this system includes at minimum two independent explanation methods and an explicit uncertainty quantification that distinguishes genuine confidence from model overconfidence.

Gradient-Weighted Class Activation Mapping (GradCAM)

Backpropagates classification score gradients to the final convolutional layer (conv5). Produces a spatial heatmap showing which axial-slice regions contributed most to the ASD classification. Computationally fast (<1s per slice). Documented limitation: gradient saturation at extreme confidence levels produces flat heatmaps — at 100% confidence, backpropagated gradients approach zero. The system flags this condition and falls back to LIME as the primary explanation source in high-confidence cases.
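The deployed system uses the pytorch-grad-cam library, but the underlying computation is compact enough to sketch by hand — gradients of the score are averaged per channel and used to weight the target layer's activation maps. The single-logit model assumption and the hook-based capture below are illustrative choices, not the project's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, target_layer, x):
    """Hand-rolled GradCAM sketch; assumes a single-logit model output."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    score = model(x)
    model.zero_grad()
    score.sum().backward()               # gradients of the class score
    h1.remove(); h2.remove()
    # Channel weights = spatially averaged gradients (the "gradient-weighted" part)
    w = grads["g"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((w * feats["a"]).sum(dim=1))   # weighted sum of activation maps
    return cam / (cam.amax() + 1e-8)            # normalise to [0, 1]
```

Note the saturation failure mode described above falls out of this formula directly: at 100% confidence the score's gradients approach zero, so `w` — and hence the whole map — flattens.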

LIME — Local Interpretable Model-Agnostic Explanations

Segments the MRI slice into anatomical superpixels. Trains a local linear surrogate over 500 perturbed samples to identify which regions causally drive the prediction. Slower (~20s/slice on CPU). Independent of gradient flow — provides complementary causal evidence. Agreement between GradCAM and LIME across the top-25% activated regions (IoU) is computed and reported. Divergence between the two methods is itself a clinically meaningful signal flagged in the report.

MC-Dropout Uncertainty Quantification

Dropout2d layers remain active at inference. 30 stochastic forward passes produce a distribution over P(ASD), reported as mean ± σ. σ < 0.02 = low uncertainty (reliable prediction). σ > 0.08 = review flag. Critical finding: false negatives show σ = 0.150 (uncertainty correctly flags the miss). False positives show σ = 0.005 — the dangerous mode where the model is confidently wrong. Documented in the Model Card as a known failure mode requiring human override protocol.
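The mechanism — dropout kept stochastic at inference while the rest of the network stays in eval mode — can be sketched as below. The sigmoid over a single logit is an assumption about the model's output head, and the demo thresholds come from the text above.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, passes=30):
    """Mean and std of P(ASD) over stochastic forward passes.

    Assumes a single-logit output head (sigmoid applied here).
    """
    model.eval()                          # BatchNorm etc. stay deterministic
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()                     # only dropout stays stochastic
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(passes)])
    return probs.mean(dim=0), probs.std(dim=0)
```

A caller would then compare the returned σ against the 0.02 / 0.08 bands to decide whether to display the review flag.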

Anatomical Region Labelling + Agreement Analysis

GradCAM centre-of-mass coordinates mapped to approximate anatomical region via heuristic z-position lookup (cerebellum/brainstem → temporal/basal ganglia → parietal/temporal → frontal). Labelled as heuristic in the clinical report. GradCAM-LIME cross-validation: Pearson r and IoU of top-25% activated regions computed across the test set. TC-correct predictions: r = 0.56, IoU = 0.44. FN case: r = −0.29 (anti-correlated explanation — high-priority clinical flag).
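The two agreement statistics reported above can be computed in a few lines — Pearson r over the raw heatmap values, and IoU over the binarised top-25% regions:

```python
import numpy as np

def explanation_agreement(map_a, map_b, top_frac=0.25):
    """Pearson r and top-region IoU between two saliency maps."""
    a, b = np.ravel(map_a).astype(float), np.ravel(map_b).astype(float)
    r = float(np.corrcoef(a, b)[0, 1])            # correlation of raw values
    ta = a >= np.quantile(a, 1 - top_frac)        # top-25% activated pixels
    tb = b >= np.quantile(b, 1 - top_frac)
    iou = float((ta & tb).sum() / (ta | tb).sum())
    return r, iou
```

A strongly negative r, as in the false-negative case cited above, means the two explainers disagree about which regions matter — the condition the report escalates.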


Governance & Regulatory Framing

Built for how clinical AI actually gets cleared

Model Card, regulatory pathway, and failure mode documentation built in — not added as an afterthought.

Regulatory pathway (current status)

This system falls into the SaMD Class II category under FDA Digital Health Center of Excellence guidance — software that analyses medical images to inform clinical decisions regarding a neurodevelopmental condition. De Novo pathway would likely be required given the novel indications-for-use in ASD neuroimaging.

Current status: research-grade only. Prerequisites before clinical submission: prospective multi-site validation on data independent of ABIDE-I, subject-level performance validation (not slice-level), neuroradiologist review of XAI saliency map anatomical claims, and a complete clinical evidence package. The system is designed with this pathway in mind — the documentation, subgroup analysis, and failure mode characterisation are structured to feed a regulatory submission, not just a conference paper.

What the Model Card documents

Intended use: research decision support only. Out-of-scope uses: standalone diagnostic decisions, non-structural MRI modalities, paediatric populations under 5. Training data limitations: ABIDE-I is retrospective, 85% male, and restricted to 17 academic sites. Known failure modes: confident false positives (σ near zero), GradCAM saturation at high confidence, site-specific performance gaps up to 10%. Sex performance gap: 2.4 percentage points, driven by training imbalance. Site reliability warnings are displayed inline on all outputs. Regulatory status is included verbatim in every exported PDF report.

The system is designed so that any clinician who downloads the report understands exactly what they are and are not holding. The PDF includes the regulatory framing, the model's sensitivity/specificity at the acquisition site, the uncertainty estimate, and an explicit statement that the output is decision support — not diagnosis. This is not a disclaimer added at the end; it is the primary purpose of the output layer.


Origins

The 2023 B.Tech foundation — and why this became something more

The original project established the direction. The 2026 rebuild asked what it would take for that work to matter clinically.

Phase 1 — 2023 · B.Tech Capstone
Manipal Institute of Technology · Under Dr. J. Andrew, Dept. of Computer Science
Systematic comparison of CNN and Vision Transformer architectures for ASD classification — 1,067 ABIDE-I subjects, three CNN variants, three optimisers, custom ViTs in PyTorch and Keras.
91%
CNN Test Accuracy
Skip-CNN + NAdam
0.98
CNN AUC
Best configuration
~60%
ViT Accuracy
Scratch-trained · near random
1,067
Subjects
17 international sites
🧠
CNNs vs ViTs at clinical data scale
The ViT trained from scratch plateaued barely above chance (~60%) while the 5-layer CNN reached 91% on the same 1,067-subject dataset. At this data scale, inductive biases matter: CNNs assume local spatial structure, which maps directly onto neuroanatomy. ViTs must learn spatial relationships from data alone — 1,067 subjects is insufficient for this. This shaped every architecture decision in the 2026 system.
📊
Architecture selection > hyperparameter tuning
All three optimisers (Adam, NAdam, RMSprop) produced results within 1% of each other within the same architecture. Adding skip connections to the CNN improved AUC from 0.97 to 0.98. The empirical takeaway: at this data scale, architecture choice has larger impact than optimiser selection — a lesson that generalises to most clinical AI deployment scenarios with limited labelled data.
🔬
Multi-site scanner heterogeneity — first encounter
ABIDE-I aggregates 17 sites with different scanners, protocols, and field strengths. Training without site normalisation introduces potential scanner bias — the model may learn acquisition artefacts rather than neurobiological features. First encounter with the problem that dominates real-world clinical AI deployment and motivated the site-stratified subgroup analysis in the 2026 system.
The open threads that became the 2026 system
LIME was set up but visualisation was incomplete. GradCAM was explored but not retained. The B.Tech project correctly identified explainability as the most critical open problem — then ran out of scope, GPU budget, and time. The 2026 rebuild is what happened when those threads were picked up with graduate-level tooling and a specific question: what does it take for a clinician to actually trust the output?
Phase 2 — 2026 · Graduate Independent Work
University of North Texas · MS AI (Biomedical Concentration)
Post-coursework independent rebuild — clinical pipeline, dual XAI, uncertainty quantification, Model Card, deployed system.
Capability | 2023 B.Tech Baseline | 2026 Clinical System
Input format | PNG slices (pre-extracted) | NIfTI volumes (.nii, .nii.gz)
Data quality control | Raw slices including blank frames | 4-metric automated quality gate · ~6.5% removed
Evaluation metrics | Accuracy + AUC only | + AUPRC, Brier score, calibration plot, threshold analysis
Explainability | LIME attempt (incomplete) | GradCAM + LIME + spatial agreement cross-validation
Uncertainty quantification | None | MC-Dropout, 30 stochastic passes, σ reported on every output
Subgroup analysis | None | Sex-stratified + site-stratified · documented in Model Card
Clinical output layer | None | LLM clinical narrative, site reliability badge, PDF report
Deployment | Kaggle notebook | Streamlit app · HuggingFace Spaces · always-on
Governance | None | Model Card, regulatory framing, failure mode analysis

Technical stack

PyTorch 2.11 · Streamlit 1.56 · nibabel · pytorch-grad-cam · lime · anthropic (Claude Haiku) · ReportLab · scikit-learn · scikit-image · OpenCV · matplotlib · pandas · HuggingFace Spaces · ABIDE-I Dataset

The gap between a model that achieves 99.4% AUC in a notebook and a system a clinician would trust is not a performance gap. It is an engineering gap — explainability, uncertainty quantification, calibration verification, site reliability documentation, and a clinical report that states exactly what the result is and what it is not.

— The central design question this project answers

Upload your own NIfTI scan. See the full pipeline.

The live system accepts any structural MRI in NIfTI format. All ABIDE-I demo subjects are pre-loaded. Full analysis notebooks, source code, and Model Card are on GitHub.

Live Demo → 2026 Repo ↗ 2023 B.Tech ↗