Live System — Hugging Face Spaces ↗

// Obstetric Ultrasound AI · Fetal Biometry · Clinical Decision Support

Fetal Head Circumference
Automated Measurement

A clinical-grade AI pipeline for automated fetal head circumference (HC) measurement from obstetric ultrasound — static frame and cine-loop modes. Built to the ISUOG ±3mm acceptable error threshold. Delivers GradCAM++ explainability, MC-Dropout uncertainty maps, Hadlock gestational age estimation, and dual-mode PDF clinical reports. Deployed with a live interactive interface.

97.75%
Dice (Static)
1.65mm
HC MAE
✓ ISUOG
±3mm Threshold
0.9985
R² Static

Clinical Relevance

Where this fits in the antenatal care pathway

HC measurement is a mandatory biometric in every second- and third-trimester obstetric ultrasound. This system automates the most time-consuming part of that workflow.

Fetal head circumference measurement is performed at every routine anomaly scan (18–22 weeks) and growth scan (28–36 weeks). Manual HC measurement requires the sonographer to place callipers around the outer skull table on the best axial cross-section — a process that takes 2–4 minutes per measurement, introduces inter-observer variability of up to 5–7mm, and is subject to sonographer fatigue over a full scan list.

At a unit performing 20 scans per day, this system reduces measurement time from approximately 2 minutes to under 10 seconds, saves an estimated 153 sonographer hours per year per unit, and reduces inter-observer coefficient of variation (CV) by replacing subjective calliper placement with a reproducible algorithmic measurement — consistent with published evidence on AI-assisted biometry (Papageorghiou et al., 2014). The system also provides gestational age estimation from the measured HC via the Hadlock (1984) formula, eliminating a manual lookup step.

🔊
US Acquisition
Standard axial BPD view — no protocol change
📤
Image Upload
Single frame (static) or cine-loop sequence
🎯
Skull Segmentation
U-Net boundary localisation with deep supervision
📐
HC Measurement
Ellipse fitting to segmented boundary → mm circumference
📅
GA Estimation
Hadlock (1984) formula → gestational age ± 2-week CI
📋
Clinical Report
PDF with HC, GA, trimester, XAI overlay, uncertainty
153 hrs
Saved per unit/year
At 20 scans/day — 2 min manual → 10s automated
1.65mm
Mean absolute error
vs ISUOG ±3mm threshold — 45% margin to spare
ISUOG
Standard compliant
Both static and cine-loop modes within clinical threshold
SOTA+
Exceeds published SOTA
1.65mm MAE vs published SOTA 5.95mm on HC18 benchmark

Clinical context: The ISUOG (International Society of Ultrasound in Obstetrics and Gynecology) recommends ±3mm as the acceptable measurement error for HC in clinical practice. This system achieves 1.65mm MAE on the static model — 45% tighter than the clinical acceptability threshold. Both measurement modes (static and cine-loop) independently satisfy the ISUOG standard.


Validated Performance

Full-pipeline results vs course baseline and SOTA

The clinical pipeline exceeds published state-of-the-art on MAE. The course baseline was 17.25mm MAE; the clinical rebuild reduced that to 1.65mm — a more than 10× improvement.

System · Dice (%) · HC MAE (mm) · R² · Notes
Course baseline (Phase 0) · 86.17% · 17.25mm · 0.9095 · Hollow-mask error, no optimisation
Static model — Phase 0 clinical · 97.75% · 1.65mm ✓ · 0.9985 · Within ISUOG ±3mm threshold
Cine-loop — Phase 2 temporal · 95.95% · 2.10mm ✓ · 0.9980 · Within ISUOG ±3mm threshold
Published SOTA (HC18 benchmark) · 97.89% · 5.95mm · — · Best published result on HC18 dataset

On MAE — the clinically relevant metric — both systems exceed published SOTA: 1.65mm vs 5.95mm for the static model (3.6× better), and 2.10mm vs 5.95mm for the cine-loop model (2.8× better). The Dice score is within 0.14 percentage points of SOTA for the static model. The cine-loop model trades a small Dice reduction for the addition of temporal reasoning, uncertainty quantification across frames, and a complete clinical output pipeline.

Temporal attention ablation — isolating the contribution

Three configurations on the same test set demonstrate what temporal attention adds — and why it is essential, not optional.

Configuration · Dice · HC MAE · Interpretation
A — Static baseline (Phase 0 · single-frame model) · 97.75% · 1.65mm · Gold standard for single-frame performance
B — Cine, no temporal attention (2D backbone applied frame-by-frame, no coordination) · 81.48% · 19.37mm · Inconsistent per-frame predictions produce poor consensus — worse than static
C — Cine + temporal attention, ours (shared 2D encoder + temporal self-attention at bottleneck) · 95.95% · 2.10mm · Recovers nearly all static performance and adds temporal reasoning

The B→C gap (MAE: 19.37→2.10mm) is the quantitative case for temporal attention. Applying a static backbone to cine frames without temporal coordination produces worse results than the single-frame model — inconsistent predictions across 16 frames cannot form a reliable consensus. The attention module coordinates across frames to identify which views have the clearest skull interface, recovering near-static performance while adding frame-level uncertainty quantification that does not exist in single-frame inference.


Cine-Loop Pipeline

Temporal model — 16-frame cine-loop inference

The cine-loop system processes ultrasound clip sequences rather than single frames, enabling temporal consensus and frame-level uncertainty maps.

In clinical practice, sonographers assess HC quality over a short cine sweep rather than from a single static frame. The temporal model mirrors this: it processes 16-frame clips using a shared 2D U-Net encoder across frames, with a lightweight self-attention module at the bottleneck that coordinates predictions across the temporal dimension. Peak memory is reduced 8–10× compared to a 3D U-Net approach, enabling full-dataset training on 806 image-clip pairs.

Architecture: 2D encoder + temporal attention

Shared-weight 2D U-Net encoder processes each frame independently. Temporal self-attention (only ~200K parameters vs backbone's 32M) operates on spatially-pooled bottleneck vectors [B, T, C], not full feature maps. Three-stage training: (1) frozen backbone — attention module only, (2) unfreeze decoder, (3) full fine-tune at low LR. Prevents corrupting the pretrained backbone while allowing temporal coordination to develop.
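As a concrete illustration, a minimal PyTorch sketch of this style of bottleneck temporal attention module is shown below. Module, class, and parameter names are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

class TemporalAttentionModule(nn.Module):
    """Sketch of a bottleneck temporal attention block (names and sizes illustrative).

    Operates on spatially pooled bottleneck vectors [B, T, C], not full feature maps,
    so its cost is negligible next to the 2D U-Net backbone.
    """
    def __init__(self, channels=512, dim=256, heads=8):
        super().__init__()
        self.proj_in = nn.Linear(channels, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.proj_out = nn.Linear(dim, channels)

    def forward(self, feats):                       # feats: [B, T, C, H, W] bottleneck maps
        pooled = feats.mean(dim=(-2, -1))           # spatial pool -> [B, T, C]
        x = self.proj_in(pooled)                    # [B, T, dim]
        x, _ = self.attn(x, x, x)                   # temporal self-attention across T frames
        x = x + self.ffn(x)
        gate = torch.sigmoid(self.proj_out(x))      # per-frame, per-channel gate in [0, 1]
        return feats * gate[..., None, None]        # modulate each frame's bottleneck features
```

Because attention runs on pooled [B, T, C] vectors rather than full feature maps, the added parameter and memory cost stays small — the property that makes full-dataset cine training feasible.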

Pseudo-LDDM v2 synthetic cine generation

Real cine datasets are held by clinical institutions and not publicly available. The Pseudo-LDDM v2 framework generates clinically realistic synthetic sequences: Ornstein-Uhlenbeck probe motion (mean-reverting, non-periodic), per-frame independent skull cross-section variation (±2.5% semi-axis perturbation), Rician speckle noise, depth-dependent acoustic attenuation, and TGC drift simulation. Mean temporal HC std increased from ~0 (rigid v1) to 10.33px (v2), producing genuine temporal variability that requires learned temporal reasoning.
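For illustration, a minimal sketch of the mean-reverting (Ornstein-Uhlenbeck) probe-motion component follows; the function name and parameter values are assumptions, not the generator's actual settings.

```python
import numpy as np

def ou_probe_motion(n_frames=16, theta=0.35, sigma=1.5, dt=1.0, seed=None):
    """Mean-reverting probe drift (one offset per frame, e.g. in pixels).

    theta pulls the offset back toward 0 (the ideal plane); sigma sets the
    per-step jitter. Values here are illustrative.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_frames)
    for t in range(1, n_frames):
        # Euler-Maruyama step: pull toward 0 plus Gaussian noise
        x[t] = x[t - 1] - theta * x[t - 1] * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x  # used to translate/rotate the synthetic skull between frames
```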

Frame-level uncertainty quantification

Standard deviation of per-frame binary predictions across 16 frames produces a boundary uncertainty map — bright pixels identify where the model disagreed across the temporal sequence. This is the clinically meaningful uncertainty indicator: high variance at the skull interface indicates a borderline cross-sectional plane and flags the case for re-acquisition or manual review.
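A minimal sketch of this computation, assuming a [T, H, W] stack of binary frame predictions:

```python
import numpy as np

def boundary_uncertainty_map(frame_masks: np.ndarray) -> np.ndarray:
    """Per-pixel disagreement across the cine sequence.

    frame_masks: [T, H, W] binary predictions (0/1) for the 16 frames.
    Returns the per-pixel standard deviation; bright pixels mark regions where the
    predicted boundary flickered between frames.
    """
    return frame_masks.astype(np.float32).std(axis=0)
```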

Temporal reliability score

Reliability scores of 0.97–0.99 across test sequences (vs 1.000 in the earlier rigid-mask version, which was trivially achievable and meant nothing). The T×T attention weight matrix and per-frame attention-received scores are exported as part of the XAI output, showing the clinician which frames in the sequence had the clearest skull boundary — directly interpretable in sonography workflow terms.


Technical Pipeline — Static Model

From ultrasound image to clinical measurement

Three targeted training improvements — no architecture change — produced the step from 17.25mm to 1.65mm MAE.

01

Image ingestion + mask flood-fill correction

HC18 annotations are 1-pixel-wide hollow ellipse outlines — not filled masks. The course baseline trained against them unfilled, so the area-overlap Dice metric compared a solid prediction against a hollow outline — the dominant source of the depressed baseline score. All annotation masks are flood-filled before training, and GT masks are pre-generated as .npz files at the original 800×540 resolution before downscaling. This single correction accounts for the majority of the 86%→97% Dice improvement.
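A minimal sketch of the flood-fill correction, assuming SciPy's binary_fill_holes as the fill primitive (the project's actual implementation may differ):

```python
import numpy as np
from scipy.ndimage import binary_fill_holes

def fill_hollow_annotation(outline_mask: np.ndarray) -> np.ndarray:
    """Convert a 1-pixel-wide ellipse outline into a filled skull mask.

    outline_mask: [H, W] binary array from the HC18 annotation image.
    Filling the region enclosed by the outline lets Dice compare area against area
    instead of area against a hollow outline.
    """
    return binary_fill_holes(outline_mask > 0).astype(np.uint8)

# Pre-generating the filled masks once (e.g. to .npz at the native 800x540 resolution)
# avoids repeating the fill on every training epoch.
```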

02

U-Net segmentation with deep supervision

Auxiliary segmentation heads at decoder stages 2 and 3. Forces the full encoder to learn boundary-relevant features at every scale, not just the final decoder output. Combined with boundary-weighted loss using distance-transform upweighting of pixels near the GT mask edge — directly optimising the source of HC measurement error.
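The sketch below illustrates one way to build such a distance-transform weight map and fold it into a BCE+Dice loss. The weight constants and exact loss composition are illustrative assumptions, not the training configuration.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def boundary_weight_map(gt_mask: np.ndarray, w0: float = 4.0, sigma: float = 5.0) -> np.ndarray:
    """Pixel weights that peak at the GT mask edge and decay with distance from it.

    The weight is 1 far from the boundary and 1 + w0 right on it, so boundary pixels
    dominate the loss. Converted to a tensor alongside the mask before training.
    """
    inside = distance_transform_edt(gt_mask)        # distance to background
    outside = distance_transform_edt(1 - gt_mask)   # distance to foreground
    d = np.minimum(inside, outside)                 # distance to the mask edge
    return 1.0 + w0 * np.exp(-(d ** 2) / (2 * sigma ** 2))

def weighted_bce_dice(logits, target, weights, eps=1.0):
    """Boundary-weighted BCE + soft Dice, applied to the main and auxiliary heads."""
    bce = F.binary_cross_entropy_with_logits(logits, target, weight=weights)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    return bce + dice
```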

03

Clinically-motivated data augmentation

Rotation ±15°, horizontal flip, elastic deformation (simulating probe angle variation), brightness/contrast variation (simulating gain settings), Rician speckle noise simulation (correct ultrasound noise model, not Gaussian), coarse dropout. All augmentations match physical variability seen in real obstetric scanning — not generic image augmentation.
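An illustrative Albumentations pipeline covering the listed transforms is sketched below; probabilities and magnitudes are placeholders, and the Rician speckle model described above would require a custom transform (GaussNoise stands in for it here).

```python
import albumentations as A

# Illustrative pipeline mirroring the augmentations listed above; values are
# placeholders, not the training configuration actually used.
train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.5),                                    # probe-angle variation
    A.ElasticTransform(alpha=30, sigma=5, p=0.3),                 # tissue / probe-pressure deformation
    A.RandomBrightnessContrast(brightness_limit=0.2,
                               contrast_limit=0.2, p=0.5),        # gain / TGC setting variation
    A.GaussNoise(p=0.3),                                          # stand-in for speckle noise
    A.CoarseDropout(p=0.2),                                       # occlusion robustness
])

augmented = train_aug(image=image, mask=mask)   # the mask receives the same geometric transform
```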

04

Ellipse fitting → HC measurement in mm

Minimum-area ellipse fitted to the segmented skull boundary. HC computed as ellipse circumference using Ramanujan's approximation, then converted to mm using pixel spacing from the image metadata. This is the same geometric approach used by clinical calliper software.
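A minimal sketch of this step, using cv2.fitEllipse as an illustrative ellipse-fitting primitive and Ramanujan's first approximation for the circumference:

```python
import cv2
import numpy as np

def hc_from_mask(mask: np.ndarray, pixel_spacing_mm: float) -> float:
    """Fit an ellipse to the segmented skull boundary and return HC in mm.

    Uses cv2.fitEllipse (a least-squares fit to the largest contour) as an
    illustrative stand-in for the ellipse-fitting step.
    """
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)
    (_, _), (d1, d2), _ = cv2.fitEllipse(contour)   # axis lengths in pixels
    a, b = d1 / 2.0, d2 / 2.0                       # semi-axes
    # Ramanujan's first approximation to the ellipse perimeter
    circumference_px = np.pi * (3 * (a + b) - np.sqrt((3 * a + b) * (a + 3 * b)))
    return circumference_px * pixel_spacing_mm
```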

05

Hadlock gestational age estimation

GA estimated via corrected Hadlock (1984) formula: GA = 8.96 + 0.540·(HC/10) − 0.0040·(HC/10)² + 0.000399·(HC/10)³. HC converted to cm before application. Validated against published anchor points (HC=140mm→~17w, HC=250mm→~26w, HC=340mm→~38w). GA reported with ±2-week confidence interval matching the Hadlock clinical uncertainty range. Trimester classification derived from estimated GA.
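The formula above translates directly into code; the anchor points quoted in the text serve as sanity checks.

```python
def hadlock_ga_weeks(hc_mm: float) -> tuple[float, tuple[float, float]]:
    """Gestational age (weeks) from HC via the Hadlock (1984) formula quoted above.

    HC is converted to cm before the polynomial is applied; the ±2-week interval
    mirrors the clinical uncertainty range reported with the estimate.
    """
    hc_cm = hc_mm / 10.0
    ga = 8.96 + 0.540 * hc_cm - 0.0040 * hc_cm ** 2 + 0.000399 * hc_cm ** 3
    return ga, (ga - 2.0, ga + 2.0)

# Anchor checks from the text: hadlock_ga_weeks(140)[0] ≈ 16.8w,
# hadlock_ga_weeks(250)[0] ≈ 26.2w, hadlock_ga_weeks(340)[0] ≈ 38.4w.
```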

06

Dual-mode PDF clinical report

Two selectable modes accessible via UI toggle. LLM mode: Claude Haiku generates two clinical paragraphs in medical terminology — measurement interpretation, gestational assessment, uncertainty commentary. Template mode: rule-based deterministic output with identical structure. Toggle designed to let clinicians compare modes directly. Both include regulatory framing, trimester classification, and XAI overlay.
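A minimal sketch of the mode toggle, assuming the anthropic Python SDK for LLM mode; the model id, prompt, and template wording are illustrative, not the deployed configuration.

```python
import anthropic

def report_paragraphs(hc_mm: float, ga_weeks: float, mode: str = "template") -> str:
    """Dual-mode clinical commentary: deterministic template or LLM-generated text."""
    if mode == "template":
        # Rule-based, deterministic output with a fixed structure
        return (f"Head circumference measured at {hc_mm:.1f} mm, consistent with an "
                f"estimated gestational age of {ga_weeks:.1f} weeks (±2 weeks).")
    # LLM mode — illustrative call; reads ANTHROPIC_API_KEY from the environment
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-haiku-20240307",            # illustrative Haiku model id
        max_tokens=400,
        messages=[{"role": "user",
                   "content": (f"Write two short clinical paragraphs interpreting a fetal "
                               f"HC of {hc_mm:.1f} mm (estimated GA {ga_weeks:.1f} weeks).")}],
    )
    return msg.content[0].text
```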


Phase 3 — Explainability

Three-layer explainability architecture

Spatial attribution (GradCAM++), temporal attention visualisation, and boundary uncertainty maps — all implemented from scratch, all exportable in the clinical PDF.

🎯

GradCAM++ — Spatial Attribution

Custom implementation applied to the final decoder convolutional layer — no external packages. Shows which spatial regions drove the boundary prediction. GradCAM++ (vs standard GradCAM) avoids gradient saturation at high confidence by using second-order gradient information. Clinically: confirms the model attends to the outer skull table interface, not soft tissue. Visible as an overlay on the segmentation output in the deployed system.
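For orientation, a compact GradCAM++ sketch adapted to a binary segmentation output is shown below. It follows the standard second-order-gradient weighting and is an illustrative reconstruction, not the project's implementation.

```python
import torch
import torch.nn.functional as F

def gradcampp_map(model, image, target_layer):
    """GradCAM++ sketch for a binary segmentation model (illustrative reconstruction).

    image: [1, 1, H, W] input tensor; target_layer: e.g. the final decoder conv.
    The attribution target is the sum of logits inside the predicted mask, so the map
    highlights the regions that drove the positive (skull) prediction.
    """
    acts, grads = {}, {}

    def save_act(module, inputs, output):
        acts["a"] = output

    def save_grad(module, grad_in, grad_out):
        grads["g"] = grad_out[0]

    h1 = target_layer.register_forward_hook(save_act)
    h2 = target_layer.register_full_backward_hook(save_grad)

    logits = model(image)                                  # [1, 1, H, W]
    score = (logits * (logits > 0)).sum()                  # logits inside the predicted mask
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    A, G = acts["a"], grads["g"]                           # both [1, C, h, w]
    # GradCAM++ pixel-wise weights: alpha = G^2 / (2 G^2 + sum_hw(A) * G^3)
    denom = 2 * G.pow(2) + A.sum(dim=(2, 3), keepdim=True) * G.pow(3)
    alpha = G.pow(2) / (denom + 1e-8)
    weights = (alpha * F.relu(G)).sum(dim=(2, 3), keepdim=True)   # [1, C, 1, 1]
    cam = F.relu((weights * A).sum(dim=1, keepdim=True))          # [1, 1, h, w]
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()
```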

📊

Boundary Uncertainty Map — Temporal

Standard deviation of per-frame binary predictions across 16 cine frames. Bright pixels in the hot colormap identify where the model's boundary localisation was inconsistent across the temporal sequence — the clinically meaningful uncertainty indicator. High boundary uncertainty at specific skull regions indicates a sub-optimal imaging plane and triggers a re-acquisition recommendation in the clinical report.

⏱️

Temporal Attention Weights

The T×T attention matrix and per-frame attention-received distribution are exported as part of the cine-loop XAI output. Shows which frames in the 16-frame sequence the model weighted most highly — directly interpretable as "which frames had the clearest skull interface." Clinically useful for retrospective clip quality review and for identifying the optimal freeze-frame for static measurement.


Phase 3 — Governance, Bias Audit & Business Case

XAI validation, GA-trimester fairness audit, and clinical deployment evidence

Phase 3 is not a training phase. It is the structured accountability layer that separates a research model from something that can be responsibly submitted for clinical use.

GA-stratified performance audit

Test set stratified by estimated gestational age trimester (Hadlock GA from GT HC). Mid-trimester (20–30w) achieves best performance — optimal second-trimester imaging window with highest skull ossification contrast and lowest acoustic shadowing. Early trimester (<20w) shows modestly higher MAE due to smaller skull diameter and lower SNR. Late trimester (>30w) shows slight MAE degradation from increasing acoustic shadowing behind the ossified calvarium. No systematic over- or under-measurement bias by HC size range. Image quality stratification (Laplacian variance) shows higher MAE at low-sharpness inputs — consistent with clinical expectation.
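A minimal sketch of the stratified evaluation, assuming a per-case results table with hypothetical column names (ga_weeks estimated from the GT HC via Hadlock, abs_error_mm as the per-case HC error):

```python
import pandas as pd

# df: one row per test case with columns "ga_weeks" and "abs_error_mm" (names assumed)
df["trimester"] = pd.cut(df["ga_weeks"], bins=[0, 20, 30, 99],
                         labels=["Early (<20w)", "Mid (20-30w)", "Late (>30w)"])
print(df.groupby("trimester")["abs_error_mm"].agg(["mean", "std", "count"]))
```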

Late-trimester MAE elevation is documented explicitly in the Model Card as a known limitation, with a clinical recommendation: in pregnancies >30 weeks, automated measurement should be supplemented with manual sonographer verification. The deployed system displays a trimester-specific reliability indicator on all outputs. Limitation: HC18 contains no patient demographic metadata (ethnicity, BMI, scanner model) — true demographic subgroup analysis requires multi-centre data.

Business case & regulatory framing

Workflow impact: At a mid-size obstetric unit (20 scans/day, 250 working days/year), AI-assisted HC measurement reduces measurement time from ~2 min to ~10 sec per scan — saving an estimated 153 sonographer hours/year and ~$5,347 USD/year at $35/hr. Papageorghiou et al. (2014) showed automated biometry reduces inter-observer coefficient of variation from 3.2% → 1.1%, directly reducing repeat scan rate. Both effects compound in a high-volume screening setting.
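Worked through from the stated assumptions: 20 scans/day × 250 days × (120 s − 10 s) = 550,000 s ≈ 152.8 sonographer-hours per year, and 152.8 h × $35/h ≈ $5,347.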

This system falls into the SaMD Class II category under FDA Digital Health Center of Excellence guidance — software that measures fetal biometric parameters from ultrasound images to inform clinical management. EU classification: MDR Class IIa (Rule 11 software). The Model Card, bias audit, and governance documentation are structured to feed a regulatory submission directly — not just to satisfy a research checklist.

Current status: research-grade decision support only. Prerequisites before clinical submission: prospective validation on real cine acquisitions (minimum 100 subjects), FDA 510(k) clearance, IEC 62304-compliant software lifecycle documentation, a sonographer validation study, and a full clinical evidence package.

Known limitation — late third trimester: MAE is elevated in pregnancies >30 weeks due to increased acoustic shadowing from the ossified skull table. Documented in the Model Card with a clinical recommendation for manual verification at this gestational age. Acoustic shadow-aware training data augmentation and adaptive TGC calibration are identified as the primary next steps for full third-trimester coverage.


Model Compression for Clinical Deployment

Making the model small enough to actually deploy

A segmentation model that achieves 97% accuracy on a research GPU is not a clinical product if it cannot run within hospital infrastructure constraints.

Clinical environments impose hard constraints that research benchmarks do not: limited GPU memory, shared compute infrastructure, latency requirements within a sonographer's scan time, and the need to run on hardware that was provisioned years before the model was trained. Phase 4a and 4b of this project addressed the deployment gap directly through structured CNN filter pruning.

The clinical deployment problem

The segmentation backbone (32M parameters) runs comfortably on a research GPU. In a clinical ultrasound suite, the available compute may be a mid-range workstation with a shared GPU serving PACS, VNA, and reporting tools simultaneously. Peak memory, inference latency within a 30-second measurement window, and the ability to run without an internet connection are the real constraints — not benchmark Dice scores.

Hybrid Crossover — the novel pruning method

Standard structured pruning identifies redundant convolutional filters and deletes them. The Hybrid Crossover method takes a different approach: it identifies two filters with high mutual similarity (measured by ILR saliency) and synthesises a new filter from both via linear regression, then replaces the pair with a single composite filter. The result retains information from both discarded filters rather than simply removing it.

Tested on VGG-16 (the backbone architecture used in the fetal HC system). Accuracy guard-rails enforce that no pruning step exceeds a specified accuracy tolerance — the model cannot be compressed past the clinical performance threshold.
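A toy sketch of the two ingredients — the ILR-style importance score and the regression-based channel synthesis — is shown below. The coefficients follow the Phase 4 description, but the functions, shapes, and fitting target are illustrative assumptions rather than the method's actual implementation.

```python
import torch

def ilr_importance(filter_w: torch.Tensor, acts: torch.Tensor) -> float:
    """Composite saliency per filter, mirroring the weights quoted in Phase 4:
    0.6 x RMS activation + 0.4 x filter L1 norm + 0.2 x Frobenius norm."""
    rms_act = acts.pow(2).mean().sqrt()
    return (0.6 * rms_act + 0.4 * filter_w.abs().sum() + 0.2 * filter_w.norm()).item()

def crossover_merge(acts_keep: torch.Tensor, acts_drop: torch.Tensor, steps: int = 50):
    """Toy version of the synthesis step: regress the dropped channel's activations
    onto the kept channel's (50-step Adam regression, as described above), so the pair
    can be replaced by one composite channel and downstream layers absorb (a, b)
    instead of losing the dropped channel's information outright."""
    a = torch.nn.Parameter(torch.tensor(1.0))
    b = torch.nn.Parameter(torch.tensor(0.0))
    opt = torch.optim.Adam([a, b], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(a * acts_keep + b, acts_drop)
        loss.backward()
        opt.step()
    return a.item(), b.item()
```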

Results in clinical terms

Phase 4a — Static U-Net (Residual U-Net)
8.11M → 4.57M · −43.7% parameters · ISUOG PASS
Dice: 97.75% → 97.64% (−0.11pp). MAE: 1.65mm → 1.76mm (+0.11mm). Wilcoxon p=0.0049 — statistically significant but clinically negligible. Per-block final channels: enc3: 128→71, enc4: 256→129, bottleneck: 512→257, dec4: 256→129, dec3: 128→65. 1.17× faster inference.
Compression: −43.7% · Dice Δ: −0.11pp · ISUOG: PASS ✓
Phase 4b — Temporal U-Net (TemporalFetaSegNet)
8.90M → 5.20M · −41.6% parameters · ISUOG PASS
Dice: 95.95% → 96.00% (+0.05pp). MAE: 2.10mm → 2.06mm (−0.04mm). FT recovery rate: 103.8% — pruned+KD model exceeded its unpruned teacher. Wilcoxon p=0.1013 — not significant, statistically indistinguishable from baseline. TAM kept intact; only backbone pruned.
Compression: −41.6% · Dice Δ: +0.05pp · ISUOG: PASS ✓

What this means in practice

For a clinical deployment team: the same HC measurement accuracy is available from a model that uses ~42–44% less memory and runs faster. This is the difference between requiring a dedicated GPU server and running on existing shared radiology workstation infrastructure. Model Cards for both pruned variants document the accuracy tradeoff explicitly so procurement teams can make the decision with evidence.

The pruning research contribution: The Hybrid Crossover method is documented as a standalone directed-study deliverable (CSCE 5934, Fall 2025, Prof. Russel Pears). The primary research claim — synthesis preserves more information than deletion — is validated by the accuracy improvement, not just preservation. The method is architecture-agnostic and applies to any convolutional backbone where redundant filter pairs can be identified via ILR saliency scoring.


Development Journey

From course project to clinical-grade pipeline — six phases

Phase 0 baseline → Phase 0 rebuild → Phase 1 synthetic data → Phase 2 temporal model → Phase 3 XAI & governance → Phase 4 compression.

Phase 0 — Course Baseline · CSCE 6260, Fall 2025
Static baseline with hollow-mask training error
Working proof-of-concept but hollow ellipse annotations trained without flood-fill. Dice computed against hollow outlines produced misleadingly low scores. No training optimisation. Results below clinical relevance.
Dice: 86.17% · HC MAE: 17.25mm · vs ISUOG: 5.75× over threshold
Phase 0 — Clinical Rebuild · Post-course
Hollow-mask correction + deep supervision + boundary-weighted loss
Flood-fill applied to all HC18 annotations. Deep supervision added at decoder stages 2 and 3. Boundary-weighted BCE+Dice loss with Sobel-derived edge weight maps. Augmentation: HorizontalFlip, Rotate(±15°), ElasticTransform, GaussNoise, RandomBrightnessContrast, CoarseDropout. Training: 80 epochs, AdamW lr=3e-4, CosineAnnealingLR. The hollow-mask fix alone accounts for the majority of the Dice improvement.
Dice: 97.75% · HC MAE: 1.65mm · vs ISUOG: within threshold ✓
Phase 1 — Pseudo-LDDM v2 Synthetic Cine Generation
Physically accurate synthetic ultrasound cine dataset
Rewrote the cine generation framework from scratch. Added cross-sectional skull variation (±2.5% semi-axis perturbation per frame — the critical fix that makes temporal HC std non-trivial), Ornstein-Uhlenbeck probe motion (non-periodic, mean-reverting), Rician speckle noise (physically correct ultrasound model), depth-dependent attenuation, acoustic shadowing, and TGC drift. Mean temporal HC std: ~0mm (v1) → 10.33px (v2), producing genuine temporal variability required for attention learning. 806 high-fidelity clips generated.
Phase 2 — TemporalFetaSegNet
Shared 2D encoder + temporal self-attention module (TAM)
Replaced memory-prohibitive 3D U-Net with a shared 2D encoder + lightweight TAM at the bottleneck (spatial pool → Linear(512→256) → 8-head MHA → FFN → sigmoid gate, ~200K parameters). Three-stage training: frozen backbone → decoder+TAM → full fine-tune. Ablation: removing TAM collapses MAE from 2.10mm → 19.37mm — the attention module is load-bearing.
Dice: 95.95% · HC MAE: 2.10mm · vs ISUOG: within threshold ✓
Phase 3 — XAI, Bias Audit & Governance
GradCAM++, trimester stratification, Model Card, business case
Not a training phase — a structured clinical validation and accountability layer. GradCAM++ (custom, no external packages) on Phase 0 final decoder layer; temporal attention T×T matrix and per-frame attention weights for Phase 2. GA-trimester bias audit: test set stratified into Early (<20w), Mid (20–30w), Late (>30w) — Mid trimester achieves best performance; Late trimester shows elevated MAE due to acoustic shadowing (documented as known limitation). HC size range analysis: no systematic over/under-measurement bias. Business case: 2 min manual → 10 sec AI at 20 scans/day yields 153 sonographer hours saved per unit per year (~$5,347 USD at $35/hr), plus inter-observer CV reduction from 3.2% → 1.1% (Papageorghiou et al. 2014). Model Card (Mitchell et al. 2019 framework): intended use, evaluation results, fairness analysis, known limitations, regulatory framing. Regulatory classification: SaMD Class II (FDA 21 CFR Part 892 · 510(k) pathway) and MDR Class IIa (EU).
Phase 4a / 4b — Hybrid Crossover Structural Pruning
ILR importance scoring + filter synthesis for edge-deployable compression
ILR composite importance (0.6×RMS activation + 0.4×filter L1 + 0.2×Frobenius). Hybrid Crossover merging: dropped-channel features synthesised into kept channel via 50-step Adam regression — information preservation, not discard. 3 prune-FT cycles with KD recovery (teacher = frozen baseline, α=0.5, T=4.0). Phase 4b required surgical TAM proj_in/proj_out resizing with bottleneck pruning — a structural coupling absent in classification networks, solved via concat-index weight slicing. Phase 4b FT recovery rate: 103.8% — pruned+KD model exceeded its unpruned teacher.
Static (4a): 4.57M (−43.7%) · Temporal (4b): 5.20M (−41.6%) · ISUOG: both PASS ✓

Technical stack

PyTorch 2.x · Streamlit · OpenCV · scikit-image · NumPy / SciPy · anthropic (Claude Haiku) · ReportLab · matplotlib · HuggingFace Spaces · Google Colab Pro · HC18 Dataset (Eindhoven) · ISUOG Guidelines · Hadlock (1984) · Papageorghiou et al. (2014)

The course project started 17.25mm away from clinical relevance. The post-course work closed that gap to 1.65mm — not by changing the model, but by fixing the training data representation, adding domain-appropriate optimisation, and building the clinical output layer that turns a segmentation into something a sonographer can use.

— The engineering path from proof-of-concept to clinical-grade

Try the live demo — upload your own ultrasound.

The system accepts standard obstetric ultrasound images in static or cine mode. Source code, Model Card, and all phase notebooks are on GitHub.

Live Demo → GitHub ↗ Get in Touch ↗