Background & Clinical Motivation
Why smaller, faster models matter for clinical deployment
A model that achieves 97% accuracy but requires a dedicated GPU server is not a clinical product. This work addresses the deployment gap directly.
🌐 For everyone: When an AI model is trained to read medical images, it develops thousands of internal components called filters — each looking for a specific pattern like an edge, a texture, or a shape. Many filters are redundant: they detect nearly the same thing as another filter. This wastes memory, slows inference, and makes it harder to deploy AI on hospital workstations or wearable devices.
The obvious fix is to find redundant filters and delete them. But deletion is permanent — any information encoded by the removed filter is lost. This project asks: what if instead of deleting, we synthesise? What if we create a new filter that captures the best of both redundant ones, replacing both with a single synthesised one? Same compression — two become one — but far less information loss.
⚙️ Technical context: This work is directly motivated by the fetal head circumference deployment challenge in this portfolio. A U-Net achieving 97.36% Dice on a research GPU needs to run on existing hospital workstation hardware — which may have a shared GPU serving PACS, reporting tools, and the AI system simultaneously. Structured pruning produces architecturally smaller networks: fewer channels, fewer multiply-accumulate operations (MACs), real latency reductions — without sparse weight matrices that hardware cannot efficiently exploit.
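To make the "architecturally smaller" point concrete, here is a minimal sketch (function and variable names are illustrative, not the project's code) of why pruning a channel in one conv layer also shrinks the next layer, yielding genuinely smaller dense tensors rather than sparse masks:

```python
import torch
import torch.nn as nn

# Two stacked conv layers: pruning an output channel of conv1
# also removes the corresponding input channel of conv2.
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

def prune_output_channel(conv_a, conv_b, idx):
    """Remove output channel `idx` from conv_a and the matching
    input channel from conv_b, rebuilding both as smaller layers."""
    keep = [i for i in range(conv_a.out_channels) if i != idx]
    a = nn.Conv2d(conv_a.in_channels, len(keep), conv_a.kernel_size, padding=1)
    a.weight.data = conv_a.weight.data[keep].clone()
    a.bias.data = conv_a.bias.data[keep].clone()
    b = nn.Conv2d(len(keep), conv_b.out_channels, conv_b.kernel_size, padding=1)
    b.weight.data = conv_b.weight.data[:, keep].clone()
    b.bias.data = conv_b.bias.data.clone()
    return a, b

conv1_p, conv2_p = prune_output_channel(conv1, conv2, idx=10)
print(conv1_p.weight.shape)  # torch.Size([63, 3, 3, 3])
print(conv2_p.weight.shape)  # torch.Size([128, 63, 3, 3])
```

Every weight tensor stays dense, so standard convolution kernels run faster with no special sparse-hardware support.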
Origin — Ashwini Sharma's founding observation
This project is rooted in prior MS thesis work at UNT. Ashwini Sharma studied gender classification on CelebA using a custom 5-layer CNN and found that only a small subset of final-layer filters contributed meaningfully to predictions — the rest were largely redundant. That observation raised the question this directed study was designed to answer.
How the methodology evolved week by week
From IoU to Cosine Similarity
Initial filter similarity metric — IoU on thresholded activation maps — replaced with cosine similarity. Hierarchical clustering with medoid replacement introduced as the grouping strategy.
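The switch away from IoU removes the arbitrary thresholding step: cosine similarity operates directly on raw activation values. A minimal sketch (the activation vectors are illustrative) of the metric on two flattened activation maps:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened activation maps.
    Works on raw values, so no binarisation threshold is needed
    (unlike IoU on thresholded maps)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Two near-identical activation maps score close to 1.0
fm_a = [0.9, 0.1, 0.8, 0.2]
fm_b = [0.8, 0.2, 0.9, 0.1]
print(round(cosine_similarity(fm_a, fm_b), 3))  # 0.987
```

Filter pairs scoring above a similarity threshold become merge candidates for the hierarchical clustering step.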
The Stopping Criterion Problem
Central open problem identified: how do you decide when to stop pruning? Any fixed similarity threshold is arbitrary. Establishing a principled, data-driven stopping criterion became the key research challenge.
Correction-Probability Breakthrough
Merging decisions grounded in a new metric: does filter B actually correct the errors that filter A makes? Moving from geometric similarity to functional contribution — a fundamentally more meaningful criterion.
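One way to operationalise "does filter B correct filter A's errors" is the conditional probability P(B correct | A wrong) on a calibration set. This is a hedged sketch of that idea only — the function name, data layout, and exact formulation are assumptions, and the report's metric may differ:

```python
def correction_probability(errors_a, correct_b):
    """Fraction of calibration samples misclassified under filter A's
    contribution that filter B gets right: P(B correct | A wrong).
    A high value means B carries complementary information, so the
    pair should not be merged away carelessly."""
    wrong_a = [i for i, e in enumerate(errors_a) if e]
    if not wrong_a:
        return 0.0
    return sum(correct_b[i] for i in wrong_a) / len(wrong_a)

# Per-sample flags on a calibration set (illustrative values)
errors_a  = [True, False, True, True, False]  # A wrong on samples 0, 2, 3
correct_b = [1, 1, 0, 1, 1]                   # B right on samples 0, 3
print(correction_probability(errors_a, correct_b))  # ≈ 0.667
```

Unlike geometric similarity, this signal is zero for a filter pair that fails on exactly the same samples, however similar their weights look.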
Two Tracks Emerge
Local HC (teammate's layer-local medoid merge) and Global Scoring (cross-layer ranking) diverged into parallel approaches — enabling a genuine head-to-head comparison on identical guard rails.
Hybrid Crossover Introduced
My core novel contribution: replacing the standard "drop the weaker filter" action with regression-based synthesis. ILR scoring, accuracy guard rails, and the full Global pipeline built and validated across both architectures and both datasets.
Novel Contribution
The Hybrid Crossover — synthesise, don't discard
The core idea I introduced and implemented for this directed study.
🌐 For everyone: Standard pruning finds two filters that do basically the same thing, keeps the better one, and discards the other. Simple — but permanent: the unique information in the discarded filter is gone. My approach creates a brand-new filter designed to capture the most important activations from both parents — a "best of both worlds" filter built through mathematical optimisation — and then replaces both originals with this single synthesised one.
⚙️ Technical detail: Given two candidate filters A and B identified as redundant by the Global scoring pipeline:
# fm_a, fm_b: the parent filters' activation maps on a calibration batch
target_hybrid = torch.max(fm_a, fm_b)  # element-wise max of both parents

# Solve for the new filter's weights via gradient descent
f_new = nn.Parameter(torch.randn_like(filter_a))
optimizer = torch.optim.Adam([f_new], lr=1e-3)
for _ in range(n_steps):
    optimizer.zero_grad()  # clear gradients from the previous step
    pred = F.conv2d(calibration_data, f_new)
    loss = F.mse_loss(pred, target_hybrid)
    loss.backward()
    optimizer.step()

# Replace both originals with the synthesised filter
layer.weight.data[idx_a] = f_new.data
layer.weight.data = delete_channel(layer.weight.data, idx_b)
Why Hybrid Crossover is faster, not slower: Standard drop under aggressive compression frequently violates accuracy guard rails and triggers rollback — the algorithm undoes the change, soft-locks the channel, and searches for an alternative. This is expensive. The Hybrid Crossover avoids rollback because the synthesised filter approximates the pre-pruning representation by construction. Fewer rollbacks → 2× faster overall runtime despite the additional regression step.
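The rollback mechanics described above can be sketched as a generic guard-railed prune loop. Function names and structure here are illustrative assumptions, not the project's implementation:

```python
def prune_with_rollback(candidates, apply_merge, undo_merge, evaluate,
                        baseline_acc, max_drop):
    """Generic guard-railed prune loop: try each candidate merge,
    commit it if accuracy stays within tolerance, otherwise roll
    back and soft-lock the channel so it is not retried."""
    soft_locked = set()
    kept = []
    for cand in candidates:
        if cand in soft_locked:
            continue
        apply_merge(cand)
        if baseline_acc - evaluate() > max_drop:
            undo_merge(cand)       # expensive rollback path
            soft_locked.add(cand)  # do not retry this channel
        else:
            kept.append(cand)      # change committed
    return kept, soft_locked
```

The runtime claim follows from this structure: each rollback costs an extra evaluation plus an undo, so a merge action that rarely violates the guard rails (synthesis) finishes faster in total than a cheaper action that violates them often (drop).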
The Global Scoring pipeline — ILR importance scoring
⚙️ Technical detail: Filters are scored using ILR (Inside-Layer Ranking) — a weighted fusion of three signals computed from forward passes only, making it fast, stable, and reproducible across long runs with many architectural changes:
Filters ranked globally across all layers simultaneously — not layer by layer. Cross-layer ranking concentrates pruning on compute-intensive blocks (B4, B5), producing disproportionate latency speedup.
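The global-versus-local distinction can be shown in a few lines. This is a sketch under stated assumptions: the scores are placeholder values standing in for ILR output, and the function name is hypothetical:

```python
def global_prune_order(scores_by_layer, n_prune):
    """Rank filters across ALL layers by importance score and return
    the n_prune least important as (layer, filter_idx) pairs.
    Layer-local pruning would instead take a fixed quota per layer."""
    pool = [(score, layer, idx)
            for layer, scores in scores_by_layer.items()
            for idx, score in enumerate(scores)]
    pool.sort()  # least important first
    return [(layer, idx) for _, layer, idx in pool[:n_prune]]

# Illustrative ILR-style scores: deep block B4 is highly redundant,
# so global ranking concentrates pruning there automatically.
scores = {"B1": [0.9, 0.8], "B4": [0.1, 0.2, 0.15], "B5": [0.3, 0.7]}
print(global_prune_order(scores, n_prune=3))
# [('B4', 0), ('B4', 2), ('B4', 1)]
```

With a per-layer quota, B1 would lose a filter it needs while B4 kept redundancy; the global pool avoids both mistakes at once.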
Accuracy guard rails — clinical-grade compression safety
🌐 For clinical context: Every structural change is validated before being committed. If the model's accuracy drops beyond the defined tolerance, the change is rolled back. This is the engineering discipline that separates research-grade compression from deployment-grade compression — in clinical AI, a compressed model that quietly loses sensitivity on a subgroup is more dangerous than a large model that runs slowly.
if (baseline_acc - current_acc) > Δ_max_overall: ROLLBACK
if (baseline_class_acc[c] - current_class_acc[c]) > Δ_max_class: ROLLBACK
# CelebA limits: Δ_max_overall = 2pp | Δ_max_class = 6pp
# CIFAR-10 limits: Δ_max_overall = 3pp | Δ_max_class = 9pp
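The pseudocode above, written out as a runnable sketch — the dict layout for accuracies is an assumption for illustration:

```python
def violates_guard_rails(baseline, current, max_overall, max_class):
    """Return True if a structural change must be rolled back.
    `baseline` and `current` hold overall accuracy plus per-class
    accuracies, all in percentage points."""
    if baseline["overall"] - current["overall"] > max_overall:
        return True
    return any(baseline["per_class"][c] - current["per_class"][c] > max_class
               for c in baseline["per_class"])

# CelebA limits: 2pp overall, 6pp per class
baseline   = {"overall": 94.2, "per_class": {"male": 95.0, "female": 93.4}}
ok_change  = {"overall": 93.1, "per_class": {"male": 94.2, "female": 92.0}}
bad_change = {"overall": 93.5, "per_class": {"male": 94.8, "female": 86.9}}
print(violates_guard_rails(baseline, ok_change, 2, 6))   # False
print(violates_guard_rails(baseline, bad_change, 2, 6))  # True
```

Note that `bad_change` passes the overall check but fails the per-class one: its female-class accuracy drops 6.5pp. The per-class rail exists precisely to catch this silent subgroup degradation.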
Results
Head-to-head: Hybrid Crossover vs. Standard Drop
Same architecture, same dataset, identical guard rails. Every other variable held constant.
| Metric | Standard Drop (Teammate) | Hybrid Crossover (Tarun) ★ | Delta |
|---|---|---|---|
| Final Top-1 Accuracy | 92.42% | 92.79% | +0.37pp |
| Total Channels Pruned | 196 | 372 | +176 (~2×) |
| Optimisation Runtime | ~4,060s | ~2,154s | ~2× faster |
| Layer 16 Compression | 35.5% pruned | 68.75% pruned | +33.2pp |
| Accuracy drop at Layer 16 | −0.98pp | −0.87pp | Less drop, more pruning |
The counterintuitive result: The Hybrid Crossover is both more aggressive (2× more channels removed) AND more accurate AND faster. Regression-synthesised filters satisfy guard rails on the first or second attempt — standard drop under aggressive compression triggers expensive rollback loops that dominate total runtime.
Full canonical results — all 6 experiments
| Approach | Architecture | Dataset | Acc. Before | Acc. After | Δ Overall | Params Pruned | Latency |
|---|---|---|---|---|---|---|---|
| Local HC | 5-CNN | CelebA | 94.15% | 92.41% | 1.74pp | 46.0% | 1.04× |
| Local HC | VGG16 | CelebA | 97.93% | 96.45% | 1.48pp | 30.5% | 1.03× |
| Global | 5-CNN | CelebA | 94.15% | 92.15% | 1.99pp | 70.1% | 1.28× |
| Global | VGG16 | CelebA | 97.93% | 95.94% | 1.99pp | 45.6% | 1.21× |
| Global | 5-CNN | CIFAR-10 | 91.45% | 90.25% | 1.20pp | 7.9% | 2.05× |
| Global | VGG16 | CIFAR-10 | 97.95% | 95.35% | 2.60pp | 5.2% | 1.52× |
All runs satisfy dataset-specific guard rails. CIFAR-10 5-CNN: 2.05× latency speedup with 1.2pp accuracy drop — by concentrating pruning on compute-intensive blocks (B4: 75% of channels removed) while leaving early, low-redundancy blocks intact.
Team Contributions
2-person directed study — clear division of work
Supervised by Prof. Russel Pears, UNT Computer Science & Engineering.
Tarun Sadarla — this portfolio
- ▸Hybrid Crossover (Tier 2) merging strategy — core novel contribution. Full design, implementation, and evaluation of regression-based filter synthesis.
- ▸Merge Strategies analysis — 30-page systematic analysis of 6 alternative merging strategies evaluated against Hybrid Crossover
- ▸Threshold Design Strategies — 35-page analysis of 7 principled threshold selection methods for cosine similarity-based pruning
- ▸Shared baseline (Phase 1–2), hard floor configuration, IEEE-format final report — jointly with teammate
Sai Naga Chaithanya Aavula — teammate
- ▸Full Global scoring pipeline — ILR scoring, guard rail enforcement, benefit-aware pruning loop (8,694 lines for CelebA alone)
- ▸Full Local HC pipeline — hierarchical clustering pruner (2,966 lines)
- ▸Shared baseline, hard floor design, IEEE report — jointly with Tarun
Limitations & Future Work
What remains open
Current limitations
Middle layers (B3, B2) resist compression due to low filter redundancy — Hybrid Crossover, like Standard Drop, finds few valid merge candidates there. This limits overall compression ratios on deeper architectures.
All experiments are on binary classification tasks (CelebA gender, CIFAR-10). Multi-class extension and medical imaging domain validation (histopathology, ultrasound) remain open — filter redundancy patterns may differ substantially from binary tasks.
Next steps
Phase 4 fine-tuning — short retraining after pruning to recover accuracy in middle layers that currently resist compression.
Medical imaging domain — extend evaluation to histopathology patch classification and U-Net segmentation backbones, where the deployment motivation for this work originates.
INT8 quantisation — combine structured pruning with quantisation for embedded clinical deployment, targeting both model size and inference precision simultaneously.
What this project built that transfers to clinical AI deployment
Synthesis over deletion as a compression philosophy: Reframing structured pruning as a synthesis problem — rather than a removal problem — produced empirically better results. Any compression task where information retention matters more than maximum removal benefits from this inversion. Directly applicable to compressing clinical imaging backbones for hospital hardware.
Guard-rail engineering as responsible compression: Hard accuracy limits (≤2pp overall, ≤6pp per-class) enforced clinical-grade reliability constraints during compression — every structural change validated before committed, with rollback if violated. A compressed model that quietly loses sensitivity on a subgroup is more dangerous than a large model that runs slowly.
Why the faster method is also the better method: The Hybrid Crossover is 2× faster not despite being better, but because of it. Regression-synthesised filters satisfy accuracy constraints on the first attempt; Standard Drop triggers expensive rollback loops. Quality and efficiency are not always in tension — accuracy-preserving design choices can also reduce compute cost.
ILR over Taylor saliency for production stability: Taylor-based saliency requires gradient computation — sensitive to training mode, mixed-precision settings, and compiler behaviour. ILR uses only forward-pass statistics: cheaper, reproducible, and stable across long runs. Choosing a simpler, more stable signal over a theoretically richer but fragile one is a production engineering skill that applies directly to any clinical AI deployment pipeline.
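The forward-only property can be illustrated with mean absolute activation per channel — a common forward-pass importance signal, used here as an illustrative stand-in since the exact ILR signals are not reproduced in this section:

```python
import torch
import torch.nn as nn

@torch.no_grad()  # forward passes only: no graph, no gradients
def forward_pass_saliency(conv, batch):
    """Mean absolute activation per output channel — a cheap,
    deterministic importance signal computed without backprop.
    (Illustrative stand-in for one of the ILR signals.)"""
    fmaps = conv(batch)                      # shape (N, C, H, W)
    return fmaps.abs().mean(dim=(0, 2, 3))   # one score per channel

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
scores = forward_pass_saliency(conv, torch.randn(4, 3, 32, 32))
print(scores.shape)  # torch.Size([8])
```

Because no backward graph is built, the score is unaffected by train/eval mode of gradient machinery, mixed-precision autocast of the backward pass, or compiler rewrites of the gradient computation — the fragility the passage attributes to Taylor saliency.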