Overview
Why the fusion failed — and why that's the most useful result
This project tackled emotion recognition from video and audio, deliberately excluding text. The fusion model's failure to outperform the individual streams taught more about multimodal architecture constraints than a success would have.
MELD (Friends TV series, 9,989 training utterances) provides 7-class emotion and 3-class sentiment labels with extreme class imbalance: Neutral dominates at ~47%, while Fear and Disgust each have fewer than 300 training samples. I led the project and owned the complete visual stream: frame extraction, a custom 5-layer CNN baseline, a pretrained VGG-16 model, the video feature branch for fusion, and the dual-head multi-task output architecture.
Architecture
Visual stream — two models + fusion branch
The fusion failure, and what it means: simple concatenation of the 100-D video and 192-D audio embeddings, without temporal alignment, means the model sees video features sampled at 1 frame per 3 seconds alongside audio features computed over the full clip: two different temporal windows of the same utterance. Fusion F1 dropped to 0.217, versus 0.355 for VGG-16 alone. This failure is directly informative for multimodal clinical pipelines: fusing imaging with EHR metadata requires the same temporal alignment reasoning. The failure mode is the lesson.
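The naive late-fusion step can be sketched as below (numpy; the feature arrays are synthetic stand-ins, and the mean-pooling choice is an assumption for illustration, not the project's exact code). The point is visible in the shapes: the two halves of the concatenated vector summarize different temporal windows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-utterance features (synthetic, not the project's real extractors).
# Video: one 100-D embedding per sampled frame (1 frame / 3 s -> ~3 frames for a 9 s clip).
video_frames = rng.normal(size=(3, 100))
# Audio: a single 192-D embedding summarizing the *entire* clip.
audio_clip = rng.normal(size=(192,))

# Naive fusion: pool the frames, then concatenate into a 292-D vector.
# The video half now describes sparse snapshots while the audio half describes
# the whole utterance -- the temporal misalignment discussed above.
video_pooled = video_frames.mean(axis=0)            # (100,)
fused = np.concatenate([video_pooled, audio_clip])  # (292,)
print(fused.shape)  # (292,)
```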
What this project built that transfers to clinical AI
Why simple fusion fails: understanding exactly why (temporal misalignment, feature-space mismatch) and what would fix it (cross-modal attention, temporal video modeling) is more valuable than an unexamined success. The same failure mode applies to any clinical pipeline fusing imaging data with time-series (vital signs, ECG) or structured metadata.
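One way the cross-modal attention fix could look, sketched with PyTorch's `nn.MultiheadAttention` (the projection size, head count, and class name are illustrative assumptions): instead of concatenating pooled vectors, the audio embedding queries the sequence of per-frame video features, letting the model align the two temporal windows itself.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Audio query attends over the sequence of per-frame video features."""
    def __init__(self, video_dim=100, audio_dim=192, d_model=128, n_heads=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, video_seq, audio_vec):
        # video_seq: (B, T, 100) per-frame features; audio_vec: (B, 192) clip-level
        v = self.video_proj(video_seq)               # (B, T, d_model)
        a = self.audio_proj(audio_vec).unsqueeze(1)  # (B, 1, d_model) as query
        fused, _ = self.attn(query=a, key=v, value=v)
        return fused.squeeze(1)                      # (B, d_model) fused embedding

model = CrossModalAttention()
out = model(torch.randn(2, 3, 100), torch.randn(2, 192))
print(out.shape)  # torch.Size([2, 128])
```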
Multi-task learning with dual output heads: Every other classification project has a single target. This was the first project with two simultaneous targets sharing a backbone but separate heads — how multi-task objectives interact during training is directly relevant to clinical AI systems that must simultaneously predict multiple biomarkers or measurements.
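A shared-backbone, dual-head setup of the kind described above can be sketched like this (PyTorch; the layer sizes, equal loss weighting, and class name are assumptions, not the project's exact configuration). The key point is that both cross-entropy terms backpropagate through the same backbone, which is where the tasks interact.

```python
import torch
import torch.nn as nn

class DualHeadNet(nn.Module):
    """Shared backbone with separate 7-way emotion and 3-way sentiment heads."""
    def __init__(self, in_dim=100, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, 7)
        self.sentiment_head = nn.Linear(hidden, 3)

    def forward(self, x):
        h = self.backbone(x)  # shared representation feeds both heads
        return self.emotion_head(h), self.sentiment_head(h)

model = DualHeadNet()
x = torch.randn(4, 100)
emo_logits, sent_logits = model(x)

# Joint objective: both task losses update the shared backbone.
ce = nn.CrossEntropyLoss()
loss = ce(emo_logits, torch.randint(0, 7, (4,))) + ce(sent_logits, torch.randint(0, 3, (4,)))
loss.backward()
print(emo_logits.shape, sent_logits.shape)
```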
Transfer learning far from ImageNet: VGG-16 pretrained on 1,000-class natural images, fine-tuned for facial emotion recognition — a task with very different discriminative features. The +8pp improvement despite the domain gap showed that hierarchical visual representations are partially transferable, which informs every transfer learning decision in medical imaging.
My contributions as project lead (team of 4)
- Frame extraction pipeline from raw MELD .mp4 clips at consistent temporal intervals
- Custom 5-layer CNN baseline with dual sentiment + emotion output heads
- VGG-16 fine-tuning with dual FC heads replacing the ImageNet classification head
- Video feature branch — VGG-16 + adaptive pooling producing a 100-D embedding for fusion
- Project lead — coordinated team of 4, managed visual/audio stream integration
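The frame-sampling step from the first bullet can be sketched as follows (OpenCV-based; the function names and the 3-second default are illustrative, not the project's code). The index arithmetic is separated from the video I/O so it stands on its own.

```python
def sample_indices(n_frames: int, fps: float, interval_s: float = 3.0) -> list[int]:
    """Frame indices at a fixed temporal interval (e.g. 1 frame every 3 s)."""
    step = max(1, round(fps * interval_s))
    return list(range(0, n_frames, step))

def extract_frames(path: str, interval_s: float = 3.0):
    """Read frames from an .mp4 at the given interval (requires opencv-python)."""
    import cv2  # imported lazily; the helper above has no dependency on it
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sample_indices(n, fps, interval_s):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# A 9 s clip at 24 fps (216 frames) yields 3 sampled frames.
print(sample_indices(216, 24.0))  # [0, 72, 144]
```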