Overview
Why the fusion failed — and why that's the most useful result
This project tackled emotion recognition from video and audio, deliberately excluding text. The fusion model's failure to outperform the individual streams taught more about multimodal architecture constraints than a success would have.
MELD (Friends TV series, 9,989 training utterances) provides 7-class emotion and 3-class sentiment labels with extreme class imbalance: Neutral dominates at ~47%, while Fear and Disgust each have fewer than 300 training samples. I led the project and owned the complete visual stream: frame extraction, a custom 5-layer CNN baseline, a pretrained VGG-16 model, the video feature branch for fusion, and the dual-head multi-task output architecture.
Architecture
Visual stream — two models + fusion branch
The fusion failure, and what it means: simple concatenation of the 100-D video and 192-D audio embeddings, without temporal alignment, means the model sees video features sampled at 1 frame per 3 seconds alongside audio features computed over the full clip: two different temporal windows of the same utterance. Fusion F1 dropped to 0.217, versus 0.355 for VGG-16 alone. This failure is directly informative for multimodal clinical pipelines: fusing imaging with EHR metadata requires the same temporal alignment reasoning. The failure mode is the lesson.
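The naive late-fusion step can be sketched as below (numpy; the feature arrays are synthetic stand-ins, and the mean-pooling choice is an assumption for illustration, not the project's exact code). The point is visible in the shapes: the two halves of the concatenated vector summarize different temporal windows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-utterance features (synthetic, not the project's real extractors).
# Video: one 100-D embedding per sampled frame (1 frame / 3 s -> ~3 frames for a 9 s clip).
video_frames = rng.normal(size=(3, 100))
# Audio: a single 192-D embedding summarizing the *entire* clip.
audio_clip = rng.normal(size=(192,))

# Naive fusion: pool the frames, then concatenate into a 292-D vector.
# The video half now describes sparse snapshots while the audio half describes
# the whole utterance -- the temporal misalignment discussed above.
video_pooled = video_frames.mean(axis=0)            # (100,)
fused = np.concatenate([video_pooled, audio_clip])  # (292,)
print(fused.shape)  # (292,)
```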
What this project built that transfers to clinical AI
Why simple fusion fails: understanding exactly why (temporal misalignment, feature-space mismatch) and what would fix it (cross-modal attention, temporal video modeling) is more valuable than an unexamined success. The same failure mode applies to any clinical pipeline fusing imaging data with time-series (vital signs, ECG) or structured metadata.
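One way the cross-modal attention fix could look, sketched with PyTorch's `nn.MultiheadAttention` (the projection size, head count, and class name are illustrative assumptions): instead of concatenating pooled vectors, the audio embedding queries the sequence of per-frame video features, letting the model align the two temporal windows itself.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Audio query attends over the sequence of per-frame video features."""
    def __init__(self, video_dim=100, audio_dim=192, d_model=128, n_heads=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, video_seq, audio_vec):
        # video_seq: (B, T, 100) per-frame features; audio_vec: (B, 192) clip-level
        v = self.video_proj(video_seq)               # (B, T, d_model)
        a = self.audio_proj(audio_vec).unsqueeze(1)  # (B, 1, d_model) as query
        fused, _ = self.attn(query=a, key=v, value=v)
        return fused.squeeze(1)                      # (B, d_model) fused embedding

model = CrossModalAttention()
out = model(torch.randn(2, 3, 100), torch.randn(2, 192))
print(out.shape)  # torch.Size([2, 128])
```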
Multi-task learning with dual output heads: Every other classification project has a single target. This was the first project with two simultaneous targets sharing a backbone but separate heads — how multi-task objectives interact during training is directly relevant to clinical AI systems that must simultaneously predict multiple biomarkers or measurements.
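A shared-backbone, dual-head setup of the kind described above can be sketched like this (PyTorch; the layer sizes, equal loss weighting, and class name are assumptions, not the project's exact configuration). The key point is that both cross-entropy terms backpropagate through the same backbone, which is where the tasks interact.

```python
import torch
import torch.nn as nn

class DualHeadNet(nn.Module):
    """Shared backbone with separate 7-way emotion and 3-way sentiment heads."""
    def __init__(self, in_dim=100, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, 7)
        self.sentiment_head = nn.Linear(hidden, 3)

    def forward(self, x):
        h = self.backbone(x)  # shared representation feeds both heads
        return self.emotion_head(h), self.sentiment_head(h)

model = DualHeadNet()
x = torch.randn(4, 100)
emo_logits, sent_logits = model(x)

# Joint objective: both task losses update the shared backbone.
ce = nn.CrossEntropyLoss()
loss = ce(emo_logits, torch.randint(0, 7, (4,))) + ce(sent_logits, torch.randint(0, 3, (4,)))
loss.backward()
print(emo_logits.shape, sent_logits.shape)
```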
Transfer learning far from ImageNet: VGG-16 pretrained on 1,000-class natural images, fine-tuned for facial emotion recognition — a task with very different discriminative features. The +8pp improvement despite the domain gap showed that hierarchical visual representations are partially transferable, which informs every transfer learning decision in medical imaging.
My contributions as project lead (team of 4)
- Frame extraction pipeline from raw MELD .mp4 clips at consistent temporal intervals
- Custom 5-layer CNN baseline with dual sentiment + emotion output heads
- VGG-16 fine-tuning with dual FC heads replacing the ImageNet classification head
- Video feature branch — VGG-16 + adaptive pooling producing a 100-D embedding for fusion
- Project lead — coordinated team of 4, managed visual/audio stream integration
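The frame-sampling step from the first bullet can be sketched as follows (OpenCV-based; the function names and the 3-second default are illustrative, not the project's code). The index arithmetic is separated from the video I/O so it stands on its own.

```python
def sample_indices(n_frames: int, fps: float, interval_s: float = 3.0) -> list[int]:
    """Frame indices at a fixed temporal interval (e.g. 1 frame every 3 s)."""
    step = max(1, round(fps * interval_s))
    return list(range(0, n_frames, step))

def extract_frames(path: str, interval_s: float = 3.0):
    """Read frames from an .mp4 at the given interval (requires opencv-python)."""
    import cv2  # imported lazily; the helper above has no dependency on it
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sample_indices(n, fps, interval_s):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# A 9 s clip at 24 fps (216 frames) yields 3 sampled frames.
print(sample_indices(216, 24.0))  # [0, 72, 144]
```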