// Multimodal · Visual Stream · Project Lead · Fusion Failure Analysis

Multimodal Emotion & Sentiment Recognition — Visual Stream

Emotion and sentiment classification from video and audio without text, on the MELD dataset (Friends TV, 7 emotion classes). I led the project and owned the visual stream: VGG-16 transfer learning, the dual-head multi-task architecture, and the video feature branch for fusion. The fusion failure taught more than success would have.

Project Lead · VGG-16 48.3% Accuracy · Dual-Head Multi-Task · PyTorch · MELD · Deep Learning · Spring 2025

Overview

Why the fusion failed — and why that's the most useful result

Emotion recognition from video and audio without text. The fusion's failure to outperform the individual streams taught more about multimodal architecture constraints than a success would have.

MELD (Friends TV series, 9,989 training utterances) presents 7-class emotion and 3-class sentiment labels with extreme class imbalance — Neutral dominates at ~47%, Fear and Disgust each have fewer than 300 training samples. I led the project and owned the complete visual stream: frame extraction, custom 5-layer CNN baseline, pretrained VGG-16 model, the video feature branch for fusion, and the dual-head multi-task output architecture.

48.3% · VGG-16 Emotion Accuracy · Visual stream
7 · Emotion Classes · MELD dataset
2 · Output Heads · Sentiment + Emotion

Architecture

Visual stream — two models + fusion branch

Baseline — my work
Custom 5-Layer CNN
Leaky ReLU, batch normalisation, max pooling. Dual output heads for simultaneous sentiment and emotion prediction (sketched below). Serves as the performance baseline (Sentiment 40.1%, Emotion 35.8%).
Improved — my work
Pretrained VGG-16
VGG-16 (ImageNet pretrained) with dual task-specific FC heads. Transfer learning from natural image features. Sentiment 47.5% (+7.4pp) and Emotion 48.3% (+12.5pp) over the CNN baseline.
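As a rough sketch of the dual-head pattern in the baseline card above: a 5-layer convolutional trunk (Leaky ReLU, batch normalisation, max pooling) shared by a sentiment head and an emotion head. Channel widths, input size, and layer details here are illustrative assumptions, not the notebook's exact configuration.

```python
import torch
import torch.nn as nn

class DualHeadCNN(nn.Module):
    """5-layer conv baseline with a shared trunk and two output heads.
    Widths and the 3x224x224 input are illustrative assumptions."""
    def __init__(self, n_sentiment=3, n_emotion=7):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.1),
                nn.MaxPool2d(2),
            )
        self.trunk = nn.Sequential(
            block(3, 32), block(32, 64), block(64, 128),
            block(128, 256), block(256, 256),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.sentiment_head = nn.Linear(256, n_sentiment)
        self.emotion_head = nn.Linear(256, n_emotion)

    def forward(self, x):
        feats = self.trunk(x)                      # shared representation
        return self.sentiment_head(feats), self.emotion_head(feats)

frames = torch.randn(8, 3, 224, 224)               # batch of extracted frames
sent_logits, emo_logits = DualHeadCNN()(frames)    # (8, 3) and (8, 7)
```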

The fusion failure — and what it means: Simple concatenation (100-D video + 192-D audio) without temporal alignment means the model sees video features from 1 frame per 3 seconds alongside audio features from the full clip — two different temporal windows of the same utterance. Fusion F1 dropped to 0.217 vs VGG-16 alone at 0.355. This failure is directly informative for multimodal clinical pipelines: imaging + EHR metadata fusion requires the same temporal alignment reasoning. The failure mode is the lesson.
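A minimal sketch of that concatenation fusion, assuming the 100-D video and 192-D audio vectors arrive as precomputed features; the classifier widths are hypothetical. The comment marks where the temporal mismatch enters: neither vector carries frame-level timing, so concatenation cannot align them.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Late fusion by simple concatenation: 100-D video features
    (from ~1 frame per 3 s) joined with 192-D audio features
    (from the full clip) -- two different temporal windows."""
    def __init__(self, n_emotion=7):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(100 + 192, 128),
            nn.ReLU(),
            nn.Linear(128, n_emotion),
        )

    def forward(self, video_feat, audio_feat):
        # No temporal alignment: the two vectors summarise different spans
        fused = torch.cat([video_feat, audio_feat], dim=-1)  # (batch, 292)
        return self.classifier(fused)

logits = ConcatFusion()(torch.randn(4, 100), torch.randn(4, 192))
```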

What this project built that transfers to clinical AI

Why simple fusion fails: Understanding exactly why — temporal misalignment, feature-space mismatch — and what would fix it (cross-modal attention, temporal video modeling) is more valuable than if it had worked. The same failure mode applies to any clinical pipeline fusing imaging data with time-series (vital signs, ECG) or structured metadata.
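One form the cross-modal attention fix could take, not something this project implemented: keep per-frame video features as a sequence and let the utterance-level audio feature query them via nn.MultiheadAttention, so the fused representation is conditioned on the frames that matter. Dimensions, projection sizes, and names are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Audio queries attend over per-frame video features, giving a
    temporally aligned video summary before fusion (illustrative only)."""
    def __init__(self, video_dim=100, audio_dim=192, d_model=128, n_heads=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, 7)   # 7 emotion classes

    def forward(self, video_seq, audio_feat):
        # video_seq: (batch, n_frames, video_dim); audio_feat: (batch, audio_dim)
        v = self.video_proj(video_seq)
        q = self.audio_proj(audio_feat).unsqueeze(1)   # (batch, 1, d_model)
        attended, _ = self.attn(q, v, v)               # audio-conditioned video summary
        fused = torch.cat([attended.squeeze(1), q.squeeze(1)], dim=-1)
        return self.classifier(fused)

logits = CrossModalAttention()(torch.randn(4, 10, 100), torch.randn(4, 192))
```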

Multi-task learning with dual output heads: Every other classification project has a single target. This was the first project with two simultaneous targets sharing a backbone but separate heads — how multi-task objectives interact during training is directly relevant to clinical AI systems that must simultaneously predict multiple biomarkers or measurements.
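A sketch of how the two objectives combine in a single update, pairing with the dual-head modules sketched above: each head gets its own cross-entropy loss, and the weighted sum backpropagates through the shared trunk, which is exactly where the two tasks interact. The loss weights are hypothetical, and class weights on the emotion loss would be a natural response to MELD's imbalance.

```python
import torch.nn as nn

sent_loss_fn = nn.CrossEntropyLoss()
emo_loss_fn = nn.CrossEntropyLoss()   # class weights would help with MELD's imbalance

def training_step(model, optimizer, frames, y_sent, y_emo, w_sent=0.5, w_emo=0.5):
    """One joint update: both heads backpropagate through the shared trunk,
    so gradients from the two tasks interact at every shared layer."""
    optimizer.zero_grad()
    sent_logits, emo_logits = model(frames)        # model returns both heads' logits
    loss = (w_sent * sent_loss_fn(sent_logits, y_sent)
            + w_emo * emo_loss_fn(emo_logits, y_emo))
    loss.backward()
    optimizer.step()
    return loss.item()
```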

Transfer learning far from ImageNet: VGG-16 pretrained on 1,000-class natural images, fine-tuned for facial emotion recognition — a task with very different discriminative features. The +12.5pp emotion improvement despite the domain gap showed that hierarchical visual representations are partially transferable, which informs every transfer learning decision in medical imaging.
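A sketch of that transfer-learning setup, assuming torchvision's pretrained VGG-16 with its convolutional trunk frozen and two new task-specific heads; the project's actual fine-tuning depth and head sizes may differ.

```python
import torch.nn as nn
from torchvision import models

class DualHeadVGG16(nn.Module):
    """ImageNet-pretrained VGG-16 trunk with two task-specific FC heads."""
    def __init__(self, n_sentiment=3, n_emotion=7, freeze_trunk=True):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.trunk = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())
        if freeze_trunk:
            for p in self.trunk.parameters():
                p.requires_grad = False            # fine-tune only the new heads
        self.sentiment_head = nn.Sequential(
            nn.Linear(512 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_sentiment))
        self.emotion_head = nn.Sequential(
            nn.Linear(512 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_emotion))

    def forward(self, x):                          # x: (batch, 3, 224, 224)
        feats = self.trunk(x)
        return self.sentiment_head(feats), self.emotion_head(feats)
```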

My contributions as project lead (team of 4)

PyTorch · VGG-16 · BiLSTM (teammate) · MELD Dataset · scikit-learn

View notebooks and architecture code.

CNN baseline, VGG-16 model, and full evaluation pipeline.

GitHub →
Get in Touch