Audio-Visual Emotion Recognition in Continuous Domain

Master Thesis, University of Hamburg (Informatik)

This research focuses on the complex challenge of Audio-Visual Emotion Recognition (AV-ER) within the continuous domain (Valence-Arousal space). Unlike traditional discrete emotion classification (e.g., Happy, Sad), continuous modeling allows for a more nuanced understanding of human affect by positioning emotions on a dimensional spectrum.

Methodology & Innovation

The core of the project lies in multi-modal representation learning. By leveraging both self-supervised and semi-supervised techniques, the framework achieves robust feature extraction even from large-scale unlabelled datasets.

1. Dual-Stream Representation Learning

Speech (Audio): Implements and evaluates both contrastive and non-contrastive methodologies on unlabelled speech data (e.g., Tedlium v3). This ensures the model captures acoustic nuances that are typically lost in supervised training.
Facial (Visual): Utilizes high-fidelity synthetic data (FaceSynthetics) to learn rich facial representations, ensuring the model is resilient to variations in lighting, pose, and occlusions.

2. Pseudo-Labelling Pipeline

One of the key contributions is an automated high-fidelity labelling pipeline. By strategically generating “pseudo-labels” for unlabelled data, the system significantly reduces the need for expensive manual annotation while increasing the training volume and subsequent accuracy on datasets like IEMOCAP.

🛠️ Technical Architecture

The project is built as a robust, production-ready Python package with a modular design:

Framework: Developed using PyTorch for GPU-accelerated training and inference.
Experiment Tracking: Full integration with Comet ML for real-time monitoring of loss curves, valence/arousal correlation (CCC), and hyper-parameter optimization.
CLI-First Workflow: A comprehensive Command Line Interface that handles the entire pipeline—from dataset preparation and augmentation to multi-stage representation learning and final evaluation.

📊 Impact & Results

By integrating audio-visual streams through late-fusion and optimized attention mechanisms, the model demonstrates significant performance gains over single-modality baselines. The framework provides a scalable foundation for applications in Human-Robot Interaction, Empathetic AI, and Mental Health Monitoring, where understanding the subtle “shades” of emotion is critical.

Resource Directory

Methodology & Innovation

1. Dual-Stream Representation Learning

2. Pseudo-Labelling Pipeline

🛠️ Technical Architecture

📊 Impact & Results

View Next

Extract Job Titles from Resumes with Google Gemini AI

COLMAN: Collaborative Multi-Agent Navigation using Textual-Visual Embeddings

Methodology & Innovation

1. Dual-Stream Representation Learning

2. Pseudo-Labelling Pipeline

🛠️ Technical Architecture

📊 Impact & Results

View Next

Extract Job Titles from Resumes with Google Gemini AI

COLMAN: Collaborative Multi-Agent Navigation using Textual-Visual Embeddings

Quick Connect