## The question
Most conversational AI today is text-first: a transcript goes in, a response comes out. That works in chat. It fails in video calls — where the meaningful signal isn’t just what the person said, it’s how they said it (prosody), which sounds they actually made (phoneme stream, not the cleaned-up transcript), and what their face is doing (subtle motion, gaze, expression). The CAMELS research project asks: can we project all three of those signals — video, phoneme, prosody — into a single shared latent space that a downstream agent can attend to jointly, in real time, without first collapsing everything to text?
This is a team capstone research project. The training infrastructure and core encoder–adapter architecture are led by Watson Blair; I’m one of the contributors building on top of the trained latent space (see My contribution below).
## The data
| Dataset | What it gives us |
|---|---|
| Seamless Interaction (Meta) | 4,000+ hours of in-person face-to-face interaction with synchronized video and audio |
| CANDOR corpus | 1,650 video chat conversations between strangers with rich behavioural metadata |
| L2-Arctic | 26,867 utterances from 24 non-native English speakers, used for phoneme-pipeline evaluation |
For benchmarks the project uses CMU-MOSEI (sentiment + emotion), VGGSound (audio-visual retrieval), MER2025 (Chinese emotion recognition), and MELD (multimodal dialogue emotion).
## The architecture
Three frozen pretrained encoders, each followed by a thin learnable adapter that projects into the shared 768-D latent space:
| Modality | Frozen encoder | Output shape | Adapter |
|---|---|---|---|
| Video (face/landmarks) | Selectable; default facemesh_landmarks | (d_video,) | AVAEAdapter |
| Phoneme | wav2vec2-lv-60-espeak-cv-ft (CTC) | (MAX_PHONES, 1024) | PhonemeAdapter + PhonemeAttnPool |
| Prosody | librosa 22-dim hand-engineered features | (22,) | AVAEAdapter |
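As a concrete example of the prosody row, here is a minimal sketch of what hand-engineered prosodic features look like in librosa. The project's exact 22-dimension recipe isn't spelled out here, so these pitch/energy/rhythm statistics are purely illustrative:

```python
# Illustrative hand-engineered prosody features. NOT the project's exact
# 22-dim recipe (which isn't enumerated here); these stats are typical.
import librosa
import numpy as np

def prosody_features(wav_path: str) -> np.ndarray:
    # Assumes the clip contains voiced speech.
    y, sr = librosa.load(wav_path, sr=16000)
    # Frame-level pitch track; pyin marks unvoiced frames as NaN.
    f0, _, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
    voiced = f0[~np.isnan(f0)]
    rms = librosa.feature.rms(y=y)[0]               # frame-level energy
    zcr = librosa.feature.zero_crossing_rate(y)[0]  # crude voicing/rhythm proxy
    return np.array([
        voiced.mean(), voiced.std(), voiced.min(), voiced.max(),  # pitch stats
        len(voiced) / len(f0),                                    # voiced fraction
        rms.mean(), rms.std(), rms.max(),                         # energy stats
        zcr.mean(), zcr.std(),                                    # rhythm proxy
    ], dtype=np.float32)
```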
The design pattern is “Adapted Pretrained Encoders” (APE): never re-train the heavy pretrained models, only learn the small adapters that bring them into a common representation.
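A minimal sketch of the APE pattern, using the phoneme branch as the example. Hedged: the repo's PhonemeAdapter and PhonemeAttnPool are more involved than this; mean pooling stands in for attention pooling, and the adapter shape here is hypothetical.

```python
# Minimal APE sketch: a frozen pretrained encoder feeds a small trainable
# adapter that projects into the shared 768-D latent space. Only the
# adapter ever receives gradients.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

LATENT_DIM = 768

class Adapter(nn.Module):
    """Thin learnable projection from an encoder's output to the shared latent."""
    def __init__(self, in_dim: int, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Frozen phoneme-side encoder; no gradients ever flow into it.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")
encoder.requires_grad_(False)
encoder.eval()

adapter = Adapter(in_dim=encoder.config.hidden_size)  # 1024 -> 768

waveform = torch.randn(1, 16000)  # 1 s of 16 kHz audio (placeholder)
with torch.no_grad():
    frames = encoder(waveform).last_hidden_state  # (1, T, 1024), frozen
latents = adapter(frames.mean(dim=1))             # (1, 768), trainable path
```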
## The training curriculum
Adapter training proceeds in three stages, then a fourth stage trains downstream agents:
- Stage A — Contrastive alignment. Adapters learn that the video / phoneme / prosody embeddings of the same moment in time should sit close in latent space, while embeddings from unrelated moments should sit far apart (both this and the Stage C objective are sketched in code after this list).
- Stage B — Adds AVAE reconstruction. Each adapter is asked to also reconstruct its own input from the latent, regularising the latent so it actually carries the information.
- Stage C — Adds bidirectional flow matching. Forces consistency between modalities at the latent level — if you know the prosody embedding, you should be able to flow to a plausible video embedding for the same moment.
- Stage D — Conversational agent. With the latent space frozen, a downstream agent learns to operate inside it. The most recent finding here (Phase D1.5, validated April 2026) is that a lightweight latent-space response decoder beats a GPT-2 + LoRA text-only baseline by a large margin (cosine-sim 0.97 vs 0.54 at epoch 5 on a 20-dialogue IEMOCAP comparison), while being 1.2× faster per epoch and removing the external text-model dependency entirely.
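To make the curriculum concrete, here is a hedged sketch of the Stage A and Stage C loss shapes. Simplified: the repo's actual objectives, temperatures, and loss weights may differ, the Stage B AVAE reconstruction term is omitted, and `v_net` is a hypothetical velocity network.

```python
# Sketch of the Stage A and Stage C loss shapes (simplified).
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, tau: float = 0.07):
    """Stage A: symmetric InfoNCE between two modalities. Row i of z_a and
    z_b embed the same moment in time; all other pairings are negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def flow_matching_loss(v_net, z_src, z_tgt):
    """Stage C: rectified-flow-style matching from one modality's latent to
    another's; v_net predicts the velocity along the straight path between
    them. Run it in both directions for the bidirectional case."""
    t = torch.rand(z_src.size(0), 1, device=z_src.device)  # random time in [0, 1]
    z_t = (1 - t) * z_src + t * z_tgt                      # point on the path
    return F.mse_loss(v_net(z_t, t), z_tgt - z_src)        # match target velocity
```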
## My contribution
The training infrastructure, encoder architecture, and curriculum are a team effort. My specific contributions, all visible in the repo’s commit history, are:
- Latent-space visualization (PR 6): t-SNE and UMAP projections of the 768-D latent across modalities and across training stages, used to qualitatively diagnose alignment between video / phoneme / prosody embeddings at each curriculum stage.
- Stage D evaluation reports: per-phase plots and landmark-motion latent t-SNE used to track whether the downstream agent’s response embeddings drift away from the speaker-conditioned latent.
- Documentation and design: post-facto architecture audit, a Docusaurus documentation site bootstrap, and a packaging roadmap so the encoders can eventually be consumed externally.
In analyst terms: the team builds the model; my contribution is making it interpretable and consumable, both for the team (visualizations during training) and for the outside world (docs, packaging design).
## Visualizations
Chart to drop in, from PR 6 (feat(viz): latent-space t-SNE / UMAP): two side-by-side panels. Left, the early-stage latent, where the modalities sit in separate clusters; right, the post-Stage-C latent, where the modalities collapse into a shared manifold. This is the qualitative evidence that contrastive alignment plus flow matching is doing what we claim.
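For flavor, a minimal sketch of how a projection like that can be generated, assuming per-modality latents were dumped as (N, 768) NumPy arrays (the file names are hypothetical):

```python
# Minimal latent-projection sketch (assumes per-modality latents were saved
# as (N, 768) NumPy arrays; plotting details kept to a minimum).
import numpy as np
import matplotlib.pyplot as plt
import umap  # umap-learn

latents = {
    "video": np.load("latents_video.npy"),      # hypothetical file names
    "phoneme": np.load("latents_phoneme.npy"),
    "prosody": np.load("latents_prosody.npy"),
}

# Project all modalities together so they share one 2-D embedding.
stacked = np.concatenate(list(latents.values()))
coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(stacked)

start = 0
for name, z in latents.items():
    end = start + len(z)
    plt.scatter(*coords[start:end].T, s=4, label=name)
    start = end
plt.legend()
plt.title("Shared latent, UMAP projection")
plt.savefig("latent_umap.png", dpi=150)
```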
## Why this matters
For real-time human–AI interaction (video meetings, accessibility tools, mental-health screening), the latent-space approach has three concrete advantages over today’s transcript-first pipelines:
- Latency. Text-first systems wait for an ASR pass before they can react. A shared-latent system can react to prosody and facial motion while the speaker is still talking.
- Robustness. When ASR fails (accents, overlapping speech, low-bandwidth audio), text-first systems silently lose all signal. CAMELS still has video and prosody.
- Bandwidth. A 768-D vector per timestep is dramatically cheaper than streaming raw video + audio to a downstream model.
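Back of the envelope, assuming fp32 latents emitted 25 times per second: 768 × 4 bytes ≈ 3 KB per timestep, or roughly 0.6 Mbit/s, versus the several Mbit/s of even a compressed 720p video stream, let alone raw frames.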
## Limitations
- The encoders are frozen pretrained models — biases and failure modes baked into wav2vec2 and FaceMesh propagate downstream.
- Evaluation is currently dominated by retrieval-style metrics; ecological validity in actual live calls is still ahead of us.
- The 22-dim hand-engineered prosody adapter is the weakest link; a learned prosody encoder is a likely future direction.
## Links
- Dkatya/camels-multimodal-encoders: a public companion repo describing the encoder architecture.
- The full training codebase (WatsonWBlair/LSCA) is private while the work is unpublished.
