Research Papers · audio_ml · representation_learning · music


PHALAR introduces a contrastive audio representation framework using phasor-based complex-valued heads, achieving a ~70% relative accuracy improvement over the state of the art in musical stem retrieval, with fewer parameters and faster training.
PHALAR: Phasors for Learned Musical Audio Representations

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to ≈70% over the state of the art while requiring fewer than half the parameters and training 7× faster. By using a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes a new retrieval state of the art across MoisesDB, Slakh, and CocoChorales, and correlates significantly more strongly with human judgments of coherence than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structure beyond the retrieval task.
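The abstract names two ingredients worth making concrete: a complex-valued "phasor" head and contrastive matching of a submix against its missing stem. Below is a minimal PyTorch sketch of how such a pairing could look; the module names, dimensions, stand-in encoder, and InfoNCE-style loss are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: names, shapes, and the loss form are assumptions,
# not PHALAR's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhasorHead(nn.Module):
    """Maps real-valued encoder features to a unit-norm complex embedding.

    Encoding each dimension as a phasor (magnitude and phase) means a
    global phase shift rotates the embedding instead of distorting it,
    which is one way a head can be biased toward phase equivariance.
    """

    def __init__(self, in_dim: int = 512, out_dim: int = 128):
        super().__init__()
        self.to_real = nn.Linear(in_dim, out_dim)
        self.to_imag = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = torch.complex(self.to_real(h), self.to_imag(h))
        # L2-normalize the complex vector so similarity depends only on
        # its direction (relative magnitudes and phases), not its scale.
        return z / torch.linalg.vector_norm(z, dim=-1, keepdim=True).clamp_min(1e-8)


def stem_retrieval_loss(submix_z: torch.Tensor,
                        stem_z: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style objective: pair each submix with its missing stem.

    Similarity is the real part of the complex inner product, which is
    largest when the two embeddings agree in both magnitude and phase.
    """
    sim = torch.real(submix_z @ stem_z.conj().t()) / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)


# Toy usage with a stand-in encoder (the paper's encoder and its Learned
# Spectral Pooling layer are not reproduced here).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(16000, 512))
head = PhasorHead()
submix = torch.randn(8, 1, 16000)  # batch of 1 s submixes at 16 kHz (assumed)
stems = torch.randn(8, 1, 16000)   # the corresponding missing stems
loss = stem_retrieval_loss(head(encoder(submix)), head(encoder(stems)))
loss.backward()
```

At retrieval time, the same similarity score could rank candidate stems against a submix embedding, which is the setting the reported accuracy numbers refer to.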
