HEAR: Holistic Evaluation of Audio Representations

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning?

The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music.

HEAR consists of:

  • A benchmark of nineteen diverse tasks. These tasks encompass multiple audio domains: speech, environmental sound, and music, with tasks that involve short and long time spans;
  • Open-source evaluation code. Evaluation consists of classification tasks, both multiclass and multilabel, requiring either prediction over the entire audio clip (scene-level) or temporal onset detection of sound events;
  • An API for developing HEAR-compatible models, making it easy for researchers to develop new models and perform evaluation using HEAR evaluation code;
  • A leaderboard to keep track of performance on the HEAR benchmark.
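To illustrate the shape of a HEAR-compatible model, here is a minimal sketch. The entry-point names (`load_model`, `get_timestamp_embeddings`, `get_scene_embeddings`) and model attributes follow the HEAR API convention, but the internals (a per-frame energy "embedding" with a hypothetical 20 ms hop) are placeholder logic, and NumPy arrays stand in for the tensors a real submission would use:

```python
import numpy as np

class Model:
    """Stub model exposing the attributes the HEAR API expects."""
    sample_rate = 16000            # sample rate the model expects (assumption)
    scene_embedding_size = 64      # illustrative embedding sizes
    timestamp_embedding_size = 64

def load_model(model_file_path: str = "") -> Model:
    """Load model weights from disk; here we just return a stub."""
    return Model()

def get_timestamp_embeddings(audio: np.ndarray, model: Model):
    """audio: (n_sounds, n_samples). Returns (embeddings, timestamps)."""
    hop = model.sample_rate // 50          # 20 ms hop, an illustrative choice
    n_sounds, n_samples = audio.shape
    n_frames = n_samples // hop
    frames = audio[:, : n_frames * hop].reshape(n_sounds, n_frames, hop)
    # Placeholder "embedding": per-frame RMS energy repeated across features.
    energy = np.sqrt((frames ** 2).mean(axis=2, keepdims=True))
    embeddings = np.repeat(energy, model.timestamp_embedding_size, axis=2)
    # Timestamps (milliseconds) at each frame center, one row per sound.
    timestamps = (np.arange(n_frames) + 0.5) * hop / model.sample_rate * 1000.0
    timestamps = np.tile(timestamps, (n_sounds, 1))
    return embeddings, timestamps

def get_scene_embeddings(audio: np.ndarray, model: Model) -> np.ndarray:
    """One embedding per clip: mean-pool the timestamp embeddings."""
    embeddings, _ = get_timestamp_embeddings(audio, model)
    return embeddings.mean(axis=1)

model = load_model()
batch = np.random.randn(2, model.sample_rate)   # two 1-second clips
emb, ts = get_timestamp_embeddings(batch, model)
scene = get_scene_embeddings(batch, model)
```

Any module exposing these three functions (with real embeddings in place of the stubs) can be plugged directly into the HEAR evaluation code for both clip-level and onset-detection tasks.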

The HEAR benchmark was launched as a NeurIPS competition in 2021. Please see our upcoming journal article, to appear in the PMLR issue on NeurIPS 2021 Competitions, for more information on HEAR.

Organizing Team

Joseph Turian, Jordie Shier, Bhiksha Raj, Björn W. Schuller, Christian James Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Yonatan Bisk, Gyanendra Das, Humair Raj Khan, Camille Noufi, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, and Zeyu Jin



HEAR was sponsored by Google, and all evaluations conducted as part of the 2021 NeurIPS challenge were performed on Google Cloud Platform.