HEAR Benchmark Tasks and Data

Downloading

All datasets were preprocessed and normalized to a common human-readable format using hearpreprocess. They are available for download from our Zenodo mirror, which hosts all audio tasks but only at a 48000 Hz sampling rate.

For other sampling rates (16000, 22050, 32000, 44100), please download files (requester pays) from:

Google Storage: gs://hear2021-archive/tasks/
AWS: s3://hear2021-archive/tasks/

(for the AWS CLI, pass the flag --request-payer requester).
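As a rough sketch, a requester-pays download from the AWS bucket could be scripted as below. The helper name build_aws_copy_command and the local directory ./hear-tasks are illustrative assumptions. The layout under tasks/ is not described here, so the sketch simply copies the whole prefix. It assumes the AWS CLI is installed and configured with credentials that will be billed for the transfer, and it uses the CLI's --request-payer requester option.

```python
import shlex

BUCKET_PREFIX = "s3://hear2021-archive/tasks/"  # bucket path given in the text above


def build_aws_copy_command(dest_dir: str) -> list[str]:
    """Return the argv for a recursive, requester-pays copy with the AWS CLI."""
    return [
        "aws", "s3", "cp", BUCKET_PREFIX, dest_dir,
        "--recursive",                    # copy everything under the prefix
        "--request-payer", "requester",   # acknowledge that the caller pays
    ]


if __name__ == "__main__":
    # Print the command instead of running it, since running it incurs costs.
    print(shlex.join(build_aws_copy_command("./hear-tasks")))
```

The resulting list can be passed to subprocess.run once the destination and billing account have been checked.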

Task Summary

Summary of the 19 evaluation tasks in the HEAR Benchmark.

Task Name	Embed Type	Predictor Type	Split Method	Duration (sec)	# clips	Evaluation Metric	Novel
DCASE 2016 Task 2	T	L	TVT	120.0	72	Onset FMS	✓
NSynth Pitch 5hr	S	C	TVT	4.0	5000	Pitch Acc.	✓
NSynth Pitch 50hr	S	C	TVT	4.0	49060	Pitch Acc.	✓
Speech Commands 5hr	S	C	TVT	1.0	22890	Accuracy	✓
Speech Commands Full	S	C	TVT	1.0	100503	Accuracy
Beehive States	S	C	TVT	600.0	576	AUCROC
Beijing Opera Percussion	S	C	5-fold	4.77	236	Accuracy	✓
CREMA-D	S	C	5-fold	5.0	7438	Accuracy
ESC-50	S	C	5-fold	5.0	2000	Accuracy
FSD50K	S	L	TVT	0.3-30.0	51185	mAP
Gunshot Triangulation	S	C	7-fold	1.5	88	Accuracy	✓
GTZAN Genre	S	C	10-fold	30.0	1000	Accuracy
GTZAN Music Speech	S	C	10-fold	30.0	128	Accuracy
LibriCount	S	C	5-fold	5.0	5720	Accuracy
MAESTRO 5hr	T	L	5-fold	120.0	185	Onset FMS	✓
Mridangam Stroke	S	C	5-fold	0.81	6977	Accuracy	✓
Mridangam Tonic	S	C	5-fold	0.81	6977	Accuracy	✓
Vocal Imitations	S	C	3-fold	11.26	5601	mAP	✓
VoxLingua107 Top10	S	C	5-fold	18.64	972	Accuracy	✓

This task is adapted from DCASE 2016 Task 2, office sound event detection. Our evaluation uses different splits, so the numbers cannot be directly compared to previously published results.
Postprocessing: Segments were postprocessed using 250 ms median filtering. At each validation step, a minimum event duration of 125 or 250 ms was chosen to maximize the onset-only event-based F-measure (with 200 ms tolerance). Scores were computed using sed_eval.
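The postprocessing described above can be sketched roughly as follows for a single event class. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, the 50 ms frame hop is an assumption (so a 250 ms median filter spans 5 frames), and scoring with sed_eval is not reproduced.

```python
from statistics import median


def median_filter(frames: list[int], width: int) -> list[int]:
    """Median-filter a binary frame sequence; width must be odd, edges are zero-padded."""
    half = width // 2
    padded = [0] * half + list(frames) + [0] * half
    return [int(median(padded[i:i + width])) for i in range(len(frames))]


def frames_to_events(frames: list[int], hop_ms: float) -> list[tuple[float, float]]:
    """Convert runs of active frames into (onset_ms, offset_ms) events."""
    events, start = [], None
    for i, active in enumerate(list(frames) + [0]):  # trailing 0 closes an open run
        if active and start is None:
            start = i
        elif not active and start is not None:
            events.append((start * hop_ms, i * hop_ms))
            start = None
    return events


def drop_short_events(events, min_duration_ms: float):
    """Keep only events at least min_duration_ms long."""
    return [(on, off) for on, off in events if off - on >= min_duration_ms]


if __name__ == "__main__":
    HOP_MS = 50.0  # assumed frame hop
    raw = [0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
    smoothed = median_filter(raw, width=5)  # 5 frames * 50 ms = 250 ms
    print(drop_short_events(frames_to_events(smoothed, HOP_MS), 250.0))
```

In a validation loop, the last step would be run once with 125 ms and once with 250 ms, keeping whichever minimum duration gives the better onset F-measure.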

NSynth Pitch is a HEAR open task and is a multiclass classification problem. The goal of this task is to classify instrumental sounds from the NSynth Dataset into one of 88 pitches. Results for this task are measured by pitch accuracy as well as chroma accuracy. The chroma accuracy metric only considers the pitch class and disregards octave errors.
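The difference between the two metrics can be illustrated with a short sketch. It assumes pitches are encoded as integer MIDI note numbers, which is an assumption made for this example; the function names are hypothetical.

```python
def pitch_accuracy(true_pitches: list[int], predicted_pitches: list[int]) -> float:
    """Fraction of clips whose predicted pitch matches exactly."""
    matches = sum(t == p for t, p in zip(true_pitches, predicted_pitches))
    return matches / len(true_pitches)


def chroma_accuracy(true_pitches: list[int], predicted_pitches: list[int]) -> float:
    """Fraction of clips whose predicted pitch class matches, ignoring the octave."""
    matches = sum(t % 12 == p % 12 for t, p in zip(true_pitches, predicted_pitches))
    return matches / len(true_pitches)


if __name__ == "__main__":
    truth = [60, 62, 64, 65]
    preds = [60, 74, 63, 65]  # 74 is an octave above 62; 63 is a wrong pitch class
    print(pitch_accuracy(truth, preds), chroma_accuracy(truth, preds))
```

An octave error counts against pitch accuracy but not against chroma accuracy, so chroma accuracy is never lower than pitch accuracy.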

This task is classification of known spoken commands, with additional categories for silence and unknown commands. It was described in "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition". As per the literature, we measure accuracy.

This is a binary classification task using audio recordings of two beehives (Nolasco et al., 2019). Each beehive is in one of two states: a queen-less state, where for some reason the queen is missing, or a normal state. There are 930 clips in this dataset, most of which are 10 minutes long.

This is a novel audio classification task developed using the Beijing Opera Percussion Instrument Dataset. The Beijing Opera uses six main percussion instruments that can be classified into four main categories: Bangu, Naobo, Daluo, and Xiaoluo. There are 236 audio clips. Scores are averaged over 5 folds.

CREMA-D is a dataset for emotion recognition. The original dataset contains audiovisual data of actors reciting sentences with one of six different emotions (Anger, Disgust, Fear, Happy, Neutral and Sad). For HEAR we only use the audio recordings. As per the literature, we use 5-fold cross validation. There are 7438 clips.

This is a multiclass classification task on environmental sounds. The ESC-50 dataset is a collection of 2000 environmental sounds organized into 50 classes. Scores are averaged over 5 folds. (The folds are predefined in the original dataset.)

FSD50K is a multilabel task (Fonseca et al., 2020). This dataset contains over 100 hours of human-labeled sound events from Freesound. Each of the approximately 51k audio clips is labeled using one or more of 200 classes from the AudioSet Ontology, encompassing environmental sounds, speech, and music. Unlike the other datasets, for FSD50K scene embeddings we did not alter the audio clip length. Each clip is between 0.3 and 30 seconds long. We use the predefined train/val/eval split. Evaluation is done using mean average precision (mAP).
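For readers unfamiliar with mAP in the multilabel setting, the following is a minimal sketch of one common definition: average precision is computed per class from the ranked scores, then averaged over classes. The function names are hypothetical, and the handling of ties and of classes with no positive clips (scored 0.0 here) is an assumption that may differ from the benchmark's evaluation code.

```python
def average_precision(labels: list[int], scores: list[float]) -> float:
    """Mean of the precision values at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0  # assumption: empty class scores 0


def mean_average_precision(label_rows, score_rows) -> float:
    """Rows are clips, columns are classes; average the per-class AP."""
    n_classes = len(label_rows[0])
    per_class = []
    for c in range(n_classes):
        labels = [row[c] for row in label_rows]
        scores = [row[c] for row in score_rows]
        per_class.append(average_precision(labels, scores))
    return sum(per_class) / n_classes


if __name__ == "__main__":
    labels = [[1, 0], [0, 1], [1, 1]]               # three clips, two classes
    scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6]]
    print(mean_average_precision(labels, scores))
```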

Gunshot triangulation is a novel resource multiclass classification task that utilizes a unique dataset: gunshots recorded in an open field using iPod Touch devices (Cooper and Shaw, 2020). The data consist of 22 shots from 7 different firearms, for a total of 88 audio clips, the smallest dataset in HEAR. Each shot is recorded using four different iPod Touches, located at different distances from the shooter. The goal of this task is to classify audio by the iPod Touch that recorded it, i.e., to identify the location of the microphone. The dataset was split into 7 different folds, where each firearm belonged to only one fold. Results are averaged over the folds.
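The grouped split described above, in which every clip from a given firearm lands in the same fold, can be sketched as follows. The function names and the firearm labels in the example are hypothetical; only the grouping idea and the averaging over folds come from the text.

```python
def folds_by_group(groups: list[str]) -> dict[str, list[int]]:
    """Map each group (e.g. a firearm) to the indices of its clips."""
    folds: dict[str, list[int]] = {}
    for index, group in enumerate(groups):
        folds.setdefault(group, []).append(index)
    return folds


def train_test_splits(groups: list[str]):
    """Yield (held_out_group, train_indices, test_indices), one split per group."""
    for held_out, test_idx in folds_by_group(groups).items():
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


def mean_score(fold_scores: list[float]) -> float:
    """Average a per-fold metric, as done for the reported results."""
    return sum(fold_scores) / len(fold_scores)


if __name__ == "__main__":
    # Hypothetical firearm label for each clip.
    clip_firearms = ["a", "a", "b", "c", "b"]
    for held_out, train_idx, test_idx in train_test_splits(clip_firearms):
        print(held_out, train_idx, test_idx)
```

Keeping a firearm within one fold prevents a model from being tested on recordings of a firearm it has already heard during training.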

The GTZAN Genre Collection (Tzanetakis and Cook, 2002) is a dataset of 1000 audio tracks (each 30 seconds in duration) that are categorized into ten genres (100 tracks per genre). The task is multiclass classification. As per the literature, scores are averaged over 10 folds. However, we do not use the corrected artist-conditional splits from Sturm (2013).

GTZAN Music Speech is a binary classification task, where the goal is to distinguish between music and speech. The dataset consists of 120 tracks (each 30 seconds in duration) and each class (music/speech) has 60 examples.

LibriCount is a multiclass speaker count identification task (Stöter et al., 2018b). The dataset contains audio of a simulated cocktail party environment with between 0 and 10 speakers. The goal of this task is to classify how many speakers are present in each of the recordings. Following Stöter et al. (2018a), we treat this as a classification, not regression, problem.

This is a novel music transcription task adapted from MAESTRO. For HEAR, we created a subsampled version that includes 5 hours of training and validation audio, in 120-second clips. To evaluate submissions, a shallow transcription model was trained on timestamp-based embeddings provided by the participant models.
We use note onset FMS and note onset with offset FMS for evaluation, as per the original MAESTRO paper (Hawthorne et al., 2019) and the preceding Onsets and Frames paper (Hawthorne et al., 2018).
Note onset measures the ability of the model to estimate note onsets within a 50 ms tolerance and ignores offsets. Note onset w/ offset additionally requires the note duration to be within 20% of the ground truth or within 50 ms, whichever is greater.
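The two tolerance rules can be made concrete with a small sketch. Notes are represented as (onset_sec, offset_sec, pitch) tuples, and the function names are hypothetical. The greedy one-to-one matching is a simplification chosen for brevity and is an assumption of this example; evaluation toolkits typically use an optimal matching, so scores may differ in crowded passages.

```python
def notes_match(ref, est, with_offset: bool,
                onset_tol: float = 0.05, offset_ratio: float = 0.2,
                offset_min_tol: float = 0.05) -> bool:
    """Check pitch, onset within 50 ms, and optionally the offset rule."""
    ref_on, ref_off, ref_pitch = ref
    est_on, est_off, est_pitch = est
    if ref_pitch != est_pitch or abs(ref_on - est_on) > onset_tol:
        return False
    if not with_offset:
        return True
    # Offset tolerance: 20% of the reference duration or 50 ms, whichever is greater.
    tolerance = max(offset_ratio * (ref_off - ref_on), offset_min_tol)
    return abs(ref_off - est_off) <= tolerance


def note_f_measure(ref_notes, est_notes, with_offset: bool = False) -> float:
    """F-measure over greedily matched notes (each estimate used at most once)."""
    used, matched = set(), 0
    for ref in ref_notes:
        for j, est in enumerate(est_notes):
            if j not in used and notes_match(ref, est, with_offset):
                used.add(j)
                matched += 1
                break
    if matched == 0:
        return 0.0
    precision = matched / len(est_notes)
    recall = matched / len(ref_notes)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = [(0.0, 1.0, 60), (1.0, 1.2, 62)]
    estimated = [(0.02, 1.15, 60), (1.03, 1.30, 62)]
    print(note_f_measure(reference, estimated))                    # onset only
    print(note_f_measure(reference, estimated, with_offset=True))  # onset w/ offset
```

In the example, both onsets fall within 50 ms, but the second estimated note ends 100 ms late against a 50 ms offset tolerance, so it only counts under the onset-only metric.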

We used the Mridangam Stroke Dataset (Anantapadmanabhan et al., 2013) for two novel multiclass classification tasks: Stroke classification and Tonic classification. The Mridangam is a pitched percussion instrument used in Carnatic music, which is a sub-genre of Indian classical music. This dataset comprises 10 different strokes played on Mridangams with 6 different tonics.

Vocal Imitations (Kim et al., 2018a) is a novel multiclass classification task, where the goal is to match a vocal imitation of a sound with the sound that is being imitated. The dataset contains 5601 vocal imitations of 302 reference sounds, organized by the AudioSet ontology. Given a vocal sound, the classification task is to retrieve the original audio it is imitating.

This is a novel multiclass classification task derived from the VoxLingua107 dataset (Valk and Alumäe, 2021). The goal of the task is to identify the spoken language in an audio file. For HEAR we selected the top 10 most frequent languages from the development set, which resulted in just over 5 hours of audio over 972 audio clips.

Table Info:

embedding type: timestamp (T) or scene (S)
predictor type: multiclass (C) or multilabel (L)
split method used during downstream evaluation: train/validation/test (TVT) or K-fold
duration of clips in seconds
total number of audio clips in a task
primary evaluation metric
whether or not the task is novel. Novel tasks are not comparable to the literature.

Note on clip duration:

For all tasks except FSD50K, clips were standardized to one duration using padding or trimming, typically the 95th percentile length in the original corpus.
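A minimal sketch of this standardization is shown below. The function names are hypothetical, and two details are assumptions made for the example: the percentile uses the nearest-rank method, and padding appends zeros (silence) at the end of the clip.

```python
def percentile_length(durations: list[float], percent: int = 95) -> float:
    """Nearest-rank percentile of the clip durations (percent is an integer 1..100)."""
    ordered = sorted(durations)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling of percent * n / 100
    return ordered[max(rank, 1) - 1]


def fix_duration(samples: list[float], sample_rate: int, duration_sec: float) -> list[float]:
    """Trim or zero-pad a mono clip to exactly duration_sec seconds."""
    target = int(round(sample_rate * duration_sec))
    if len(samples) >= target:
        return samples[:target]                            # trim long clips
    return samples + [0.0] * (target - len(samples))       # pad short clips at the end


if __name__ == "__main__":
    corpus_durations = [float(d) for d in range(1, 21)]    # toy corpus: 1 s to 20 s
    target_sec = percentile_length(corpus_durations)
    clip = [0.1] * 8                                       # 2 s of audio at a toy 4 Hz rate
    print(target_sec, len(fix_duration(clip, sample_rate=4, duration_sec=target_sec)))
```

Choosing a high percentile rather than the maximum keeps most clips intact while preventing a few very long recordings from dominating the padded length.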

License

The datasets have different open licenses. Please see LICENSE.txt for each individual dataset’s license.