Language-based Audio Moment Retrieval

Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu

LY Corporation, Japan

Under review
[Paper] [CLAP Features of Clotho-moment] [Wav files]
[Code (AMR)] [Code (Data preparation & Zero-shot SED)]

Abstract

In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving Recall1@0.7 by 9.00 points. Our datasets and code are publicly available at https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval.

Fig. 1. An overview of audio moment retrieval (AMR). Given a long audio and a text query,
the system retrieves the relevant moments as start and end timestamps.
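For reference, Recall1@0.7 (reported in the abstract) follows the standard moment retrieval convention: the fraction of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU of at least 0.7, where moments are (start, end) pairs as illustrated in Fig. 1. Below is a minimal Python sketch of this metric, assuming that standard definition; the helper names are ours and not from the released code.

def temporal_iou(pred, gt):
    """IoU between two moments, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall1_at_iou(top1_preds, gts, threshold=0.7):
    """Fraction of queries whose top-1 predicted moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)

# Example: only the first of the two top-1 predictions reaches IoU >= 0.7, so the score is 0.5.
print(recall1_at_iou([(10.2, 24.8), (5.0, 9.0)], [(10.0, 25.0), (30.0, 40.0)]))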

Clotho-Moment

We construct datasets for retrieving audio moments relevant to a text query from untrimmed audio data. The audio we assume for AMR is one minute long and contains specific scenes that can be described by text, such as a sports broadcast that includes a scene in which a goal is scored and people cheer. Based on these assumptions, we create a large-scale simulated AMR dataset named Clotho-Moment. As shown in Figure 2, this dataset is generated using Clotho [1] and Walking Tours [2] as the foreground and background, respectively.

Fig. 2. Clotho-Moment, which contains audio with text query-audio moment pairs, is generated by overlaying Clotho on Walking Tours.
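The simulation can be summarized as placing a captioned Clotho clip (foreground) at a random position within a one-minute Walking Tours segment (background) and recording the placement as the ground-truth moment for the clip's caption. The following is a rough sketch under these assumptions only; gain control, multiple foreground events per mixture, and other details of the released data-preparation code are omitted, and the function name is illustrative.

import random
import numpy as np

def simulate_amr_example(background, foreground, caption, sr=16000, duration=60.0):
    """Overlay a captioned foreground clip on a long background recording.

    background, foreground: 1-D numpy float arrays sampled at `sr`.
    Returns the one-minute mixture and a {query, moment} annotation.
    """
    n_total = int(duration * sr)
    assert len(background) >= n_total and len(foreground) <= n_total
    mixture = background[:n_total].copy()                    # crop the background to one minute

    onset_sec = random.uniform(0.0, duration - len(foreground) / sr)
    start = int(onset_sec * sr)
    mixture[start:start + len(foreground)] += foreground     # naive additive mixing (no gain control)

    moment = (onset_sec, onset_sec + len(foreground) / sr)   # ground-truth (start, end) in seconds
    return mixture, {"query": caption, "moment": moment}

# Example usage with random placeholder audio in place of real Clotho / Walking Tours files:
bg = np.random.randn(16000 * 70) * 0.01
fg = np.random.randn(16000 * 12) * 0.1
mix, ann = simulate_amr_example(bg, fg, "A plane accelerates and takes off")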

Audio moment DETR (AM-DETR)

We propose Audio Moment DETR (AM-DETR), an AMR model inspired by models proposed for video moment retrieval (VMR) [3-5]. As shown in Figure 3, AM-DETR consists of a feature extractor and a DETR-based network, following the structure of VMR models. The key component of AM-DETR is the DETR-based network [3-5], which captures the temporal dependencies between audio features. Specifically, the architecture of AM-DETR is based on QD-DETR [4], which is simple yet achieves high retrieval performance in VMR by capturing both the temporal dependencies between video frames and the cross-modal dependencies between frames and the text query.

Fig. 3. The architecture of Audio Moment DETR. The model first extracts embeddings with the audio/text encoders, and the DETR-based network then transforms the embeddings into pairs of an audio moment and its confidence score.
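To make the data flow in Fig. 3 concrete, here is a highly simplified PyTorch sketch of a QD-DETR-style network: audio features attend to the text-query tokens via cross-attention, a transformer encoder models their temporal dependencies, and a DETR-style decoder with learnable moment queries outputs normalized (center, width) spans together with confidence scores. All dimensions, layer counts, and module names are placeholders of ours; the actual AM-DETR implementation in Lighthouse [7] differs in detail.

import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    """Simplified QD-DETR-style network for audio moment retrieval (illustrative only)."""

    def __init__(self, dim=256, num_queries=10, num_heads=8, num_layers=2):
        super().__init__()
        # Cross-modal attention: audio frames attend to text-query tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal encoder over the query-conditioned audio features.
        enc_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # DETR-style decoder with learnable moment queries.
        dec_layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.moment_queries = nn.Embedding(num_queries, dim)
        # Prediction heads: normalized (center, width) span and a confidence score per query.
        self.span_head = nn.Linear(dim, 2)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T, dim) clip-level embeddings; text_feats: (B, L, dim) token embeddings.
        fused, _ = self.cross_attn(audio_feats, text_feats, text_feats)
        memory = self.encoder(fused)
        queries = self.moment_queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        decoded = self.decoder(queries, memory)
        spans = self.span_head(decoded).sigmoid()       # (B, num_queries, 2): center, width in [0, 1]
        scores = self.score_head(decoded).squeeze(-1)   # (B, num_queries): confidence logits
        return spans, scores

# Example forward pass with placeholder features (e.g., CLAP-like audio clips and text tokens).
model = MomentDETRSketch()
spans, scores = model(torch.randn(2, 60, 256), torch.randn(2, 16, 256))

During training, DETR-style models additionally match predicted moments to ground-truth moments (e.g., Hungarian matching, as in DETR) before computing span and classification losses; the sketch above covers only the forward pass.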

Sample data & predicted moments

We present sample data from Clotho-Moment and the UnAV-100 subset (our manually annotated data derived from [6]), along with the moments predicted by AM-DETR trained on Clotho-Moment. AM-DETR is trained using the open-source library Lighthouse [7].

Clotho-Moment

Audio sample | Text query | Ground-truth moment | Predicted moment

A woman is making an announcement about something

Bird are singing nearby a source of water such as a small waterfall.

The fast cars pass by as the rain falls.

A plane accelerates and takes off before fading in the distance.

A burst of metal instruments playing is followed by clapping, people talking and more rhythmic sounds of horn music.

A sink of water is being used to wash up in.


UnAV-100 subset

Audio sample | Text query | Ground-truth moment | Predicted moment

Someone dries his hair with a hair dryer.

Two people perform tap dance to background music

Water cascades down from a waterfall.

Spectators watch sports and cheer.

Men and women sing in a chorus in a hall.

A sports car starts with a loud noise.


References

[1] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. ICASSP, 2020, pp. 736–740.
[2] S. Venkataramanan, M. N. Rizve, J. Carreira, Y. M. Asano, and Y. Avrithis, “Is imagenet worth 1 video? learning strong image encoders from 1 long unlabelled video,” in Proc. ICLR, 2024.
[3] J. Lei, T. L. Berg, and M. Bansal, “Detecting moments and highlights in videos via natural language queries,” in Proc. NeurIPS, vol. 34, 2021, pp. 11846–11858.
[4] W. Moon, S. Hyun, S. Park, D. Park, and J.-P. Heo, “Query-dependent video representation for moment retrieval and highlight detection,” in Proc. CVPR, 2023, pp. 23023–23033.
[5] W. Moon, S. Hyun, S. Lee, and J.-P. Heo, “Correlation-guided query-dependency calibration in video representation learning for temporal grounding,” arXiv preprint arXiv:2311.08835, 2023.
[6] T. Geng, T. Wang, J. Duan, R. Cong, and F. Zheng, “Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline,” in Proc. CVPR, 2023, pp. 22942–22951.
[7] T. Nishimura, S. Nakada, H. Munakata, and T. Komatsu, “Lighthouse: A user-friendly library for reproducible video moment retrieval and highlight detection,” arXiv preprint arXiv:2408.02901, 2024.