Language-based Audio Moment Retrieval

Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu

LY Corporation, Japan

Under review
[Paper] [CLAP Features of Clotho-moment] [Wav files]
[Code (AMR)] [Code (Data preparation & Zero-shot SED)]

Abstract

In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving Recall1@0.7 by 9.00 points. Our datasets and code are publicly available at https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval.

Fig. 1. An overview of audio moment retrieval (AMR). Given a long audio and a text query,
the system retrieves the relevant moments as start and end timestamps.
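For reference, Recall1@0.7 (reported in the abstract) follows the standard moment retrieval convention: the fraction of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU of at least 0.7, where moments are (start, end) pairs as illustrated in Fig. 1. Below is a minimal Python sketch of this metric, assuming that standard definition; the helper names are ours and not from the released code.

def temporal_iou(pred, gt):
    """IoU between two moments, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall1_at_iou(top1_preds, gts, threshold=0.7):
    """Fraction of queries whose top-1 predicted moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)

# Example: only the first of the two top-1 predictions reaches IoU >= 0.7, so the score is 0.5.
print(recall1_at_iou([(10.2, 24.8), (5.0, 9.0)], [(10.0, 25.0), (30.0, 40.0)]))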

Clotho-Moment

We construct datasets for retrieving audio moments relevant to a text query from untrimmed audio data. The audio we assume for AMR is one minute long and contains specific scenes that can be described by text, such as a sports broadcast that includes a scene in which a goal is scored and people cheer. Based on these assumptions, we create a large-scale simulated AMR dataset named Clotho-Moment. As shown in Figure 2, this dataset is generated using Clotho [1] and Walking Tours [2] as the foreground and background, respectively.

Fig. 2. Clotho-Moment, which contains audio with text query-audio moment pairs, is generated by overlaying Clotho on Walking Tours.
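The simulation can be summarized as placing a captioned Clotho clip (foreground) at a random position within a one-minute Walking Tours segment (background) and recording the placement as the ground-truth moment for the clip's caption. The following is a rough sketch under these assumptions only; gain control, multiple foreground events per mixture, and other details of the released data-preparation code are omitted, and the function name is illustrative.

import random
import numpy as np

def simulate_amr_example(background, foreground, caption, sr=16000, duration=60.0):
    """Overlay a captioned foreground clip on a long background recording.

    background, foreground: 1-D numpy float arrays sampled at `sr`.
    Returns the one-minute mixture and a {query, moment} annotation.
    """
    n_total = int(duration * sr)
    assert len(background) >= n_total and len(foreground) <= n_total
    mixture = background[:n_total].copy()                    # crop the background to one minute

    onset_sec = random.uniform(0.0, duration - len(foreground) / sr)
    start = int(onset_sec * sr)
    mixture[start:start + len(foreground)] += foreground     # naive additive mixing (no gain control)

    moment = (onset_sec, onset_sec + len(foreground) / sr)   # ground-truth (start, end) in seconds
    return mixture, {"query": caption, "moment": moment}

# Example usage with random placeholder audio in place of real Clotho / Walking Tours files:
bg = np.random.randn(16000 * 70) * 0.01
fg = np.random.randn(16000 * 12) * 0.1
mix, ann = simulate_amr_example(bg, fg, "A plane accelerates and takes off")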

Audio moment DETR (AM-DETR)

We propose Audio Moment DETR (AM-DETR), an AMR model inspired by models proposed for video moment retrieval (VMR) [3-5]. As shown in Figure 3, AM-DETR consists of a feature extractor and a DETR-based network, following the structure of VMR models. The key component of AM-DETR is the DETR-based network [3-5], which captures the temporal dependencies between audio features. Specifically, the architecture of AM-DETR is based on QD-DETR [4], which is simple yet achieves high retrieval performance in VMR by capturing both the temporal dependencies between video frames and the cross-modal dependencies between frames and the text query.

Fig. 3. The architecture of Audio Moment DETR. The model first extracts embeddings with the audio/text encoders, and the DETR-based network then transforms the embeddings into pairs of an audio moment and its confidence score.
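To make the data flow in Fig. 3 concrete, here is a highly simplified PyTorch sketch of a QD-DETR-style network: audio features attend to the text-query tokens via cross-attention, a transformer encoder models their temporal dependencies, and a DETR-style decoder with learnable moment queries outputs normalized (center, width) spans together with confidence scores. All dimensions, layer counts, and module names are placeholders of ours; the actual AM-DETR implementation in Lighthouse [7] differs in detail.

import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    """Simplified QD-DETR-style network for audio moment retrieval (illustrative only)."""

    def __init__(self, dim=256, num_queries=10, num_heads=8, num_layers=2):
        super().__init__()
        # Cross-modal attention: audio frames attend to text-query tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal encoder over the query-conditioned audio features.
        enc_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # DETR-style decoder with learnable moment queries.
        dec_layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.moment_queries = nn.Embedding(num_queries, dim)
        # Prediction heads: normalized (center, width) span and a confidence score per query.
        self.span_head = nn.Linear(dim, 2)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T, dim) clip-level embeddings; text_feats: (B, L, dim) token embeddings.
        fused, _ = self.cross_attn(audio_feats, text_feats, text_feats)
        memory = self.encoder(fused)
        queries = self.moment_queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        decoded = self.decoder(queries, memory)
        spans = self.span_head(decoded).sigmoid()       # (B, num_queries, 2): center, width in [0, 1]
        scores = self.score_head(decoded).squeeze(-1)   # (B, num_queries): confidence logits
        return spans, scores

# Example forward pass with placeholder features (e.g., CLAP-like audio clips and text tokens).
model = MomentDETRSketch()
spans, scores = model(torch.randn(2, 60, 256), torch.randn(2, 16, 256))

During training, DETR-style models additionally match predicted moments to ground-truth moments (e.g., Hungarian matching, as in DETR) before computing span and classification losses; the sketch above covers only the forward pass.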

Sample data & predicted moments

We present sample data from Clotho-Moment and the UnAV-100 subset (our manually annotated data derived from [6]), along with the moments predicted by AM-DETR trained on Clotho-Moment. AM-DETR is trained using the open-source library Lighthouse [7].

Clotho-Moment

Audio sample | Text query | Ground-truth moment | Predicted moment

A woman is making an announcement about something

Bird are singing nearby a source of water such as a small waterfall.

The fast cars pass by as the rain falls.

A plane accelerates and takes off before fading in the distance.

A burst of metal instruments playing is followed by clapping, people talking and more rhythmic sounds of horn music.

A sink of water is being used to wash up in.


UnAV-100 subset

Audio sample | Text query | Ground-truth moment | Predicted moment

Someone dries his hair with a hair dryer.

Two people perform tap dance to background music

Water cascades down from a waterfall.

Spectators watch sports and cheer.

Men and women sing in a chorus in a hall.

A sports car starts with a loud noise.


References

[1] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. ICASSP, 2020, pp. 736–740.
[2] S. Venkataramanan, M. N. Rizve, J. Carreira, Y. M. Asano, and Y. Avrithis, “Is imagenet worth 1 video? learning strong image encoders from 1 long unlabelled video,” in Proc. ICLR, 2024.
[3] J. Lei, T. L. Berg, and M. Bansal, “Detecting moments and highlights in videos via natural language queries,” in Proc. NeurIPS, vol. 34, 2021, pp. 11846–11858.
[4] W. Moon, S. Hyun, S. Park, D. Park, and J.-P. Heo, “Query-dependent video representation for moment retrieval and highlight detection,” in Proc. CVPR, 2023, pp. 23023–23033.
[5] W. Moon, S. Hyun, S. Lee, and J.-P. Heo, “Correlation-guided query-dependency calibration in video representation learning for temporal grounding,” arXiv preprint arXiv:2311.08835, 2023.
[6] T. Geng, T. Wang, J. Duan, R. Cong, and F. Zheng, “Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline,” in Proc. CVPR, 2023, pp. 22942–22951.
[7] T. Nishimura, S. Nakada, H. Munakata, and T. Komatsu, “Lighthouse: A user-friendly library for reproducible video moment retrieval and highlight detection,” arXiv preprint arXiv:2408.02901, 2024.