Hokuto Munakata, Taichi Nishimura, Shota Nakada, Tatsuya Komatsu
LY Corporation, Japan
Under review
[Paper]
[CLAP Features of Clotho-moment]
[Wav files]
[Code (AMR)]
[Code (Data preparation & Zero-shot SED)]
In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips in an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work on AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. Inspired by similar video moment retrieval tasks, this model captures temporal dependencies within audio features, thereby surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving Recall1@0.7 by 9.00 points. Our datasets and code are publicly available at https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval.
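For reference, Recall1@0.7 counts a query as correct when the top-1 predicted moment overlaps the ground-truth moment with a temporal IoU of at least 0.7. The following minimal sketch illustrates the metric; function and variable names are ours, and a single ground-truth moment per query is assumed for simplicity.

```python
def temporal_iou(pred, gt):
    """IoU between two moments given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall1_at(top1_preds, ground_truths, threshold=0.7):
    """Fraction of queries whose top-1 moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(top1_preds, ground_truths))
    return hits / len(ground_truths)

# Example: two queries, moments in seconds.
preds = [(12.0, 27.5), (40.0, 55.0)]
gts = [(11.0, 28.0), (10.0, 20.0)]
print(recall1_at(preds, gts))  # 0.5
```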
We construct datasets for retrieving audio moments relevant to a text query from untrimmed audio data. The audio we assume for AMR is one minute long and includes specific scenes that can be described by text, such as a sports broadcast containing a scene in which a goal is scored and people cheer. Based on these assumptions, we create a large-scale simulated AMR dataset named Clotho-Moment. As shown in Figure 2, this dataset is generated using Clotho [1] and Walking Tours [2] as the foreground and background, respectively.
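As a rough illustration of this construction (a sketch under our own assumptions, not the exact released pipeline: the file names, mixing gain, mono audio, and single-event case are assumed), one Clotho-Moment style example can be built by pasting a Clotho clip into a one-minute Walking Tours excerpt and recording the insertion span as the moment annotation for the clip's caption.

```python
import numpy as np
import soundfile as sf

rng = np.random.default_rng(0)

# Background: an (assumed mono, one-minute) excerpt from a Walking Tours recording.
background, sr = sf.read("walking_tours_excerpt.wav")
# Foreground: a short Clotho clip whose caption becomes the text query.
foreground, _ = sf.read("clotho_clip.wav")  # assumed to share the sample rate
caption = "A plane accelerates and takes off"

# Insert the clip at a random offset that keeps it inside the background.
start = int(rng.integers(0, len(background) - len(foreground)))
mixture = background.copy()
mixture[start:start + len(foreground)] += 0.8 * foreground  # assumed mixing gain

# The moment annotation is the inserted clip's span in seconds.
moment = (start / sr, (start + len(foreground)) / sr)
sf.write("clotho_moment_example.wav", mixture, sr)
print(caption, moment)
```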
We propose Audio Moment DETR (AM-DETR), an AMR model inspired by models proposed for video moment retrieval (VMR) [3-5]. As shown in Figure 3, AM-DETR consists of a feature extractor and a DETR-based network, following the VMR models. The key part of AM-DETR is the DETR-based network [3-5], which captures the temporal dependencies between audio features. Specifically, the architecture of AM-DETR is based on QD-DETR, which is simple yet achieves high retrieval performance for VMR by capturing both the temporal dependencies between video frames and the cross-modal dependencies between the frames and the text query [4].
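To make the data flow concrete, here is a schematic PyTorch sketch of such a pipeline: clip-level audio features are conditioned on the text query via cross-attention, a transformer encoder captures temporal dependencies, and a DETR-style decoder with learnable moment queries predicts (center, width) spans with confidence scores. Layer sizes, the number of moment queries, and the span parameterization are our illustrative assumptions; the released code defines the actual architecture.

```python
import torch
import torch.nn as nn

class AudioMomentDETRSketch(nn.Module):
    def __init__(self, feat_dim=512, d_model=256, num_queries=10):
        super().__init__()
        self.audio_proj = nn.Linear(feat_dim, d_model)   # project clip-level audio features
        self.text_proj = nn.Linear(feat_dim, d_model)    # project text-token features
        # Cross-attention conditions each audio frame on the text query (QD-DETR style).
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, 8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, 8, batch_first=True), num_layers=2)
        self.moment_queries = nn.Embedding(num_queries, d_model)  # learnable DETR queries
        self.span_head = nn.Linear(d_model, 2)   # (center, width), normalized to [0, 1]
        self.score_head = nn.Linear(d_model, 1)  # confidence of each candidate moment

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T, feat_dim) features of the untrimmed audio
        # text_feats:  (B, L, feat_dim) features of the text query
        a = self.audio_proj(audio_feats)
        t = self.text_proj(text_feats)
        a, _ = self.cross_attn(query=a, key=t, value=t)  # text-conditioned audio frames
        memory = self.encoder(a)                         # temporal dependencies over frames
        q = self.moment_queries.weight.unsqueeze(0).expand(a.size(0), -1, -1)
        decoded = self.decoder(q, memory)                # one slot per candidate moment
        spans = self.span_head(decoded).sigmoid()        # (B, num_queries, 2)
        scores = self.score_head(decoded).squeeze(-1)    # (B, num_queries)
        return spans, scores

# Example: a one-minute recording represented by 60 clip-level features.
model = AudioMomentDETRSketch()
spans, scores = model(torch.randn(1, 60, 512), torch.randn(1, 8, 512))
print(spans.shape, scores.shape)  # torch.Size([1, 10, 2]) torch.Size([1, 10])
```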
We present sample data from Clotho-Moment and from the UnAV-100 subset (our manually annotated data based on [6]), together with the moments predicted by AM-DETR trained on Clotho-Moment. AM-DETR is trained using the open-source library Lighthouse [7].
Clotho-Moment:

| Audio sample | Text Query | Ground-truth moment | Predicted moment |
|---|---|---|---|
| | A woman is making an announcement | | |
| | Bird are singing nearby a source of water | | |
| | The fast cars pass by as the rain falls. | | |
| | A plane accelerates and takes off | | |
| | A burst of metal instruments playing is | | |
| | A sink of water is being used to wash up in. | | |
UnAV-100 subset:

| Audio sample | Text Query | Ground-truth moment | Predicted moment |
|---|---|---|---|
| | Someone dries his hair with a hair dryer. | | |
| | Two people perform tap dance to background music | | |
| | Water cascades down from a waterfall. | | |
| | Spectators watch sports and cheer. | | |
| | Men and women sing in a chorus in a hall. | | |
| | A sports car starts with a loud noise. | | |