CASTELLA: Long Audio Dataset
with Captions and Temporal Boundaries

Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu

LY Corporation, Japan


[Paper] [Annotation data] [Audio data] [CLAP Features (Zenodo)] [CLAP Features (HuggingFace)]

TL;DR

This repository provides a new dataset that includes long audio recordings, captions of local audio events, and their temporal boundaries. Please also check the demo page.

Abstract

We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has many promising applications, no established benchmark with real-world data exists. Early AMR studies trained models solely on synthetic datasets and evaluated them on annotated sets of fewer than 100 samples, which made the reported performance less reliable. To ensure performance in real-world environments, we present CASTELLA, a large-scale, manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the train, validation, and test splits, respectively, making it 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on synthetic data outperforms a model trained solely on synthetic data by 5.4 points in Recall1@0.7. The dataset and code will be made publicly available upon paper acceptance.
Fig. 1. Sample of CASTELLA. Each sample contains a long audio recording, global and local captions, and temporal boundaries.
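
For context, AMR performance is commonly reported as Recall@1 at a temporal IoU threshold (Recall1@0.7 above). Below is a minimal sketch of that metric, assuming each predicted and ground-truth moment is a (start, end) pair in seconds and each query has a single ground-truth moment; the helper names are ours, not from the released code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall1_at_iou(predictions, ground_truths, threshold=0.7):
    """Recall@1: fraction of queries whose top-1 predicted moment
    overlaps the ground-truth moment with IoU >= threshold."""
    hits = sum(
        1 for pred, gt in zip(predictions, ground_truths)
        if temporal_iou(pred, gt) >= threshold
    )
    return hits / len(predictions)
```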

How to use?

Annotation data and audio data are available at GitHub - line/castella and GitHub - h-munakata/CASTELLA-audio, respectively. For audio moment retrieval, we provide a recipe using the CASTELLA dataset at GitHub - line/lighthouse. The CLAP features extracted for this recipe, along with the setup procedure, are available on Zenodo or HuggingFace.
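
As a quick-start sketch of reading the annotations (the file name and field names below are illustrative assumptions; consult line/castella for the actual schema), pairing each local caption with its temporal boundary might look like this:

```python
import json

# Hypothetical annotation file name and schema; see line/castella
# for the actual file layout.
with open("castella_train.json") as f:
    samples = json.load(f)

for sample in samples:
    print(sample["global_caption"])  # one caption for the whole recording
    for event in sample["local_events"]:
        # Each local caption is grounded by a temporal boundary in seconds.
        print(f'  [{event["start"]:.1f}s - {event["end"]:.1f}s] {event["caption"]}')
```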

Examples

Here are some examples from the CASTELLA dataset.

Global Caption

A television program features music, sound effects, and several people engaged in conversation

Local Captions

  1. Sound effects, music, and cheering fill the air
  2. A woman sings along with the piano
  3. A man and a woman have a conversation as music plays in the background
  4. Cheering fills the air as a man and a woman speak
  5. Amid cheering and applause, a woman sings to the piano

Global Caption

A man works with wood and machinery while talking

Local Captions

  1. A man talks as a tape measure snaps shut
  2. Wood is cut while machinery runs
  3. A man talks as wood is being cut while machinery runs

Global Caption

Someone is cooking with a metal pot or frying pan on a gas stove using water

Local Captions

  1. Water flows
  2. A gas stove ignites
  3. Water is boiling
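
Conceptually, each sample above bundles one global caption with several locally grounded captions. A record for the woodworking example might be structured as follows; the field names mirror the illustrative schema in the loading sketch above, and the timestamps are placeholders, not actual annotations:

```python
# Illustrative structure only: field names match the loading sketch above,
# and the start/end values are placeholders, not real CASTELLA boundaries.
sample = {
    "global_caption": "A man works with wood and machinery while talking",
    "local_events": [
        {"caption": "A man talks as a tape measure snaps shut",
         "start": 0.0, "end": 4.5},    # placeholder times in seconds
        {"caption": "Wood is cut while machinery runs",
         "start": 4.5, "end": 20.0},   # placeholder
        {"caption": "A man talks as wood is being cut while machinery runs",
         "start": 20.0, "end": 35.0},  # placeholder
    ],
}
```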

Citation

@article{munakata2025castella,
    title={CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries},
    author={Munakata, Hokuto and Imamura, Takehiro and Nishimura, Taichi and Komatsu, Tatsuya},
    journal={arXiv preprint arXiv:2511.15131},
    year={2025},
}