Conference paper
Urban sound & sight : Dataset and benchmark for audio-visual urban scene understanding
Date
2022
Citation:
Fuentes, M., Steers, B., Zinemanas, P. et al. Urban sound & sight : Dataset and benchmark for audio-visual urban scene understanding [online]. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23-27 May, pp. 141-145. Piscataway, NJ : IEEE, 2022. DOI 10.1109/ICASSP43922.2022.9747644
10.1109/ICASSP43922.2022.9747644
Authors
Fuentes, Magdalena
Steers, Bea
Zinemanas, Pablo
Rocamora, Martín
Bondi, Luca
Wilkins, Julia
Shi, Qianyi
Hou, Yao
Das, Samarjit
Serra, Xavier
Bello, Juan Pablo
Institution
Abstract
Automatic audio-visual urban traffic understanding is a growing area of research with many potential applications of value to industry, academia, and the public sector. Yet the lack of well-curated resources for training and evaluating models in this area hinders their development. To address this, we present a curated audio-visual dataset, Urban Sound & Sight (Urbansas), developed for investigating the detection and localization of sounding vehicles in the wild. Urbansas consists of 12 hours of unlabeled data along with 3 hours of manually annotated data, including bounding boxes with classes and unique IDs for vehicles, and strong audio labels featuring vehicle types and indicating off-screen sounds. We discuss the challenges presented by the dataset and how to use its annotations for the localization of vehicles in the wild through audio models.