dc.contributor: Arbeláez Escalante, Pablo Andrés
dc.contributor: Valderrama Manrique, Mario Andrés
dc.contributor: Giraldo Trujillo, Luis Felipe
dc.contributor: CINFONIA
dc.creator: Verlyck, Mathilde Agathe
dc.date.accessioned: 2022-10-06T20:30:48Z
dc.date.accessioned: 2023-09-07T01:56:41Z
dc.date.available: 2022-10-06T20:30:48Z
dc.date.available: 2023-09-07T01:56:41Z
dc.date.created: 2022-10-06T20:30:48Z
dc.date.issued: 2022-06-06
dc.identifier: http://hdl.handle.net/1992/62532
dc.identifier: instname:Universidad de los Andes
dc.identifier: reponame:Repositorio Institucional Séneca
dc.identifier: repourl:https://repositorio.uniandes.edu.co/
dc.identifier.uri: https://repositorioslatinoamericanos.uchile.cl/handle/2250/8728821
dc.description.abstract: Surgical workflow analysis aims to improve the safety, planning, and efficiency of surgical procedures. However, most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new model for holistic surgical scene understanding. Jointly with the release of the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) dataset by the Biomedical Computer Vision (BCV) group at the Universidad de los Andes, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a solid approach to surgical scene understanding. PSI-AVA includes annotations for both long-term reasoning (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. TAPIR leverages the dataset's multi-level annotations, benefiting from the representation learned on the instrument detection task to improve its classification capacity. Lastly, our experimental results on both PSI-AVA and other publicly available databases demonstrate that TAPIR is a stepping stone for future research on this holistic benchmark.
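The abstract's central design idea is to reuse the representation learned for instrument detection to strengthen the frame- and box-level recognition heads. As a rough illustration only, the following minimal PyTorch sketch shows one way such multi-level feature sharing can be wired up; it is not the thesis's actual architecture. The class name TAPIRSketch, the feature dimension, and all class counts are placeholder assumptions, and the video backbone is reduced to a single transformer layer for brevity.

import torch
import torch.nn as nn

class TAPIRSketch(nn.Module):
    """Hypothetical sketch: a video transformer encodes a clip, and
    per-box features from an instrument detector are fused with the
    clip-level representation before the per-task classification heads."""

    def __init__(self, feat_dim=768, num_phases=11, num_steps=20,
                 num_instruments=7, num_actions=16):
        # All dimensions and class counts above are placeholders,
        # not taken from the PSI-AVA dataset definition.
        super().__init__()
        # Stand-in for a video transformer backbone: one encoder layer
        # over the clip's patch/frame tokens.
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                       batch_first=True), num_layers=1)
        # Projection for per-box features coming from a pretrained
        # instrument detector (assumed frozen in this sketch).
        self.box_proj = nn.Linear(feat_dim, feat_dim)
        # Long-term heads operate on the clip-level token.
        self.phase_head = nn.Linear(feat_dim, num_phases)
        self.step_head = nn.Linear(feat_dim, num_steps)
        # Short-term heads operate on fused clip + box features.
        self.instrument_head = nn.Linear(2 * feat_dim, num_instruments)
        self.action_head = nn.Linear(2 * feat_dim, num_actions)

    def forward(self, clip_tokens, box_feats):
        # clip_tokens: (B, T, D) tokens from the video backbone.
        # box_feats:   (B, N, D) per-box features from the detector.
        encoded = self.video_encoder(clip_tokens)           # (B, T, D)
        clip_repr = encoded.mean(dim=1)                     # (B, D)
        boxes = self.box_proj(box_feats)                    # (B, N, D)
        # Broadcast the clip representation to every detected box.
        fused = torch.cat(
            [boxes, clip_repr.unsqueeze(1).expand_as(boxes)], dim=-1)
        return {
            "phase": self.phase_head(clip_repr),            # (B, num_phases)
            "step": self.step_head(clip_repr),              # (B, num_steps)
            "instrument": self.instrument_head(fused),      # (B, N, num_instruments)
            "action": self.action_head(fused),              # (B, N, num_actions)
        }

# Example forward pass with random tensors (2 clips, 16 tokens each,
# 4 detected boxes per clip):
model = TAPIRSketch()
out = model(torch.randn(2, 16, 768), torch.randn(2, 4, 768))
print({k: tuple(v.shape) for k, v in out.items()})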
dc.language: eng
dc.publisher: Universidad de los Andes
dc.publisher: Maestría en Ingeniería Biomédica
dc.publisher: Facultad de Ingeniería
dc.publisher: Departamento de Ingeniería Biomédica
dc.rights: Attribution-NonCommercial 4.0 International
dc.rights: http://creativecommons.org/licenses/by-nc/4.0/
dc.rights: info:eu-repo/semantics/openAccess
dc.rights: http://purl.org/coar/access_right/c_abf2
dc.title: TAPIR: Transformers for Action, Phase, Instrument, and steps Recognition
dc.type: Master's thesis (Trabajo de grado - Maestría)