dc.contributor | Santos Díaz, Alejandro | |
dc.contributor | School of Engineering and Sciences | |
dc.contributor | Soenksen, Luis Ruben | |
dc.contributor | Montesinos Silva, Luis Arturo | |
dc.contributor | Ochoa Ruiz, Gilberto | |
dc.contributor | Tamez Peña, José Gerardo | |
dc.contributor | Campus Monterrey | |
dc.creator | Vela Jarquin, Daniel | |
dc.date.accessioned | 2023-07-17T20:49:08Z | |
dc.date.accessioned | 2023-07-19T19:10:25Z | |
dc.date.available | 2023-07-17T20:49:08Z | |
dc.date.available | 2023-07-19T19:10:25Z | |
dc.date.created | 2023-07-17T20:49:08Z | |
dc.date.issued | 2023-06 | |
dc.identifier | Vela Jarquin, D. (2023). Caption generation with transformer models across multiple medical imaging modalities (Master's thesis). Instituto Tecnológico de Monterrey. | |
dc.identifier | https://hdl.handle.net/11285/651044 | |
dc.identifier | https://orcid.org/0000-0001-5624-8791 | |
dc.identifier | 1154114 | |
dc.identifier | 57215617169 | |
dc.identifier.uri | https://repositorioslatinoamericanos.uchile.cl/handle/2250/7715654 | |
dc.description.abstract | Caption generation is the process of automatically producing text excerpts that describe relevant features of an image. This process is applicable to very diverse domains, including healthcare. The field of medicine is characterized by a vast amount of visual information in the form of X-rays, magnetic resonance images, ultrasound, and CT scans, among others. Descriptive texts generated to represent this kind of visual information can help medical professionals achieve a better understanding of the pathologies and cases presented to them and could ultimately allow them to make more informed decisions. In this work, I explore the use of deep learning to address the problem of caption generation in medicine. I propose a Transformer model architecture for caption generation and evaluate its performance on a dataset comprising medical images that span multiple modalities and represented anatomies.
Deep learning models, particularly encoder-decoder architectures, have shown increasingly favorable results in translating from one information modality to another. Typically, the encoder extracts features from the visual data, and the decoder then uses these features to iteratively generate a natural-language sequence that describes the image. Over the years, various deep learning architectures have been proposed for caption generation. The most popular architectures in recent years have involved recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and, only recently, Transformer-type architectures. The Transformer architecture has shown state-of-the-art performance in many natural language processing tasks such as machine translation, question answering, summarization, and, more recently, caption generation. The use of attention mechanisms allows Transformers to better grasp the meaning of words in a sentence in a particular context. All these characteristics make Transformers well suited for caption generation.
In this thesis, I present the development of a deep learning model based on the Transformer architecture that generates captions for medical images of different modalities and anatomies, with the ultimate goal of helping professionals improve medical diagnosis and treatment. The model is tested on the MedPix online database, a compendium of medical imaging cases, and the results are reported. In summary, this work provides a valuable contribution to the field of automated medical image analysis. | |
dc.language | eng | |
dc.publisher | Instituto Tecnológico y de Estudios Superiores de Monterrey | |
dc.relation | acceptedVersion | |
dc.rights | http://creativecommons.org/licenses/by/4.0 | |
dc.rights | openAccess | |
dc.title | Caption generation with transformer models across multiple medical imaging modalities | |
dc.type | Tesis de Maestría / Master's Thesis | |