dc.contributorConant Pablos, Santiago Enrique
dc.contributorEscuela de Ingeniería y Ciencias
dc.contributorHugo Terashima, Marín
dc.contributorGonzález Mendoza, Miguel
dc.contributorNimrod González, Franco
dc.contributorCampus Monterrey
dc.creatorConant Pablos, Santiago Enrique; 56551
dc.creatorGonzález Martínez, Fernando
dc.date.accessioned2022-11-10T22:59:05Z
dc.date.available2022-11-10T22:59:05Z
dc.date.created2022-11-10T22:59:05Z
dc.date.issued2019-10-10
dc.identifierGonzález Martínez, F. (2022). A comparative study of deep learning-based image captioning models for violence description (Master's thesis). Instituto Tecnológico y de Estudios Superiores de Monterrey. Retrieved from: https://hdl.handle.net/11285/649872
dc.identifierhttps://hdl.handle.net/11285/649872
dc.identifierhttps://orcid.org/0000-0002-2510-0767
dc.identifier.urihttps://repositorioslatinoamericanos.uchile.cl/handle/2250/7715852
dc.description.abstractThe safety and security of people will always be a top priority for governments, countries, states, enterprises, and families. One of the greatest advances in security technology was the invention of the surveillance camera, which gives public and private owners the ability to review recorded events and protect their property, providing undeniable proof of what occurred in their absence. It is safe to say that most corporations and some homes have some type of security technology, from simple surveillance systems to more sophisticated ones such as facial and fingerprint recognition. These systems share a drawback: the volume of data each of them generates. Surveillance cameras alone record and store thousands of hours of footage for later review. The problem arises when the volume of data surpasses the human capacity to analyze it; even when humans do analyze it, the quantity and nature of the data can overwhelm them and cause them to miss events that should not be missed. In this work, the events of interest involve violence and suspicious behavior, such as robberies, assaults, street riots, and fights. This presents the need for a system that can recognize such events and generate a brief description that the humans using the system can interpret quickly.

The fields of image and video captioning have been active in computer science for the past decade. Image captioning works by converting an image and words into features using deep learning models, combining them, and predicting what the model believes the output should be for a given state. Over that time, image captioning models have changed considerably. The basic model uses convolutional neural networks for image analysis and recurrent neural networks for sentence analysis and generation. The addition of attention further improved these models by teaching them where to focus when analyzing images and sentences. Finally, the Transformer has come to dominate most tasks in the field thanks to its ability to perform most of its computations in parallel, making it faster than earlier models. These performance improvements are reflected in the works at the top of the leaderboards for image recognition, text generation, and captioning.

The purpose of this work is to create and train models that generate descriptions of normal and violent images. The models proposed are an Encoder-Decoder, an Encoder-Decoder with attention layers, and a Transformer. The base dataset is Flickr8k, a collection of around 8,000 images with five human-written descriptions each. For this work, we extended the dataset with violent images and their descriptions. The descriptions were obtained by asking a group of three people to describe each image, mentioning subjects, objects, actions, and places as best they could; the images were retrieved using Microsoft's Bing API. The models were then evaluated with BLEU-N, METEOR, CIDEr, and ROUGE-L, machine translation evaluation metrics that compare generated sentences to reference sentences to produce an objective score.
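As a concrete illustration of the basic encoder-decoder architecture the abstract describes, the following is a minimal sketch assuming PyTorch and torchvision; the layer sizes, class names, and toy data are illustrative assumptions, not the implementation used in the thesis.

    # Minimal CNN-encoder / RNN-decoder captioning sketch (illustrative only).
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        """CNN encoder: maps an image to a fixed-length feature vector."""
        def __init__(self, embed_size):
            super().__init__()
            resnet = models.resnet18(weights=None)  # pretrained weights would be used in practice
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
            self.fc = nn.Linear(resnet.fc.in_features, embed_size)

        def forward(self, images):
            with torch.no_grad():  # the CNN encoder is commonly kept frozen
                feats = self.backbone(images).flatten(1)
            return self.fc(feats)

    class DecoderRNN(nn.Module):
        """RNN decoder: produces caption-word logits conditioned on image features."""
        def __init__(self, embed_size, hidden_size, vocab_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, vocab_size)

        def forward(self, features, captions):
            # Prepend the image features as the first step of the input sequence.
            inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
            hidden, _ = self.lstm(inputs)
            return self.fc(hidden)  # per-step vocabulary logits

    # Toy forward pass with random data.
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 1000, (2, 12))
    encoder, decoder = EncoderCNN(256), DecoderRNN(256, 512, 1000)
    logits = decoder(encoder(images), captions)
    print(logits.shape)  # torch.Size([2, 13, 1000])

The attention variant would additionally keep the CNN's spatial feature map and let the decoder weight its regions at each generation step, while the Transformer replaces the recurrence with self-attention computed over the whole sequence in parallel.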
Results show that the models can generate sentences that describe normal and violent images, with the Soft-Attention model achieving the best performance on both. These models could help analyze images found on the web, providing a brief description before a user opens an image with violent content. The results can also serve as a base for further improving these models and for building models that analyze violent videos. This could lead to a system that analyzes images and videos in the background and generates a brief description of the events found in them, potentially improving security response times and crime prevention.
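As an illustration of how such sentence-level metrics are computed, here is a minimal sketch of BLEU-N scoring assuming NLTK; the candidate and reference captions are made-up examples, and METEOR, CIDEr, and ROUGE-L would typically be computed with a dedicated toolkit such as the COCO caption evaluation package.

    # BLEU-N scoring of a generated caption against references (illustrative only).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [
        "two men are fighting in the street".split(),
        "a fight between two people on a street".split(),
    ]
    candidate = "two people fighting in a street".split()

    smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
    for n in range(1, 5):
        # BLEU-n uses uniform weights over n-gram orders 1..n.
        weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
        score = sentence_bleu(references, candidate,
                              weights=weights, smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.3f}")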
dc.languageeng
dc.publisherInstituto Tecnológico y de Estudios Superiores de Monterrey
dc.relationacceptedVersion
dc.relationREPOSITORIO NACIONAL CONACYT
dc.rightshttp://creativecommons.org/licenses/by/4.0
dc.rightsopenAccess
dc.titleA comparative study of deep learning-based image captioning models for violence description
dc.typeTesis de Maestría / Master's Thesis

