Analysis of insurance claims data based on networks

Moreno Vásquez, Manuel Alejandro

Other

Date

2020-07-31

Record at:

https://repositorio.unal.edu.co/handle/unal/78807

Author

Moreno Vásquez, Manuel Alejandro

Institution

Universidad Nacional de Colombia

Abstract

This work proposes a statistical methodology for learning relational encodings of influential high-cardinality variables for supervised binary classification. The encoding ranks the categories according to their relative importance for obtaining the outcome of interest in the training data, using the personalized PageRank algorithm for bipartite networks. To obtain the scores, a dyadic analysis is performed on bipartite networks built from the relationships among the categories under study, enriching the interpretability of the intrinsic structures of the target variable during training. One application of the proposed methodology is supervised classification for fraud detection. An experimental case study is conducted on an automobile insurance fraud detection scenario to compare the performance of the encoding techniques.

This work proposes a statistical methodology for learning relational encodings of influential high-cardinality variables for supervised binary classification. The encoding ranks the categories according to their relative importance for obtaining the outcome of interest in the training data, using a personalized PageRank algorithm for bipartite networks. To obtain the scores, a dyadic analysis is performed on the bipartite networks constructed from the relationships among the categories under study, enriching the knowledge and interpretability of the intrinsic structures of the target variable in the training process. Binary classification tasks account for a high percentage of the applications of predictive modelling in industries such as insurance, banking, and telecommunications. The difficulties that the curse of dimensionality causes for widely used statistical learning algorithms make it necessary to explore encoding alternatives to dummy coding and other ad hoc methods in the literature. The proposed methodology provides a statistically driven, structure-oriented representation of categorical variables that can be fed into supervised binary classification models. One application of the proposed methodology is supervised classification for fraud detection. Fraud is a social phenomenon with several impacts, and it is actively researched by the statistics and network communities. Insurance companies are highly exposed to fraudulent claims, and the data required to analyze them are mostly qualitative. An experimental case study is conducted on an automobile insurance fraud detection scenario to compare the performance of the proposed bipartite encoding with the popular target encoding (Micci-Barreca, 2001). The empirical results show that the bipartite network encoding can help random forest models lower the false positive rate. This encoding also highlights relations among categorical variables, making it more interpretable than some of the popular methods in the statistical learning community.
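
To make the idea of a PageRank-based encoding more concrete, the following is a minimal sketch in Python, not the method from the thesis itself. The abstract does not specify how the bipartite networks or the personalization vector are built, so everything here is an assumption made for illustration: the two categorical variables (a hypothetical repair shop and broker), the toy claims, edge weights given by co-occurrence counts, a restart distribution proportional to the number of fraudulent training claims per category, and a damping factor of 0.85.

```python
from collections import defaultdict

# Toy training claims: (repair shop, broker, is_fraud). All values are invented.
claims = [
    ("S1", "B1", 1),
    ("S1", "B2", 1),
    ("S2", "B2", 0),
    ("S3", "B3", 0),
    ("S3", "B2", 0),
]

# Bipartite network: shop nodes on one side, broker nodes on the other.
# Assumption: edge weight = number of claims in which two categories co-occur.
adj = defaultdict(lambda: defaultdict(float))
fraud_count = defaultdict(float)
for shop, broker, is_fraud in claims:
    u, v = ("shop", shop), ("broker", broker)
    adj[u][v] += 1.0
    adj[v][u] += 1.0
    # Assumption: personalization weight = fraudulent claims touching the node.
    fraud_count[u] += is_fraud
    fraud_count[v] += is_fraud


def personalized_pagerank(adj, weights, damping=0.85, max_iter=500, tol=1e-12):
    """Power iteration for personalized PageRank on an undirected weighted graph."""
    nodes = list(adj)
    total = sum(weights.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization weights must have a positive sum")
    restart = {n: weights.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(max_iter):
        # Restart step: jump back to the personalization distribution.
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Walk step: spread each node's rank over its neighbours by edge weight.
        for u in nodes:
            out_weight = sum(adj[u].values())
            for v, w in adj[u].items():
                new_rank[v] += damping * rank[u] * w / out_weight
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


scores = personalized_pagerank(adj, fraud_count)

# Replace each shop category by its score to obtain a numeric feature.
shop_encoding = {name: scores[("shop", name)] for name in ("S1", "S2", "S3")}
encoded_shop_column = [shop_encoding[shop] for shop, _, _ in claims]

for name, value in sorted(shop_encoding.items()):
    print(name, round(value, 4))
```

In this toy setup the shop that appears in the fraudulent claims receives the highest score, and categories linked to it through shared brokers inherit part of that score, which is the relational aspect a purely per-category statistic such as target encoding does not capture. A real study would compute the scores on training data only and apply them to held-out claims.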

