Other
Analysis of insurance claims data based on networks
Author
Moreno Vásquez, Manuel Alejandro
Institution
Abstract
This work proposes a statistical methodology for learning relational encodings
of influential high-cardinality variables for supervised binary classification. The encoding
ranks the categories according to their relative importance for obtaining the outcome
of interest in the training data, using a personalized PageRank algorithm for bipartite
networks. To obtain the scores, a dyadic analysis is performed on the bipartite networks
constructed from the relationships among the categories under study, which enriches the
knowledge and interpretability of the intrinsic structures of the target variable in the
training process.
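The encoding idea described above can be illustrated with a minimal sketch. It is not the author's implementation: the graph construction (co-occurrence counts between the categories of two variables), the restart vector (proportional to the number of positive training rows per category), the damping value, and the names `bipartite_ppr_encoding`, `agent_*` and `shop_*` are all assumptions made only for illustration.

```python
from collections import defaultdict


def bipartite_ppr_encoding(rows, damping=0.85, iters=100):
    """Score categories with personalized PageRank on a bipartite graph.

    rows: iterable of (left_category, right_category, label), label in {0, 1}.
    Assumption: edges are weighted by co-occurrence counts and the restart
    vector is proportional to the number of positive rows per category.
    """
    weights = defaultdict(float)  # directed copies of each undirected edge
    restart = defaultdict(float)  # positive-label count per node
    for left, right, label in rows:
        u, v = ("L", left), ("R", right)  # tag the side to keep the graph bipartite
        weights[(u, v)] += 1.0
        weights[(v, u)] += 1.0
        restart[u] += label
        restart[v] += label

    nodes = sorted({u for u, _ in weights})
    out_weight = defaultdict(float)
    for (u, _), w in weights.items():
        out_weight[u] += w

    total = sum(restart.values())
    if total == 0:
        # No positive rows: fall back to a uniform restart vector.
        p = {n: 1.0 / len(nodes) for n in nodes}
    else:
        p = {n: restart[n] / total for n in nodes}

    # Power iteration: score = (1 - d) * p + d * (transition applied to score).
    score = dict(p)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * p[n] for n in nodes}
        for (u, v), w in weights.items():
            nxt[v] += damping * score[u] * w / out_weight[u]
        score = nxt
    return score


# Hypothetical training rows: (agent, repair shop, fraud label).
rows = [
    ("agent_a", "shop_x", 1),
    ("agent_a", "shop_x", 1),
    ("agent_a", "shop_y", 0),
    ("agent_b", "shop_y", 0),
    ("agent_b", "shop_z", 0),
]
scores = bipartite_ppr_encoding(rows)
```

Each category then receives its score as a numeric feature. Because every node has at least one edge, the scores remain a probability distribution over the categories of both variables.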
Binary classification tasks account for a high percentage of the applications of predictive
modelling in industries such as insurance, banking, and telecommunications. The difficulties
that the curse of dimensionality poses for widely used statistical learning algorithms
make it necessary to explore encoding alternatives to dummy variables and other ad hoc methods
in the literature. The proposed methodology provides a statistically driven and
structure-oriented representation of categorical variables that can be fed into supervised
binary classification models.
An application of the proposed methodology is supervised classification for fraud
detection. Fraud is a social phenomenon with several impacts and is the subject of active
research in the statistical and network communities. Insurance companies are highly
exposed to fraudulent claims, and the data required to analyze them are mostly
qualitative. An experimental case study is conducted with an automobile insurance
fraud detection scenario to compare the performance of the proposed bipartite
encoding with the popular target encoding (Micci-Barreca, 2001). The empirical
results show that the bipartite network encoding can help random forest models lower
the false positive rate. This encoding also highlights relations among categorical variables,
making it more interpretable than some of the popular methods in the statistical learning
community.
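For context on the baseline and the metric mentioned above, the following is a minimal sketch of smoothed target encoding in the spirit of Micci-Barreca (2001), together with a false positive rate helper. The sigmoid blending weight and its parameters `k` and `f`, the function names, and the toy data are assumptions chosen for illustration. They are not taken from the case study.

```python
import math
from collections import defaultdict


def target_encode(categories, labels, k=1.0, f=1.0):
    """Blend each category's positive rate with the global prior.

    The weight lam = 1 / (1 + exp(-(n - k) / f)) grows with the category
    count n, so rare categories are shrunk towards the prior.
    k and f are illustrative smoothing parameters (assumed values).
    """
    prior = sum(labels) / len(labels)
    count = defaultdict(int)
    positives = defaultdict(int)
    for c, y in zip(categories, labels):
        count[c] += 1
        positives[c] += y

    encoding = {}
    for c, n in count.items():
        lam = 1.0 / (1.0 + math.exp(-(n - k) / f))
        encoding[c] = lam * positives[c] / n + (1.0 - lam) * prior
    return encoding


def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN): share of legitimate cases flagged as fraud."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical toy data: category per claim and fraud label.
enc = target_encode(["a", "a", "b"], [1, 1, 0])
fpr = false_positive_rate([0, 0, 1, 0], [1, 0, 1, 0])
```

In a comparison like the one described, the encoded values would replace the raw categories before fitting the classifier, and the false positive rate would be computed on held-out predictions.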