2025-12-09, Machine Learning & AI
This study addresses the generation of a synthetic non-life insurance premium dataset using several generative models. A Conditional Gaussian Mixture Model serves as the benchmark. The generated data were validated in several steps: data visualisation, univariate comparison, and PCA and UMAP representations of the training data against the generated samples. The consistency of the generated data was further checked with the Kolmogorov–Smirnov test and with predictive modelling of frequency and severity using Generalised Linear Models (GLMs) with a Tweedie distribution, taken as a measure of the generated data's quality, followed by an analysis of feature importance. For further comparison, advanced Deep Learning architectures were employed: Conditional Variational Autoencoders (CVAEs), CVAEs enhanced with a Transformer Decoder, a Conditional Diffusion Model, and Large Language Models. The analysis assesses each model’s ability to capture the underlying distributions, preserve complex dependencies, and maintain relationships intrinsic to the premium data. These findings provide directions for improving synthetic data generation in insurance, with potential applications in risk modelling, pricing under data scarcity, and regulatory compliance.
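As an illustration of the Kolmogorov–Smirnov step, the sketch below compares one numeric feature between a real and a synthetic sample. The data are simulated stand-ins, and the names `real`, `good_synth`, `bad_synth` and `ks_report` are hypothetical; only `scipy.stats.ks_2samp` is an existing library call. It is a minimal example of the test, not the pipeline used in the study.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a single numeric feature (e.g. a premium amount).
real = rng.gamma(shape=2.0, scale=300.0, size=2000)
good_synth = rng.gamma(shape=2.0, scale=300.0, size=2000)  # same distribution
bad_synth = rng.normal(loc=600.0, scale=50.0, size=2000)   # same mean, wrong shape


def ks_report(real_col, synth_col):
    """Return the two-sample KS statistic and p-value for one feature."""
    result = ks_2samp(real_col, synth_col)
    return result.statistic, result.pvalue


stat_good, p_good = ks_report(real, good_synth)
stat_bad, p_bad = ks_report(real, bad_synth)

print(f"faithful generator: D={stat_good:.3f}, p={p_good:.3f}")
print(f"poor generator:     D={stat_bad:.3f}, p={p_bad:.3g}")
```

A smaller statistic D means the two empirical distributions are closer. In practice the check would be repeated for each numeric feature.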
In classification and regression tasks, generative models aim to learn the joint probability distribution of the data, which lets them generate data points similar to the training data. Open insurance datasets are rare because they encode the proprietary risk structures of the insurer, which limits researchers’ access to comprehensive data for analysis and for assessing new approaches. Synthetic data produced by generative models can ease this constraint, enabling reproducible experimentation and innovation.
In the talk, I explore several generative models used to produce synthetic data:
1) Conditional Gaussian Mixture Models used as a benchmark;
2) Conditional Variational Autoencoders;
3) Conditional Variational Autoencoders with a Transformer Decoder;
4) Conditional Diffusion Model;
5) Large Language Models.
Finally, I present the overall results, followed by a discussion of different approaches.
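To make the benchmark in item 1 concrete, the sketch below shows one simple way to condition a Gaussian Mixture Model: fit a separate mixture for each value of a categorical conditioning variable, then sample from the mixture that matches the requested condition. This is an assumption made for illustration and not necessarily the formulation used in the talk. The toy data and the names `segment`, `fit_conditional_gmm` and `sample_conditional` are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# Hypothetical toy data: two numeric features and one binary condition.
n = 600
segment = rng.integers(0, 2, size=n)             # e.g. 0 = private, 1 = commercial
means = np.array([[300.0, 1.0], [900.0, 3.0]])   # per-segment feature means
X = means[segment] + rng.normal(scale=[40.0, 0.3], size=(n, 2))


def fit_conditional_gmm(X, cond, n_components=2, seed=0):
    """Fit one GaussianMixture per distinct value of the conditioning variable."""
    models = {}
    for value in np.unique(cond):
        gmm = GaussianMixture(n_components=n_components, random_state=seed)
        models[int(value)] = gmm.fit(X[cond == value])
    return models


def sample_conditional(models, value, n_samples):
    """Draw synthetic rows from the mixture fitted for the given condition."""
    samples, _ = models[value].sample(n_samples)
    return samples


models = fit_conditional_gmm(X, segment)
synthetic_private = sample_conditional(models, 0, 500)
synthetic_commercial = sample_conditional(models, 1, 500)

print("private mean:   ", synthetic_private.mean(axis=0))
print("commercial mean:", synthetic_commercial.mean(axis=0))
```

Fitting per condition keeps the example short. With continuous conditioning variables, one would instead fit a joint mixture and condition it analytically.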
Statistics & Actuarial background
Actuary during the day
Data Scientist in my free time