Claudio Salvatore Arcidiacono
I am Claudio, a Senior Data Scientist at Mollie. I have worked in the fintech sector for the past 7 years and have extensive experience with classical machine learning problems, mainly binary classification. I love contributing to open source data science packages such as feature-engine, scikit-learn and narwhals, and I maintain a couple of packages myself (felimination and sklearo). In my free time I am a coffee scientist: I use a data-driven approach to dial in the perfect cup of espresso.
Session
Text embeddings are a powerful tool for encoding the essence of unstructured text into a structured, dense, multidimensional vector representation. Because of how they are built, tree-based models such as decision trees, gradient boosted decision trees and random forests struggle to use text embedding features effectively. A tree uses only one feature at each split, so the number of embedding dimensions it can combine along any root-to-leaf path is bounded by the tree depth.
Other models, such as linear models, can use text embeddings more effectively because they combine all of the embedding dimensions simultaneously.
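The contrast between the two model families can be checked empirically. The following is a minimal sketch rather than material from the talk: it assumes scikit-learn and NumPy are installed, uses random vectors as a stand-in for real text embeddings, and introduces a hypothetical helper, `max_features_on_a_path`, to count how many distinct dimensions a fitted tree tests on its longest path.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_samples, n_dims, max_depth = 2000, 64, 4

# Assumption: random vectors stand in for text embeddings, and the label
# depends a little on every dimension.
X = rng.normal(size=(n_samples, n_dims))
y = (X.sum(axis=1) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y)
linear = LogisticRegression(max_iter=1000).fit(X, y)


def max_features_on_a_path(fitted_tree):
    """Largest number of distinct features tested on any root-to-leaf path.

    Hypothetical helper written for this illustration.
    """
    t = fitted_tree.tree_
    best = 0
    stack = [(0, frozenset())]  # (node id, features seen so far)
    while stack:
        node, seen = stack.pop()
        if t.children_left[node] == -1:  # -1 marks a leaf in scikit-learn trees
            best = max(best, len(seen))
            continue
        seen = seen | {int(t.feature[node])}
        stack.append((int(t.children_left[node]), seen))
        stack.append((int(t.children_right[node]), seen))
    return best


path_features = max_features_on_a_path(tree)
nonzero_coefs = int(np.count_nonzero(linear.coef_))

print(f"dimensions on the longest tree path: {path_features} (depth limit {max_depth})")
print(f"dimensions with a non-zero linear coefficient: {nonzero_coefs} of {n_dims}")
```

On data like this, a single prediction from the tree can depend on at most `max_depth` dimensions, whereas the linear model assigns a weight to every dimension.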
In this talk we present a novel approach for transforming text embedding features into a format that tree-based models can use effectively. The approach combines the strengths of non-tree-based models with the predictive power of tree-based models to create a more effective feature representation for tree-based models.