PyData Global 2025

How to Effectively Use Text Embeddings in Tree-Based Models
2025-12-10 , Machine Learning & AI

Text embeddings are a powerful tool for encoding the essence of unstructured text into a structured, dense, multidimensional vector representation. Because of how they are built, tree-based models such as decision trees, gradient boosted decision trees and random forests struggle to use text embedding features effectively. A tree can use only one feature at each split, so any single root-to-leaf path can draw on at most as many embedding dimensions as the tree is deep.

Other models, such as linear models, can use text embeddings more effectively because they combine all of the embedding dimensions simultaneously.
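The contrast between the two model families can be checked on synthetic data. The sketch below is an illustration only: it uses scikit-learn, random vectors stand in for real text embeddings, and the sizes (2000 samples, 64 dimensions, depth 3) are arbitrary assumptions rather than values from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Assumption: random vectors stand in for text embeddings, and the target
# depends a little on every dimension, as is typical for embedding features.
rng = np.random.default_rng(0)
n_samples, n_dims = 2000, 64
X = rng.normal(size=(n_samples, n_dims))
y = (X @ rng.normal(size=n_dims) > 0).astype(int)

# A shallow tree: every split tests exactly one embedding dimension.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
split_features = tree.tree_.feature  # leaf nodes are marked with a negative value
dims_used_by_tree = np.unique(split_features[split_features >= 0])

# A linear model: one weight per dimension, all combined in a single score.
linear = LogisticRegression(max_iter=1000).fit(X, y)
dims_used_by_linear = np.count_nonzero(linear.coef_)

print(f"tree uses {len(dims_used_by_tree)} of {n_dims} dimensions")
print(f"linear model uses {dims_used_by_linear} of {n_dims} dimensions")
```

A depth-3 tree has at most 7 internal nodes, so it can touch at most 7 distinct dimensions overall and at most 3 along any single path, whereas the linear model assigns a weight to every dimension.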

In this presentation we introduce a novel approach for transforming text embedding features into a format that tree-based models can use effectively. The approach combines the ability of non-tree-based models to use every embedding dimension with the predictive power of tree-based models, producing a feature representation that suits tree-based models better.


The presentation is aimed at Data Science and Machine Learning practitioners who are already familiar with tree-based models and want to learn how to incorporate text embedding features effectively to boost the performance of their models.

The methodology showcased in the presentation is available in the sklearo open source package.

The structure of the talk will be as follows:

  • 5 minutes: Overview of text embeddings, how tree-based models are built, and the challenges they face with text embeddings compared to linear models.
  • 5 minutes: Explanation of how we can leverage non-tree-based models to transform text embeddings into a format that tree-based models can use effectively.
  • 5 minutes: Explanation of cross-fitting, a technique used to avoid target leakage when generating features from the target variable.
  • 5 minutes: Code examples showing how this technique can be applied in practice with the sklearo open source library.
  • 5 minutes: Performance comparison of tree-based models using text embeddings as-is versus using the transformed features.

Prior knowledge about fundamental machine learning concepts such as overfitting, cross-validation, and feature engineering is recommended but not required.


Prior Knowledge Expected:

No

I am Claudio, a Senior Data Scientist at Mollie. I have worked in the fintech sector for the past 7 years and have extensive experience with classical machine learning problems, mainly binary classification. I love contributing to open source data science packages such as feature-engine, scikit-learn and narwhals, and I maintain a couple of packages myself (felimination and sklearo). In my free time I am a coffee scientist: I use a data-driven approach to dial in the perfect cup of espresso.