PyData Tel Aviv 2025

Is This Feature Actually Interesting? ML & LLMs for Automated Insight Discovery
2025-11-05, ML+analytics

Machine learning excels at prediction, but it often leaves data scientists manually sifting through feature importance lists to find truly interesting insights. This talk introduces "InterFeat," an automated pipeline that goes beyond predictive power to identify features that are novel, plausible, and useful, i.e., "interesting". We'll demonstrate how combining classical ML, knowledge graphs, literature mining, and Large Language Models (LLMs) can operationalize the elusive concept of "interestingness." Using a case study on real-world biomedical data (UK Biobank), we'll show how this framework automatically surfaces potentially groundbreaking hypotheses (validated by doctors) that traditional methods miss. Attendees will learn a practical approach to accelerating discovery in their own complex datasets.


The goal of data analysis isn't just prediction; it's often the discovery of new, actionable insights. However, identifying genuinely "interesting" phenomena – those that are novel, mechanistically plausible, and potentially useful – remains a largely manual and subjective process. Standard techniques like feature importance ranking (e.g., SHAP) or statistical significance testing often highlight known or trivial relationships.
This talk presents InterFeat, a systematic framework designed to automate the discovery of interesting hypotheses from structured data. We'll walk through the pipeline's components:
Utility Filters: Using statistical tests and importance scores from ML models to identify features with predictive value.
Novelty Filters: Leveraging knowledge graphs (like SemMedDB) and large-scale literature databases (like PubMed) to automatically screen out well-established associations.
LLM Annotation Layer: Employing retrieval-augmented Large Language Models (LLMs like GPT-4) to assess the novelty and plausibility of remaining candidates, integrating vast domain knowledge and generating natural language explanations for why a feature might be interesting.
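To make the three stages concrete, here is a minimal sketch of how such a pipeline could be chained together. It is not the InterFeat implementation: the `Candidate` fields, the thresholds, the function names, and the toy data are all hypothetical, and the scores are assumed to be precomputed elsewhere (for example by a model, a knowledge-graph lookup, and a literature search). The final step only assembles a retrieval-grounded prompt; calling an actual LLM is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One feature-target pair with precomputed scores (all fields hypothetical)."""
    feature: str
    target: str
    importance: float         # e.g. a model-based importance score
    p_value: float            # e.g. from a univariate statistical test
    in_knowledge_graph: bool  # True if a knowledge graph already links the pair
    literature_hits: int      # number of papers co-mentioning feature and target


def utility_filter(cands: List[Candidate], min_importance: float = 0.01,
                   max_p: float = 0.05) -> List[Candidate]:
    """Keep candidates that show predictive value (thresholds are assumptions)."""
    return [c for c in cands
            if c.importance >= min_importance and c.p_value <= max_p]


def novelty_filter(cands: List[Candidate], max_hits: int = 5) -> List[Candidate]:
    """Drop associations that already look well established."""
    return [c for c in cands
            if not c.in_knowledge_graph and c.literature_hits <= max_hits]


def build_annotation_prompt(c: Candidate, snippets: List[str]) -> str:
    """Assemble a prompt grounded in retrieved text; no LLM is called here."""
    if snippets:
        context = "\n".join(f"- {s}" for s in snippets)
    else:
        context = "- (no relevant literature retrieved)"
    return (
        f"Feature: {c.feature}\n"
        f"Target: {c.target}\n"
        f"Retrieved context:\n{context}\n"
        "Using only the context above and established domain knowledge, "
        "rate the novelty and plausibility of this association and explain why."
    )


def shortlist(cands: List[Candidate]) -> List[Candidate]:
    """Run the utility filter first, then the novelty filter."""
    return novelty_filter(utility_filter(cands))


# Toy candidates with made-up scores, for illustration only.
CANDIDATES = [
    Candidate("feature_a", "disease_x", 0.20, 0.001, True, 300),   # useful but well known
    Candidate("feature_b", "disease_x", 0.001, 0.40, False, 0),    # novel but not useful
    Candidate("feature_c", "disease_x", 0.05, 0.01, False, 2),     # useful and novel
]

SHORTLIST = shortlist(CANDIDATES)
print([c.feature for c in SHORTLIST])  # ['feature_c']
print(build_annotation_prompt(SHORTLIST[0], []))
```

Running the cheap filters before the LLM step keeps the number of prompts small, and passing retrieved snippets into the prompt is one way to keep the model's assessment tied to external evidence.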
We'll showcase InterFeat's application to identifying novel disease risk factors in the large-scale UK Biobank dataset, demonstrating its ability to recover risk factors years before they appeared in the literature and to achieve significantly higher rates of expert-validated interesting features than baseline methods.
Key Takeaways for the PyData Audience:
A formal definition and operationalization of "interestingness" beyond mere prediction.
Practical techniques for integrating ML, KGs, text mining, and LLMs in a discovery pipeline.
How to use LLMs effectively for hypothesis evaluation and explanation, grounded in data and external knowledge sources (mitigating hallucination).
Insights from applying this framework to real-world, large-scale data.
Access to the open-source codebase (github.com/ddofer/InterFeat) to adapt the pipeline for their own research or business problems.


Prior Knowledge Expected: Previous knowledge expected

Dan Ofer received a B.Sc. degree in psychobiology in 2013 and a dual M.Sc. degree in bioinformatics and neurobiology from The Hebrew University. He is currently a PhD candidate with Professors Dafna Shahaf and Michal Linial, and has been an AI researcher in industry since 2015. Previously, at SparkBeyond/McKinsey, he developed AI solutions in multiple industries, including insurance, finance, healthcare, and novel biomarker discovery with CRI. His research interests include biological foundation models, explainable AI, automated feature engineering on tabular data, protein LLMs, and AI in healthcare.
Passionate bookworm, geek, and photographer.