2025-12-09, Analytics, Visualization & Decision Science
Tired of exact matches failing on messy data? This talk shows how BM25, the ranking function behind many full-text search engines, can be used for fuzzy matching when enriching massive datasets that contain noisy product names. We'll compare practical, large-scale implementations using Python's bm25s library (with GPU acceleration via Dask cuDF) and DuckDB's built-in full-text search. Join us to learn how to achieve fast, accurate data integration and to find the right tool for your fuzzy matching needs.
The problem at hand:
Are you constantly battling messy, inconsistent product names across massive datasets? Traditional exact matching just doesn't cut it when you're trying to integrate data from different sources, for example joining a 1-million-row internal catalog against a 3.8-million-row external one such as Open Food Facts. This talk addresses that problem: how to find fuzzy matches efficiently and accurately, saving hours of manual reconciliation and enabling robust data enrichment. It's relevant to anyone working with real-world, imperfect data at scale.
Is this talk for me?
This talk is for data engineers, data scientists, and analytics professionals who work with large-scale datasets and face challenges with data integration, record linkage, or building robust search functionalities. A basic understanding of dataframes and SQL will be helpful, but no deep prior knowledge of search algorithms is required.
This will be an informative and practical talk with a clear focus on real-world application. While we'll briefly cover the "why" behind BM25, the emphasis will be on "how" to implement and optimize it. We'll present concrete benchmarks and code examples, moving beyond theoretical concepts.
What will I learn?
By the end of this session, you will:
- Understand why BM25 is a better fit than exact matching for linking noisy product names.
- See a practical, head-to-head comparison of implementing BM25 using the optimized Python library bm25s and DuckDB's native full-text search.
- Gain insights into performance implications (speed and memory usage) for each approach on large datasets, including the benefits of GPU acceleration with Dask cuDF.
- Learn production tips for persisting indexes, handling bulk queries, and managing memory effectively.
- Be equipped to choose the most suitable BM25 implementation for your specific data enrichment and fuzzy matching needs, allowing you to build faster and more accurate data pipelines.
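To make the idea behind these points concrete, here is a minimal, standard-library-only sketch of BM25 scoring applied to product names. It is not the talk's implementation and uses neither bm25s nor DuckDB; the class name `TinyBM25`, the sample catalog, the tokenizer, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Illustrative BM25 scorer; a hypothetical helper, not a library API."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tf = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.doc_tf]
        self.avgdl = sum(self.doc_len) / len(docs)
        n = len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.doc_tf for term in tf)
        # IDF variant with +1 inside the log, so weights stay positive.
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def score(self, query, i):
        tf = self.doc_tf[i]
        # Length normalization: longer-than-average names are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def best_match(self, query):
        scores = [self.score(query, i) for i in range(len(self.doc_tf))]
        best = max(range(len(scores)), key=scores.__getitem__)
        return best, scores[best]


# Hypothetical catalog and noisy query, for illustration only.
catalog = [
    "Organic Whole Milk 1L",
    "Dark Chocolate Bar 70% Cocoa",
    "Whole Wheat Bread Sliced",
]
index = TinyBM25(catalog)
query = "choc bar dark 70 cocoa"
best_idx, best_score = index.best_match(query)
print(catalog[best_idx], round(best_score, 3))
```

The query is not an exact match for any catalog entry, yet the shared tokens still rank the chocolate bar first. Note that BM25 works on whole tokens: it tolerates reordered, missing, or extra words, but the abbreviation "choc" contributes nothing here because it is not the same token as "chocolate".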
Any pre-requisite knowledge I should have?
- An intermediate-level background in Python
- Introductory-level knowledge of DuckDB
- Introductory-level knowledge of how BM25 works would be a bonus!
Aniket is an engineer at heart. He founded Curlscape, where he helps businesses bring practical AI applications to life quickly. He has led the design and deployment of large-scale systems across industries, from finance and healthcare to education and logistics. His work spans LLM-based information extraction, agentic workflows, voice assistants, and continuous evaluation frameworks.