PyData Vermont 2025

Cleaning Messy Data at Scale: APIs, LLMs, and Custom NLP Pipelines
2025-10-22, UVM Alumni House Silver Pavilion

Messy, inconsistent data is the curse of any analytics or modeling workflow. This talk uses address data as a running example and demonstrates how natural-language-based approaches can clean and normalize addresses at scale. The presentation showcases the results of several methods, ranging from naive regular-expression rules to third-party APIs, open-source address parsing, scalable LLM embeddings with vector search, and custom text embeddings.

Attendees will leave knowing when to choose each method and how to balance cost, speed, and precision.


Description

Address data often arrives in inconsistent formats, riddled with typos, and missing key fields. At scale, this makes matching to a ground truth dataset both costly and error-prone. This talk explores practical strategies for solving this problem:

  1. Basic regex rules: the most naive and scalable approach to cleaning fields; this method serves as our benchmark.
  2. Third-party APIs and open-source solutions: quick to implement but often expensive, rate-limited, and inflexible; this method serves as our gold standard.
  3. LLM embeddings + vector search: easy to use but often overkill, and they require careful handling of ambiguity and edge cases.
  4. Custom address generator + embeddings: a tailored approach designed to train domain-specific representations.
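To make the baseline concrete, here is a minimal sketch of the regex approach (strategy 1). The rule table and addresses are hypothetical, not from the talk; a real pipeline would need far more rules, which is exactly why this method is the benchmark rather than the solution.

```python
import re

# Hypothetical abbreviation rules; a production table would be much larger.
ABBREVIATIONS = {
    r"\bstreet\b": "st",
    r"\bavenue\b": "ave",
    r"\broad\b": "rd",
    r"\bapartment\b": "apt",
}

def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, abbreviate."""
    text = raw.lower()
    text = re.sub(r"[.,#]", " ", text)        # drop common punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    for pattern, repl in ABBREVIATIONS.items():
        text = re.sub(pattern, repl, text)
    return text

print(normalize_address("123 Main Street, Apartment 4"))  # → 123 main st apt 4
```

Each new data source tends to break a rule table like this, which motivates the learned approaches in strategies 3 and 4.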

By focusing on a problem that is approachable and not overly domain-specific (i.e., working with simple text), attendees will learn how to design robust pipelines and choose the right tradeoff between accuracy, cost, and scalability.
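Strategy 4 starts from data generation. As a hedged illustration (the street names, noise rules, and function names below are all invented for this sketch), a generator might synthesize (clean, corrupted) address pairs that could later train a domain-specific embedding model; the PyTorch training step itself is out of scope here.

```python
import random

# Toy vocabulary and corruption rules -- assumptions for illustration only.
STREETS = ["Main St", "Oak Ave", "Church St"]

def generate_pair(rng: random.Random) -> tuple[str, str]:
    """Return a (clean, corrupted) pair for contrastive-style training."""
    clean = f"{rng.randint(1, 999)} {rng.choice(STREETS)}, Burlington VT"
    corrupted = clean
    if rng.random() < 0.5:   # sometimes drop the state
        corrupted = corrupted.replace(", Burlington VT", ", Burlington")
    if rng.random() < 0.5:   # sometimes garble the street suffix
        corrupted = corrupted.replace("St", "Str").replace("Ave", "Av")
    return clean, corrupted

rng = random.Random(0)
pairs = [generate_pair(rng) for _ in range(3)]
```

Because the generator controls the corruption, every noisy string comes with its ground-truth label for free, which is what makes training domain-specific representations feasible.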

Audience & Prerequisites
Data analysts, data scientists, ML engineers, and data engineers with Python experience. Prior NLP or vector-search knowledge is helpful but not required: technical concepts such as embeddings, tokenization, and vector search will be covered.

Takeaways

  • Pros and cons of API-based, LLM-based, and custom approaches
  • How to integrate embeddings into structured data cleaning
  • Designing scalable, cost-efficient address-matching workflows

Outline

  • Introduction (0–5 min)
      • Why is address data inherently messy?
      • Real-world impact of poor address quality (e.g., delivery errors, duplicate records)
      • Key challenges at scale: variability, typos, missing fields
  • API-based Matching (5–10 min)
      • How commercial and open-source APIs work for address cleaning and matching
      • Pros: simplicity, domain expertise, ready-made infrastructure
      • Cons: cost per request, rate limits, limited customization
      • Example workflow
  • LLM Embeddings + Vector Search (10–25 min)
      • Using tokenization and embeddings for address similarity scoring
      • Setting up vector search for matching to ground truth
      • Pros: scalable, reduced dependency on external vendors
      • Cons: resource-hungry, requires optimization
  • Custom Address Generator + Embeddings (25–35 min)
      • Building dedicated embeddings for domain-specific matching
      • Combining with vector search for high-accuracy results
  • Conclusion & Q&A (35–40 min)
      • Comparing tradeoffs: cost, scalability, accuracy
      • Choosing the right approach for your use case
      • Open discussion
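The embed-and-search step in the outline can be sketched in a few lines. As an assumption for illustration, character trigram counts stand in for learned embeddings, and a brute-force scan stands in for a real vector index; the talk's actual pipeline would use an LLM or custom PyTorch encoder plus an approximate-nearest-neighbor index.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy 'embedding': character trigram counts (stand-in for a model)."""
    t = f"  {text.lower()}  "  # pad so edges form trigrams too
    return Counter(t[i : i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query: str, ground_truth: list[str]) -> str:
    """Brute-force vector search: return the closest ground-truth address."""
    q = embed(query)
    return max(ground_truth, key=lambda gt: cosine(q, embed(gt)))

truth = ["123 Main St, Burlington VT", "45 Oak Ave, Montpelier VT"]
print(best_match("123 Mian Street Burlington", truth))
# → 123 Main St, Burlington VT  (typo-tolerant: "Mian" still matches "Main")
```

The same structure survives the upgrade to real embeddings: only `embed` (swap in a model) and the scan in `best_match` (swap in an ANN index) change, which is why the method scales while remaining vendor-independent.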

Keywords

Address Matching, BigQuery, Data Cleaning, Data Quality, libpostal, LLM Embeddings, NLP, Vector Search, PyTorch


Prior Knowledge Expected: No previous knowledge expected

Thibault Dody is a Senior Data Scientist at Faraday, specializing in scalable machine learning architectures for consumer behavior prediction. His previous work focused on developing methods to detect harmful content online and on social media. He earned his Master’s degree in Computational Science from MIT.