PyData Vermont 2025

Cleaning Messy Data at Scale: APIs, LLMs, and Custom NLP Pipelines
2025-10-22, UVM Alumni House Silver Pavilion

Messy, inconsistent data is the curse of any analytics or modeling workflow. This talk uses address data as a running example and demonstrates how natural-language-based approaches can clean and normalize addresses at scale. The presentation showcases the results of several methods, ranging from naive regular-expression rules to third-party APIs, open-source address parsing, scalable LLM embeddings with vector search, and custom text embeddings.

Attendees will leave knowing when to choose each method and how to balance cost, speed, and precision.


Description

Address data often arrives in inconsistent formats, riddled with typos, and missing key fields. At scale, this makes matching to a ground truth dataset both costly and error-prone. This talk explores practical strategies for solving this problem:

  1. Basic regex rules: the most naive and scalable approach to cleaning fields; this method serves as our benchmark.
  2. Third-party APIs and open-source solutions: quick to implement but often expensive, rate-limited, and inflexible; this method serves as our gold standard.
  3. LLM embeddings + vector search: easy to use but often overkill, and they require careful handling of ambiguity and edge cases.
  4. Custom address generator + embeddings: a tailored approach designed to train domain-specific representations.
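To make the baseline concrete, here is a minimal sketch of the regex approach (strategy 1). The rule table and addresses are hypothetical, not from the talk; a real pipeline would need far more rules, which is exactly why this method is the benchmark rather than the solution.

```python
import re

# Hypothetical abbreviation rules; a production table would be much larger.
ABBREVIATIONS = {
    r"\bstreet\b": "st",
    r"\bavenue\b": "ave",
    r"\broad\b": "rd",
    r"\bapartment\b": "apt",
}

def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, abbreviate."""
    text = raw.lower()
    text = re.sub(r"[.,#]", " ", text)        # drop common punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    for pattern, repl in ABBREVIATIONS.items():
        text = re.sub(pattern, repl, text)
    return text

print(normalize_address("123 Main Street, Apartment 4"))  # → 123 main st apt 4
```

Each new data source tends to break a rule table like this, which motivates the learned approaches in strategies 3 and 4.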

By focusing on a problem that is approachable and not overly domain-specific (i.e., working with simple text), attendees will learn how to design robust pipelines and choose the right tradeoff between accuracy, cost, and scalability.
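Strategy 4 starts from data generation. As a hedged illustration (the street names, noise rules, and function names below are all invented for this sketch), a generator might synthesize (clean, corrupted) address pairs that could later train a domain-specific embedding model; the PyTorch training step itself is out of scope here.

```python
import random

# Toy vocabulary and corruption rules -- assumptions for illustration only.
STREETS = ["Main St", "Oak Ave", "Church St"]

def generate_pair(rng: random.Random) -> tuple[str, str]:
    """Return a (clean, corrupted) pair for contrastive-style training."""
    clean = f"{rng.randint(1, 999)} {rng.choice(STREETS)}, Burlington VT"
    corrupted = clean
    if rng.random() < 0.5:   # sometimes drop the state
        corrupted = corrupted.replace(", Burlington VT", ", Burlington")
    if rng.random() < 0.5:   # sometimes garble the street suffix
        corrupted = corrupted.replace("St", "Str").replace("Ave", "Av")
    return clean, corrupted

rng = random.Random(0)
pairs = [generate_pair(rng) for _ in range(3)]
```

Because the generator controls the corruption, every noisy string comes with its ground-truth label for free, which is what makes training domain-specific representations feasible.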

Audience & Prerequisites
Data analysts, data scientists, ML engineers, and data engineers with Python experience. Prior NLP or vector-search knowledge is helpful but not required: technical concepts such as embeddings, tokenization, and vector search will be covered.

Takeaways

  • Pros and cons of API-based, LLM-based, and custom approaches
  • How to integrate embeddings into structured data cleaning
  • Designing scalable, cost-efficient address-matching workflows

Outline

  • Introduction (0–5 min)
      • Why is address data inherently messy?
      • Real-world impact of poor address quality (e.g., delivery errors, duplicate records)
      • Key challenges at scale: variability, typos, missing fields
  • API-based Matching (5–10 min)
      • How commercial and open-source APIs work for address cleaning and matching
      • Pros: simplicity, domain expertise, ready-made infrastructure
      • Cons: cost per request, rate limits, limited customization
      • Example workflow
  • LLM Embeddings + Vector Search (10–25 min)
      • Using tokenization and embeddings for address similarity scoring
      • Setting up vector search for matching to ground truth
      • Pros: scalable, reduced dependency on external vendors
      • Cons: resource-hungry, requires optimization
  • Custom Address Generator + Embeddings (25–35 min)
      • Building dedicated embeddings for domain-specific matching
      • Combining with vector search for high-accuracy results
  • Conclusion & Q&A (35–40 min)
      • Comparing tradeoffs: cost, scalability, accuracy
      • Choosing the right approach for your use case
      • Open discussion
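The embed-and-search step in the outline can be sketched in a few lines. As an assumption for illustration, character trigram counts stand in for learned embeddings, and a brute-force scan stands in for a real vector index; the talk's actual pipeline would use an LLM or custom PyTorch encoder plus an approximate-nearest-neighbor index.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy 'embedding': character trigram counts (stand-in for a model)."""
    t = f"  {text.lower()}  "  # pad so edges form trigrams too
    return Counter(t[i : i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(query: str, ground_truth: list[str]) -> str:
    """Brute-force vector search: return the closest ground-truth address."""
    q = embed(query)
    return max(ground_truth, key=lambda gt: cosine(q, embed(gt)))

truth = ["123 Main St, Burlington VT", "45 Oak Ave, Montpelier VT"]
print(best_match("123 Mian Street Burlington", truth))
# → 123 Main St, Burlington VT  (typo-tolerant: "Mian" still matches "Main")
```

The same structure survives the upgrade to real embeddings: only `embed` (swap in a model) and the scan in `best_match` (swap in an ANN index) change, which is why the method scales while remaining vendor-independent.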

Keywords

Address Matching, BigQuery, Data Cleaning, Data Quality, libpostal, LLM Embeddings, NLP, Vector Search, PyTorch


Prior Knowledge Expected: No previous knowledge expected

Thibault Dody is a Senior Data Scientist at Faraday, specializing in scalable machine learning architectures for consumer behavior prediction. His previous work focused on developing methods to detect harmful content online and on social media. He earned his Master’s degree in Computational Science from MIT.