PyData Berlin 2025

Training Specialized Language Models with Less Data: An End-to-End Practical Guide
2025-09-02, B07-B08

Small Language Models (SLMs) offer an efficient and cost-effective alternative to LLMs, especially when latency, privacy, inference cost, or deployment constraints matter. However, training them typically requires large labeled datasets and is time-consuming, even if it isn't your first rodeo.

This talk presents an end-to-end approach for curating high-quality synthetic data with LLMs to train domain-specific SLMs. Using a real-world use case, we’ll demonstrate how to reduce manual labeling time, cut costs, and maintain performance, making SLMs viable for production applications.

Whether you are a seasoned Machine Learning Engineer or just getting started with building AI features, you will come away with the inspiration to build more performant, secure, and environmentally friendly AI systems.


Training effective language models typically involves two major bottlenecks: the need for vast amounts of labeled data and the engineering complexity of fine-tuning. This talk introduces a practical framework for addressing both, enabling teams to build small, domain-specialized language models (SLMs) that are deployable, secure, and cost-efficient, without needing massive labeled datasets.

SLMs are especially well suited for focused tasks such as classification, function calling, or question answering, where full-scale LLMs are overkill. They are smaller, faster, and easier to deploy on local or mobile infrastructure, making them ideal for latency-sensitive, privacy-conscious, or resource-limited applications. However, fine-tuning them has traditionally required tens of thousands of manually labeled examples.

Our approach uses synthetic data generation and validation techniques to drastically reduce the labeling burden. Leveraging large language models (LLMs) as “teacher models,” we generate and curate synthetic training data tailored to specific tasks. This data, combined with a handful of manually labeled examples and a clear task description, is then used to fine-tune SLMs (“student models”) that match or exceed the performance of larger models on the same narrow tasks.
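
As a minimal sketch of what the generation step can look like (the support-ticket task, label set, prompt, and model name below are illustrative assumptions, not the exact setup from the case study), the teacher model is simply prompted for labeled examples:

```python
import json

from openai import OpenAI  # assumes an OpenAI-compatible teacher endpoint

client = OpenAI()

# Hypothetical task and label space; substitute your own
LABELS = ["billing", "bug_report", "feature_request"]

PROMPT = (
    "You write training data for a support-ticket classifier. "
    f"Generate one realistic, varied ticket and label it with one of {LABELS}. "
    'Reply with JSON only: {"text": "...", "label": "..."}'
)

def generate_examples(n: int) -> list[dict]:
    examples = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable teacher model works here
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,      # sample at high temperature for diversity
        )
        try:
            examples.append(json.loads(response.choices[0].message.content))
        except json.JSONDecodeError:
            continue  # malformed generations are dropped; see validation below
    return examples
```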

We’ll walk through a detailed example based on a real-life use case, covering:
- Task scoping: How to define your model’s purpose and output space clearly.
- Synthetic data generation: Prompting LLMs to generate meaningful and diverse examples.
- Data validation: Techniques for filtering out poor-quality, duplicate, or malformed synthetic data (see the filtering sketch after this list).
- Model fine-tuning: How the student model is trained to emulate the teacher’s domain knowledge (see the fine-tuning sketch below).
- Deployment: Delivering the model as binaries for use on internal infrastructure or edge devices.
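
To make the validation step concrete, here is a minimal filtering sketch; the schema and the length thresholds are assumptions for illustration, not the exact rules used in the case study:

```python
def validate(examples: list[dict], labels: set[str]) -> list[dict]:
    """Drop malformed, off-schema, low-quality, and duplicate synthetic examples."""
    seen, clean = set(), []
    for ex in examples:
        text = ex.get("text", "")
        # Malformed: missing fields or a label outside the task's output space
        if not text or ex.get("label") not in labels:
            continue
        # Low quality: suspiciously short or long generations (thresholds are assumptions)
        if not 20 <= len(text) <= 2000:
            continue
        # Exact duplicates after light normalization
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        clean.append(ex)
    return clean
```

In practice you would likely layer near-duplicate detection (e.g. embedding similarity) and an LLM-based quality check on top of these cheap rule-based filters.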
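And a compact sketch of the fine-tuning step, assuming the same hypothetical classification task and the Hugging Face transformers/datasets stack; the base model, hyperparameters, and output paths are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["billing", "bug_report", "feature_request"]  # same assumed label space
label2id = {name: i for i, name in enumerate(LABELS)}

def fine_tune(clean: list[dict], base_model: str = "distilbert-base-uncased"):
    """Train a small student model on the validated synthetic examples."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels=len(LABELS)
    )
    # `clean` is the output of the validation sketch above
    dataset = Dataset.from_list(
        [{"text": ex["text"], "label": label2id[ex["label"]]} for ex in clean]
    ).map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="slm-out", num_train_epochs=3),
        train_dataset=dataset,
        tokenizer=tokenizer,  # enables dynamic padding when batching
    )
    trainer.train()
    trainer.save_model("slm-out")
```

For the deployment step, the saved checkpoint can then be packaged for binary-style serving; one common route (assuming the optimum package is available) is an ONNX export:

```python
# Export the fine-tuned student to ONNX for dependency-light, on-prem serving
from optimum.onnxruntime import ORTModelForSequenceClassification

ORTModelForSequenceClassification.from_pretrained("slm-out", export=True).save_pretrained("slm-onnx")
```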

We’ll also discuss key challenges teams face in adopting this approach, such as validation bottlenecks, overfitting on synthetic data, and the need for interpretable task definitions, along with how we’ve addressed them in production environments.

This talk is targeted at data scientists, ML engineers, and tech leads who are looking for pragmatic strategies to bring specialized AI features into production without relying on API-based LLMs or manual annotation at scale. No prior knowledge of model distillation is required, though basic familiarity with supervised learning and model training will be helpful.

Attendees will leave with:
- A concrete workflow for training SLMs using synthetic data
- Insights into trade-offs between SLMs and LLMs
- Techniques for validating and curating LLM-generated data
- A better understanding of when and how to deploy small models effectively in production

This is not a theoretical talk. It is a field-tested approach grounded in real use cases, designed to empower small teams to build efficient, private, and reliable NLP systems.


Expected audience expertise (domain):

Novice

Prerequisites:

Basic ML concepts (what a model is, what model accuracy means, etc.)

Abstract as a tweet (X) or toot (Mastodon):

🚀 Train domain-specific language models without massive labeled datasets! Learn how to use LLMs to generate + validate synthetic data and fine-tune fast, accurate SLMs for production. A practical, end-to-end guide to efficient NLP. #PyData #NLP #ML #LLM #SLM

Jacek is the CTO of distil labs, making it easy to build specialized AI agents that can be deployed on-device/on-prem. Before that, he was a machine learning team lead at AWS, working on the core components of AWS Q, Automated ML, and natural language processing. He holds a PhD in Machine Learning for Quantum Mechanics from Imperial College London.