PyData Tel Aviv 2025

Mining Parliamentary Gold: Building Hebrew ASR from 9,000 Hours of Knesset Debates
2025-11-05, English

How we transformed 16 years of challenging Knesset recordings, featuring overlapping speech, audience heckling, and non-verbatim protocols, into an 8,825-hour Hebrew ASR dataset, then fine-tuned Whisper models to achieve a 10% WER reduction through smart data engineering rather than brute-force pre-training. Attendees will learn about the challenges and potential of "public-data gold mining".


Parliamentary recordings are a hidden goldmine for Hebrew ASR, but they're also uniquely challenging. The Knesset's decade-plus of streamed debates contain a variety of technical obstacles, including sessions lasting 50+ hours, protocols that preserve the speaker's "spirit" rather than exact words, timestamp artifacts, and constant background noise from audience interruptions.

We'll show how a fully reproducible Python pipeline, built with stable_ts + faster-whisper + CTranslate2 and only light ffmpeg remuxing, force-aligns every sentence (despite heckling, interjections, diglossia, and wobbly timestamps), normalizes the text, and packages the whole lot as a license-clean, Hugging Face-ready dataset.
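The text-normalization step can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual rules: it assumes the cleanup involves stripping niqqud/cantillation marks and dropping bracketed protocol annotations (e.g. heckling notes), which are common choices for Hebrew ASR text.

```python
import re
import unicodedata

# Hebrew cantillation and vowel points (niqqud) occupy U+0591-U+05C7.
NIQQUD_RE = re.compile(r"[\u0591-\u05C7]")

def normalize_hebrew(text: str) -> str:
    """Illustrative normalization of a protocol sentence before alignment.

    Steps: Unicode NFC, strip niqqud/cantillation, drop bracketed
    annotations, collapse whitespace.
    """
    text = unicodedata.normalize("NFC", text)
    text = NIQQUD_RE.sub("", text)                    # remove vowel points
    text = re.sub(r"[\(\[][^)\]]*[\)\]]", " ", text)  # drop (heckling) notes
    text = re.sub(r"\s+", " ", text)                  # collapse whitespace
    return text.strip()
```

Stripping niqqud matters because protocols occasionally vocalize words while ASR output never does; normalizing both sides keeps the forced alignment from penalizing spurious character differences.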

With the dataset in hand, we fine-tune both Whisper-base and Whisper-Turbo, achieving a 10% WER reduction compared to earlier models. A side-by-side error taxonomy shows why smart data engineering can beat brute-force pre-training, especially for low-resource languages.

Attendees walk away with (1) a concrete recipe for turning messy public audio into research-grade training data; (2) Hebrew-specific alignment and text-cleaning tricks that generalise to more languages; and (3) fresh evidence that high-quality local data can outperform generic pre-training, empowering communities to build world-class models on their own terms.


Prior Knowledge Expected:

No previous knowledge expected

M.Sc. CS & Math student at Weizmann Institute of Science; ivrit.ai co-founder (non-profit)

AI & Tech Consultant, ivrit.ai