Mining Parliamentary Gold: Building Hebrew ASR from 9,000 Hours of Knesset Debates
Yanir Marmor, Yoad Snapir
How we transformed 16 years of challenging Knesset recordings, featuring overlapping speech, audience heckling, and non-verbatim protocols, into an 8,825-hour Hebrew ASR dataset, then fine-tuned Whisper models to achieve a 10% WER reduction through smart data engineering over brute-force pre-training. Attendees learn about the challenges and potential of "public-data gold mining.".