Waris Gill
I am a final-year PhD student in the Computer Science department at Virginia Tech. Currently, I am interning at Redis as a Machine Learning Engineer.
Sessions
Large Language Models (LLMs) have opened new frontiers in natural language processing but often come with high inference costs and slow response times in production. In this talk, we’ll show how semantic caching with vector embeddings, particularly for frequently asked questions, can mitigate these issues in a RAG architecture. We’ll also discuss how we used contrastive fine-tuning to improve the embedding model’s ability to accurately identify duplicate questions. Attendees will leave with strategies for reducing infrastructure costs, improving RAG latency, and strengthening the reliability of their LLM-based applications. Basic familiarity with NLP or foundation models is helpful but not required.
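To make the core idea concrete, here is a minimal sketch of FAQ-style semantic caching in front of a RAG pipeline. It is not the implementation discussed in the talk: the embedding model name, the similarity threshold, the in-memory cache, and the `answer_with_rag()` placeholder are all illustrative assumptions, and a production setup would use a vector database rather than Python lists.

```python
# Hedged sketch: semantic cache keyed on question embeddings (cosine similarity).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Cache of previously seen question embeddings and their generated answers.
cache_embeddings: list[np.ndarray] = []
cache_answers: list[str] = []

SIMILARITY_THRESHOLD = 0.85  # illustrative cutoff for "same question"


def answer_with_rag(question: str) -> str:
    """Placeholder for the full retrieve-then-generate pipeline."""
    return f"(expensive LLM answer for: {question})"


def answer(question: str) -> str:
    query_vec = model.encode(question, normalize_embeddings=True)

    # Look for a semantically similar cached question; with normalized
    # vectors, cosine similarity reduces to a dot product.
    if cache_embeddings:
        sims = np.stack(cache_embeddings) @ query_vec
        best = int(np.argmax(sims))
        if sims[best] >= SIMILARITY_THRESHOLD:
            return cache_answers[best]  # cache hit: skip retrieval and generation

    # Cache miss: run the full RAG pipeline, then store the result.
    result = answer_with_rag(question)
    cache_embeddings.append(query_vec)
    cache_answers.append(result)
    return result


print(answer("How do I reset my password?"))
print(answer("What's the way to reset my password?"))  # likely served from cache
```

The second piece of the abstract, contrastive fine-tuning for duplicate-question detection, can also be sketched briefly. This version assumes the sentence-transformers `fit` API with an in-batch-negatives contrastive loss; the training pairs are made up for illustration and are not the speaker's data or method.

```python
# Hedged sketch: contrastive fine-tuning of an embedding model on duplicate-question pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

# Each example pairs two phrasings of the same question; other pairs in the
# batch serve as negatives under MultipleNegativesRankingLoss.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "What's the way to reset my password?"]),
    InputExample(texts=["How do I cancel my subscription?",
                        "Can I stop my subscription?"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```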