04-18, 12:05–12:35 (US/Eastern), Auditorium 5
Large Language Models (LLMs) have opened new frontiers in natural language processing, but they often come with high inference costs and slow response times in production. In this talk, we’ll show how semantic caching using vector embeddings—particularly for frequently asked questions—can mitigate these issues in a RAG architecture. We’ll also discuss how we used contrastive fine-tuning to improve our embedding model’s ability to identify duplicate questions accurately. Attendees will leave with strategies for reducing infrastructure costs, improving RAG latency, and strengthening the reliability of their LLM-based applications. Basic familiarity with NLP or foundation models is helpful but not required.
Who Should Attend?
This talk is designed for AI engineers and researchers interested in building with LLMs in production. Attendees with a basic understanding of NLP and RAG systems will benefit most, but the concepts and demonstrations will be approachable for a general technical audience.
Why Is It Interesting?
As organizations incorporate LLMs into real-world products, they grapple with inference compute demands and sluggish response times. Semantic caching offers a pragmatic solution: once you identify frequently asked questions (or recurring queries), you can serve results from a cache rather than running a fresh, computationally expensive inference every time. This lowers cost and latency. Moreover, fine-tuning the retrieval models improves the accuracy of “question deduplication,” ensuring incoming queries are matched to cached entries reliably.
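To make the idea concrete, here is a minimal sketch of a semantic cache lookup. It assumes the sentence-transformers package, an in-memory cache, and a made-up similarity threshold; a production system would typically back the cache with a vector database and tune the threshold empirically. The call_llm helper is a placeholder for the full RAG pipeline, not an API from the talk.

```python
# Minimal semantic-cache sketch (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# Cache of (question embedding, stored answer) pairs.
cache: list[tuple[np.ndarray, str]] = []

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str, threshold: float = 0.85) -> str:
    """Serve from cache when a semantically similar question was seen before."""
    q_emb = model.encode(query)
    for emb, cached_answer in cache:
        if cosine(q_emb, emb) >= threshold:   # cache hit: skip LLM inference
            return cached_answer
    fresh_answer = call_llm(query)            # cache miss: run the expensive pipeline
    cache.append((q_emb, fresh_answer))
    return fresh_answer

def call_llm(query: str) -> str:
    # Placeholder for the expensive RAG + LLM pipeline.
    return f"LLM answer for: {query}"
```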
Key Takeaways
- Semantic Caching Fundamentals: How to design and implement a caching layer tailored for question-answering or conversational systems (RAG).
- Embedding Fine-Tuning: An overview of contrastive methods to improve embedding models’ ability to detect near-duplicate or semantically similar queries (see the sketch after this list).
- Practical Insights: Best practices for integrating semantic caching in production, along with tips for monitoring performance and keeping infrastructure costs down.
- Real-World Examples: Illustrations of these techniques drawn from production deployments.
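As a sketch of the contrastive fine-tuning mentioned above, the example below fine-tunes an off-the-shelf embedding model on duplicate-question pairs using an in-batch-negatives contrastive loss from the sentence-transformers library. The base model, question pairs, and hyperparameters are placeholders for illustration, not the exact setup used in the talk.

```python
# Illustrative contrastive fine-tuning of an embedding model for
# duplicate-question detection (placeholder data and hyperparameters).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # example base model

# Each InputExample holds a pair of questions known to be duplicates;
# other questions in the batch act as in-batch negatives.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "What are the steps to change my password?"]),
    InputExample(texts=["What is your refund policy?",
                        "Can I get my money back after a purchase?"]),
    # ... more labeled duplicate pairs ...
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Short training run for illustration; real fine-tuning uses many more
# pairs, more epochs, and a held-out evaluation set of duplicates.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=10)

model.save("finetuned-duplicate-question-encoder")
```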
Background Knowledge
- Minimal NLP/ML Knowledge: Familiarity with embeddings, vector similarity, and basic model inference is helpful.
- Basic Software Engineering: Familiarity with productionizing ML workflows will help contextualize the caching strategy.
Talk Outline (30 minutes)
- Introduction to LLM challenges in production (high inference cost, slow responses) with real-world examples.
- Overview of semantic caching: concepts, benefits, and common pitfalls.
- Improving cache hit rates with contrastive fine-tuning: what it is and how it enhances embedding models.
- Demo of improving duplicate question detection.
- Recap and system architecture review.
- Resources for further learning (GitHub links, additional reading, etc.).
By the end of this session, attendees will have a clear roadmap for employing semantic caching and contrastive fine-tuning to reduce costs and improve performance in LLM-powered applications. We look forward to sharing our experiences and answering your questions!
Tyler leads the Applied AI Engineering group at Redis, working hands-on with customers and partners on real-time GenAI and ML workloads. Previously, he led ML engineering at an early-stage eCommerce startup building novel search and recommendation systems. He graduated from the University of Virginia with a BS in Physics and an MS in Data Science. His passions include MLOps system design and working with LLMs to solve real problems. He also enjoys dispelling myths and building bridges in the tech community through knowledge and resource sharing.
Tyler and his wife Cynthia reside in Richmond, VA, where they enjoy hosting friends and family and soaking in the city's history, landmarks, nature, food, and creative scene.
Dr. Srijith Rajamohan currently leads AI Research at Redis, where he focuses on building efficient and scalable retrieval systems with GenAI. Prior to this role, he led the data science effort for Sage Copilot and led the team that created and deployed domain-specific LLMs to address the deficiencies of off-the-shelf models for accounting. He has also had stints at Databricks, where he led data science developer advocacy efforts, and at Nerdwallet as a data scientist. Before making the switch to the tech sector, he spent about six years in academia as a computational scientist at Virginia Tech.
I am a final-year PhD student in the Computer Science department at Virginia Tech. Currently, I am interning at Redis as a Machine Learning Engineer.