2025-11-09 – Tutorial Track 3
How can you use LLMs in professional settings where cloud APIs are off-limits due to cost, privacy, or compliance? In this talk, we’ll explore how to run powerful, open-source models like Mistral and LLaMA locally — and make them useful in the real world.
We’ll cover the engineering patterns, trade-offs, and deployment approaches that make local LLMs production-ready. You’ll learn how to build a private internal knowledge assistant that runs completely offline using RAG (retrieval-augmented generation), local embeddings, and quantized models. A short live demo will show it in action — answering organization-specific questions without sending a single token to the cloud.
This talk dives into the fast-growing space of local LLMs and their emerging role in secure, cost-sensitive, or regulated environments.
We’ll begin by examining:
- Why local LLMs are gaining adoption (privacy, control, cost)
- When it makes sense to use models like Mistral, Phi-3, or LLaMA 3 locally
- Key trade-offs: quality vs speed, quantization formats (GGUF, GPTQ), and hardware choices
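To make the hardware trade-off concrete, the memory footprint of a quantized model can be estimated from its parameter count and bits per weight. The helper below is a back-of-the-envelope sketch; the 1.2x overhead factor for KV cache and activations is an assumption, not a fixed rule:

```python
def approx_model_memory_gb(n_params_billion: float,
                           bits_per_weight: float,
                           overhead: float = 1.2) -> float:
    """Rough RAM/VRAM estimate for running a quantized model.

    overhead is a loose allowance for KV cache and activations
    (an illustrative assumption; tune for your context length).
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total * overhead / 1e9, 1)

# A 7B model at ~4.5 bits/weight (roughly a Q4_K_M GGUF):
print(approx_model_memory_gb(7, 4.5))   # → 4.7 (GB)
# The same model unquantized in FP16:
print(approx_model_memory_gb(7, 16))    # → 16.8 (GB)
```

This is why 4-bit GGUF variants of 7B models fit comfortably on laptops, while FP16 weights of the same model already push past a 16 GB GPU.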
Then we shift to ML engineering and deployment architecture:
- Serving local models efficiently: Ollama, vLLM, Transformers
- Optimizing inference: batching, streaming, CPU vs GPU, latency tips
- Embedding pipelines with LlamaIndex or LangChain
- Indexing and chunking strategies for scalable RAG
- Deployment approaches: Docker containers, on-premise clusters, offline workflows
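As a taste of the chunking strategies above, a minimal sliding-window splitter looks like this. The character-based sizes and the 50-character overlap are illustrative defaults, not recommendations; production pipelines typically split on tokens or sentence boundaries instead:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Sliding-window chunking by character count.

    Overlapping windows keep context that straddles a chunk boundary
    retrievable from either side. size/overlap are illustrative values.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

doc = "A" * 500
pieces = chunk_text(doc, size=200, overlap=50)
print(len(pieces))  # → 3
```

Each chunk then gets embedded and indexed; the overlap means the tail of one chunk reappears at the head of the next, which trades a little index size for better recall.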
To ground these ideas, the talk includes a short live demo:
A fully offline AI system that answers team-specific questions by searching across Slack exports, documentation, and meeting notes using a local LLM and vector store — no external API, no internet access.
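The retrieval half of that demo reduces to "embed the query, rank the chunks by similarity." The sketch below is a deliberately tiny stand-in: it uses a bag-of-words vector where the real system would use a local embedding model (e.g. a sentence-transformers model run offline) and a vector store, but the cosine-ranking shape is the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" -- a stand-in for a real local
    # embedding model; chosen only to keep the sketch dependency-free.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The deploy pipeline runs nightly at 02:00 UTC.",
    "Expense reports are filed through the finance portal.",
]
print(retrieve("when does the deploy pipeline run", chunks))
```

The retrieved chunks are then stuffed into the local model's prompt as context; nothing in this loop touches a network.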
Target Audience
- ML engineers, MLOps and platform teams
- Backend/infra engineers deploying AI tools internally
- Organizations in regulated industries (healthcare, finance, legal, defense)
- Developers looking to reduce GenAI cost or improve privacy
Previous knowledge expected
As a Data and Applied Scientist at Microsoft with 7 years of experience spanning multiple geographies, I specialize in harnessing the power of AI to transform products and user experiences. My work ranges from developing on-device AI models to implementing large language models that revolutionize how data science is practiced at scale. With a Master's degree in Computer Science and Artificial Intelligence from the University of Massachusetts Amherst, I bring both academic rigor and practical expertise to every challenge, consistently pushing the boundaries of what AI can achieve.