Schedule
The schedule is still being finalized; given the pace of innovation in this area, the following list is subject to change.
Color Legend: Presenter Reviewer Scribe
Introduction
- Aug 18
- Course Introduction
- Anand
- Paper Presentation Preferences (fill out the form)
- Aug 20
- Topics, Challenges & Tips
- Anand
- π How to Read a Paper
- π How to Give a Bad Talk
- π Writing Reviews for Systems Conferences
- π Challenges and Applications of Large Language Models
- π An Open Source Stack for AI Compute
- Aug 22
- Paper Presentation Preferences Due
Basics & Project
- Aug 25
- Background
- Vima
- π Attention Is All You Need
- π The Illustrated Transformer
- π The Illustrated Stable Diffusion
- π Multimodality and Large Multimodal Models
- π Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Aug 27
- Project Logistics
- Anand
- π How to Write a Great Research Paper
- π Hints and Principles for Computer System Design
- Aug 29
- Project Groups Formation Due
- Sep 1
- No class (Labor Day)
Pre-training
- Sep 3
- Parallelism Strategies
- π Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM Required
- π WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training Required
- π Zero Bubble (Almost) Pipeline Parallelism
- Sep 5
- Project Proposal Due
- Sep 8
- Dynamic Switching
- π Enabling Parallelism Hot Switching for Efficient Training of Large Language Models Required
- π Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections Required
- Sep 10
- Dealing with Issues
- π Understanding Stragglers in Large Model Training Using What-if Analysis
- π Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks Required
- π SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation Required
- Sep 15
- Resiliency
- π Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates Required
- π ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation Required
Post-Training
- Sep 17
- Parameter-Efficient Fine-Tuning
- π LoRA: Low-Rank Adaptation of Large Language Models Required
- π QLoRA: Efficient Finetuning of Quantized LLMs Required
- Sep 22
- Test-time Scaling
- π DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Required
- π Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters Required
- π Efficient Reasoning Models: A Survey
- π A Survey of Reasoning with Foundation Models
- Sep 24
- RL
- π HybridFlow: A Flexible and Efficient RLHF Framework Required
- π Optimizing RLHF Training for Large Language Models with Stage Fusion Required
- π AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- Sep 29
- Adaptive Representation
- π Flextron: Many-in-One Flexible Large Language Model Required
- π Matryoshka Representation Learning Required
- π Confident Adaptive Language Modeling
- π Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- Oct 1
- Mid-Semester Project Presentations
- Oct 6
- No class (Recess)
- Oct 8
- Mid-Semester Project Presentations
Inference
- Oct 13
- Single Instance Serving
- π NanoFlow: Towards Optimal Large Language Model Serving Throughput Required
- π DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving Required
- π Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
- π Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- π Orca: A Distributed Serving System for Transformer-Based Generative Models
- π Efficient Memory Management for Large Language Model Serving with PagedAttention
- Oct 15
- Multi-Instance Serving
- π Llumnix: Dynamic Scheduling for Large Language Model Serving Required
- π Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot Required
- π BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching
- π AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- Oct 20
- Diffusion
- π Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models Required
- π DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling Required
- Oct 22
- Multimodality
- π ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving Required
- π DREAM: A Dynamic Scheduler for Dynamic Real-Time Multi-Model ML Workloads Required
- Oct 27
- Multimodality - Vision I
- π LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models Required
- π A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
- π An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference Required
- π Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
- π [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
- Oct 29
- Multimodality - Vision II
- π MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Required
- π ViLA: Efficient Video-Language Alignment for Video Question Answering
- π M-LLM Based Video Frame Selection for Efficient Video Understanding Required
Agentic Systems
- Nov 3
- Workflow Optimization I
- π Parrot: Efficient Serving of LLM-based Applications with Semantic Variable Required
- π Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning Required
- Nov 5
- Workflow Optimization II
- π DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving Required
- π Towards End-to-End Optimization of LLM-based Applications with Ayo Required
- π Autellix: An Efficient Serving Engine for LLM Agents as General Programs
- Nov 10
- RAGs
- π METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation Required
- π TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
- π RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving Required
- π LEANN: A Low-Storage Vector Index
- Nov 12
- Reasoning
- π Efficiently Serving LLM Reasoning Programs with Certaindex Required
- π ReAct: Synergizing Reasoning and Acting in Language Models Required
- Nov 17
- Applications I
- π Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents Required
- π Mathematical Discoveries from Program Search with Large Language Models (FunSearch)
- π AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery Required
- Nov 19
- Applications II
- π NetLLM: Adapting Large Language Models for Networking Required
- π TextGrad: Automatic "Differentiation" via Text Required
- π Building AI Agents for Autonomous Clouds: Challenges and Design Principles
Hardware
- Nov 24
- Infrastructure Considerations for AI
- Vima
- π WaferLLM: Large Language Model Inference at Wafer Scale
- π Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- π USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
- π FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- π Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
Conclusion
- Nov 24
- Course Wrap-Up
- Anand
- Dec 1
- Final Project Poster Presentation
- Dec 8
- Final Project Report + Code Due