Schedule
The schedule is still being finalized; given the pace of innovation in this area, the following list is subject to change.
Color Legend: Presenter Reviewer Scribe
Introduction
- Aug 18
- Course Introduction
- Anand
- Paper Presentation Preferences (fill out the form)
- Aug 20
- Topics, Challenges & Tips
- Anand
- π How to Read a Paper
- π How to Give a Bad Talk
- π Writing Reviews for Systems Conferences
- π Challenges and Applications of Large Language Models
- π An Open Source Stack for AI Compute
- Aug 22
- Paper Presentation Preferences Due
Basics & Project
- Aug 25
- Background
- Vima
- π Attention Is All You Need
- π The Illustrated Transformer
- π The Illustrated Stable Diffusion
- π Multimodality and Large Multimodal Models
- π Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Aug 27
- Project Logistics
- Anand
- π How to Write a Great Research Paper
- π Hints and Principles for Computer System Design
- Aug 29
- Project Groups Formation Due
- Sep 1
- No class (Labor Day)
Pre-training
- Sep 3
- Parallelism Strategies
- π Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM Required
- π WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training Required
- π Zero Bubble (Almost) Pipeline Parallelism
- Sep 5
- Project Proposal Due
- Sep 8
- Dynamic Switching
- π Enabling Parallelism Hot Switching for Efficient Training of Large Language Models Required
- π Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections Required
- Sep 10
- Dealing with Issues
- π Understanding Stragglers in Large Model Training Using What-if Analysis
- π Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks Required
- π SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation Required
- Sep 15
- Resiliency
- π Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates Required
- π ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation Required
Post-Training
- Sep 17
- Parameter-Efficient Fine-Tuning
- π LoRA: Low-Rank Adaptation of Large Language Models Required
- π QLoRA: Efficient Finetuning of Quantized LLMs Required
- Sep 22
- Test-time Scaling
- π DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Required
- π Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters Required
- π Efficient Reasoning Models: A Survey
- π A Survey of Reasoning with Foundation Models
- Sep 24
- RL
- π HybridFlow: A Flexible and Efficient RLHF Framework Required
- π Optimizing RLHF Training for Large Language Models with Stage Fusion Required
- π AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- Sep 29
- Adaptive Representation
- π Flextron: Many-in-One Flexible Large Language Model Required
- π Matryoshka Representation Learning Required
- π Confident Adaptive Language Modeling
- π Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- Oct 1
- Mid-Semester Project Presentations
- Oct 6
- No class (Recess)
- Oct 8
- Mid-Semester Project Presentations
Inference
- Oct 13
- Single Instance Serving
- π NanoFlow: Towards Optimal Large Language Model Serving Throughput Required
- π DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving Required
- π Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
- π Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- π Orca: A Distributed Serving System for Transformer-Based Generative Models
- π Efficient Memory Management for Large Language Model Serving with PagedAttention
- Oct 15
- Multi-Instance Serving
- π Llumnix: Dynamic Scheduling for Large Language Model Serving Required
- π Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot Required
- π BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching
- π AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- Oct 20
- Diffusion
- π Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models Required
- π DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling Required
- Oct 22
- Multimodality
- π ModServe: Scalable and Resource-Efficient Large Multimodal Model Serving Required
- π DREAM: A Dynamic Scheduler for Dynamic Real-Time Multi-Model ML Workloads Required
- Oct 27
- Multimodality - Vision I
- π LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models Required
- π A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
- π An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference Required
- π Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
- π [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
- Oct 29
- Multimodality - Vision II
- π MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Required
- π ViLA: Efficient Video-Language Alignment for Video Question Answering
- π M-LLM Based Video Frame Selection for Efficient Video Understanding Required
Agentic Systems
- Nov 3
- Workflow Optimization I
- π Parrot: Efficient Serving of LLM-based Applications with Semantic Variable Required
- π Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning Required
- Nov 5
- Workflow Optimization II
- π DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving Required
- π Towards End-to-End Optimization of LLM-based Applications with Ayo Required
- π Autellix: An Efficient Serving Engine for LLM Agents as General Programs
- Nov 10
- RAGs
- π METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation Required
- π TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
- π RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving Required
- π LEANN: A Low-Storage Vector Index
- Nov 12
- Reasoning
- π Efficiently Serving LLM Reasoning Programs with Certaindex Required
- π ReAct: Synergizing Reasoning and Acting in Language Models Required
- Nov 17
- Applications I
- π Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents Required
- π Mathematical Discoveries from Program Search with Large Language Models (FunSearch)
- π AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery Required
- Nov 19
- Applications II
- π NetLLM: Adapting Large Language Models for Networking Required
- π TextGrad: Automatic "Differentiation" via Text Required
- π Building AI Agents for Autonomous Clouds: Challenges and Design Principles
Hardware
- Nov 24
- Infrastructure Considerations for AI
- Vima
- π WaferLLM: Large Language Model Inference at Wafer Scale
- π Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- π USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
- π FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
- π Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
Conclusion
- Nov 24
- Course Wrap-Up
- Anand
- Dec 1
- Final Project Poster Presentation
- Dec 8
- Final Project Report + Code Due