Our main research
Efficient LLM Serving Systems
Collaborators: ETRI, Microsoft Research Redmond, Samsung Research, KAIST, Samsung DS
- SLO-aware, adaptive KV cache management
- Generalized KV cache reuse
- Accuracy-preserving long-context pruning
- SLO-aware failure recovery for LLMs
- Systems for agentic AI
Continual and On-Device Learning
Collaborators: Palantir, SNU, UIUC
- Low-latency, high-accuracy on-device learning
- Adaptive, resource-efficient continual learning
Large-Scale Distributed Training
Collaborators: Samsung Research, KAIST, UC Merced, USC
- Learning-based planning for heterogeneous, geo-distributed training
- Fast distributed training on heterogeneous accelerators
Fast and Scalable Big Data Analytics
Collaborators: Amazon, Samsung Electronics, SNU
- Efficient caching for iterative analytics
- Data preprocessing for scalable ML pipelines
Research details
SLO-aware, adaptive KV cache management
SLO-aware, adaptive KV cache management is a system-level optimization for Large Language Model (LLM) serving that dynamically balances memory usage and performance to meet specific Service Level Objectives (SLOs), such as Time-To-First-Token (TTFT) or Time-Between-Tokens (TBT).
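The core decision can be sketched in a few lines. This is a minimal, illustrative policy (all names and numbers are hypothetical, not a real serving system): when GPU KV memory exceeds a budget, offload cache from the requests with the most SLO headroom first, so that tight-deadline requests keep their caches resident.

```python
# Minimal sketch of SLO-aware KV cache offloading (hypothetical names).
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    kv_blocks: int        # KV cache blocks this request holds on the GPU
    tbt_slack_ms: float   # TBT SLO target minus measured TBT (positive = headroom)

def pick_offload_victims(requests, used_blocks, budget_blocks):
    """Return request ids to offload until GPU KV usage fits the budget."""
    victims = []
    # Offload the requests with the most SLO slack first: evicting their
    # cache is least likely to cause an SLO violation.
    for r in sorted(requests, key=lambda r: r.tbt_slack_ms, reverse=True):
        if used_blocks <= budget_blocks:
            break
        victims.append(r.req_id)
        used_blocks -= r.kv_blocks
    return victims
```

A real system would also weigh block counts and offload bandwidth; the sketch only captures the slack-first ordering.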
Generalized KV cache reuse
Generalized KV Cache Reuse is an advanced optimization for LLM serving that allows the system to reuse KV cache segments from any part of a prompt, moving beyond the limitations of traditional prefix caching.
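The lookup side of the idea can be sketched as content-addressed chunk matching (a simplification with hypothetical names): prompts are split into fixed-size token chunks keyed by content, so a chunk cached from the middle of one prompt can be matched in another prompt, not only as a shared prefix.

```python
# Minimal sketch of non-prefix KV cache matching (illustrative only).
CHUNK = 4  # tokens per cacheable chunk (hypothetical granularity)

def chunk_keys(tokens, chunk=CHUNK):
    """Split a token list into full-size chunks, keyed by their content."""
    return [tuple(tokens[i:i + chunk])
            for i in range(0, len(tokens) - chunk + 1, chunk)]

def reuse_plan(tokens, cache):
    """Return (hits, misses): which chunks can skip prefill vs. must be computed."""
    hits, misses = [], []
    for key in chunk_keys(tokens):
        (hits if key in cache else misses).append(key)
    return hits, misses
```

The hard part omitted here is correctness: reusing a chunk at a new position requires fixing up position-dependent state (e.g., positional encodings and cross-chunk attention), which is what distinguishes generalized reuse from simple prefix caching.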
Accuracy-preserving long-context pruning
Accuracy-preserving long-context pruning is a technique used to compress the KV cache of Large Language Models (LLMs) by removing less important tokens without degrading the model's performance on long-context tasks.
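One common family of such techniques scores cached tokens by accumulated attention and drops the lowest-scoring ones, while always protecting a recent window that upcoming decoding steps attend to heavily. A minimal sketch of that selection step (scores and window size are illustrative):

```python
# Minimal sketch of attention-score-based KV pruning (illustrative).
def prune_kv(scores, keep, recent_window=2):
    """Return sorted indices of tokens to keep.

    scores: accumulated attention score per cached token.
    keep:   total token budget after pruning.
    """
    n = len(scores)
    # The most recent tokens are always kept: decoding attends to them heavily.
    protected = set(range(max(0, n - recent_window), n))
    # Rank the remaining tokens by importance and fill the leftover budget.
    ranked = sorted((i for i in range(n) if i not in protected),
                    key=lambda i: scores[i], reverse=True)
    kept = protected | set(ranked[:max(0, keep - len(protected))])
    return sorted(kept)
```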
SLO-aware failure recovery for LLMs
SLO-aware failure recovery for LLMs is a resilience strategy designed to recover from hardware or software failures while minimizing violations of Service Level Objectives (SLOs) such as latency and throughput targets.
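The key decision is which recovery path to take given how much latency budget remains before an SLO is breached. A minimal sketch (action names and costs are hypothetical): prefer the cheapest available action that still fits the remaining slack, e.g., restoring KV state from a replica is far faster than recomputing the prefill from scratch.

```python
# Minimal sketch of an SLO-aware recovery decision (hypothetical costs).
ACTIONS = [                                 # (name, estimated recovery time in ms)
    ("restore_from_replica", 50.0),
    ("restore_from_host_memory", 400.0),
    ("recompute_prefill", 2000.0),
]

def choose_recovery(slack_ms, available):
    """Pick the first available action whose cost fits the SLO slack."""
    for name, cost in ACTIONS:
        if name in available and cost <= slack_ms:
            return name
    # No action fits the budget: recompute anyway and accept the SLO miss.
    return "recompute_prefill"
```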
Systems for agentic AI
Systems for Agentic AI are specialized software architectures designed to support AI agents that do more than chat: they independently plan, use tools, and execute multi-step tasks to achieve a goal.
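The characteristic workload is a plan-act-observe loop rather than a single request-response. A minimal sketch of such a loop (the planner, tools, and step format are all hypothetical stand-ins for model calls and tool APIs):

```python
# Minimal sketch of an agent execution loop (all names hypothetical).
def run_agent(planner, tools, goal, max_steps=8):
    """Repeatedly ask the planner for a step, run the tool, feed back the result."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        step = planner(history)              # e.g. ("calc", "2+3") or ("done", answer)
        if step[0] == "done":
            return step[1]
        observation = tools[step[0]](step[1])
        history.append((step[0], observation))
    return None                              # step budget exhausted
```

From a systems perspective, this loop is what makes agentic serving hard: each iteration re-enters the model with a growing history, so KV reuse, scheduling, and tool-call latency all compound across steps.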
Low-latency, high-accuracy on-device learning
Low-latency, high-accuracy on-device learning refers to the ability of a local device (like a smartphone, IoT sensor, or medical device) to train and adapt its AI models locally with low latency, without relying on cloud servers.
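In its simplest form, this means running a few optimization steps directly on fresh local data so no raw data leaves the device. A minimal sketch with a 1-D linear model and plain SGD standing in for a real on-device training runtime:

```python
# Minimal sketch of local model adaptation (pure-Python stand-in).
def sgd_adapt(w, b, data, lr=0.1, epochs=200):
    """Fit y ~ w*x + b on local (x, y) pairs with plain SGD."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x      # gradient of 0.5*err^2 w.r.t. w
            b -= lr * err          # gradient of 0.5*err^2 w.r.t. b
    return w, b
```

Real on-device learning adds the parts this sketch omits: quantized or partial-layer updates, memory budgets, and scheduling training around foreground workloads.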
Adaptive, resource-efficient continual learning
Adaptive, resource-efficient continual learning is an AI paradigm that allows models to learn from a constant stream of new data while operating within the strict physical limits of local hardware (like smartphones or IoT sensors).
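A common building block here is a fixed-capacity replay buffer: the model rehearses a bounded sample of past data to avoid forgetting, no matter how long the stream runs. A minimal sketch using reservoir sampling, which keeps every stream item equally likely to be retained within a fixed memory budget:

```python
# Minimal sketch of a memory-bounded replay buffer (reservoir sampling).
import random

class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        """Keep each of the `seen` items with equal probability capacity/seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item
```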
Learning-based planning for heterogeneous, geo-distributed training
Learning-based planning for heterogeneous, geo-distributed training is an intelligent orchestration strategy used to train large AI models across multiple, physically separated data centers with varying hardware capabilities.
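At its core, planning means scoring candidate assignments of model stages to sites with a cost model and picking the cheapest. In a learning-based planner the cost model would itself be learned from profiling runs; the sketch below uses a simple analytic stand-in (site speeds and link costs are hypothetical):

```python
# Minimal sketch of cost-model-driven plan selection (illustrative).
def plan_cost(assignment, site_speed, link_ms):
    """assignment: one site name per pipeline stage.
    Cost = per-stage compute time + a fixed penalty per cross-site boundary."""
    compute = sum(1.0 / site_speed[s] for s in assignment)
    comm = sum(link_ms for a, b in zip(assignment, assignment[1:]) if a != b)
    return compute + comm

def best_plan(candidates, site_speed, link_ms=5.0):
    """Pick the candidate assignment with the lowest modeled cost."""
    return min(candidates, key=lambda a: plan_cost(a, site_speed, link_ms))
```

The sketch shows why grouping consecutive stages on one site wins when inter-site links are slow: each extra boundary pays the full link penalty.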
Fast distributed training on heterogeneous accelerators
Fast distributed training on heterogeneous accelerators is a technique to speed up AI model training by efficiently combining different types of hardware (e.g., NVIDIA GPUs, AMD GPUs, and specialized TPUs) into a single, cohesive system.
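A basic ingredient is load balancing: splitting each global batch across devices in proportion to their measured throughput, so fast and slow accelerators finish a step at roughly the same time instead of the fast ones idling. A minimal sketch (device names and throughputs are illustrative):

```python
# Minimal sketch of throughput-proportional batch splitting (illustrative).
def split_batch(global_batch, throughput):
    """Return {device: local batch size}, summing exactly to global_batch."""
    total = sum(throughput.values())
    shares = {d: int(global_batch * t / total) for d, t in throughput.items()}
    # Hand out rounding leftovers to the fastest devices first.
    leftover = global_batch - sum(shares.values())
    for d in sorted(throughput, key=throughput.get, reverse=True):
        if leftover == 0:
            break
        shares[d] += 1
        leftover -= 1
    return shares
```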
Efficient caching for iterative analytics
Efficient caching for iterative analytics is a strategy designed to speed up data-heavy tasks—like machine learning training or big data queries—by intelligently storing and reusing data that is accessed repeatedly.
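Unlike plain LRU, such policies weigh how expensive a dataset is to recompute and how often each iteration re-reads it, normalized by its size. A minimal sketch of one such eviction ordering (the entry format and the value function are illustrative, not a specific system's policy):

```python
# Minimal sketch of cost-aware cache eviction ordering (illustrative).
def evict_order(entries):
    """entries: {name: (recompute_cost, accesses_per_iter, size)}.
    Return names ordered from least to most valuable per unit of cache space."""
    def value(item):
        _, (cost, freq, size) = item
        # Benefit of keeping it cached: recomputation saved per byte occupied.
        return cost * freq / size
    return [name for name, _ in sorted(entries.items(), key=value)]
```

Under this ordering, a cheap-to-rebuild raw input is evicted before an expensive, frequently re-read intermediate result of the same size.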
Data preprocessing for scalable ML pipelines
Data preprocessing for scalable ML pipelines is the process of transforming raw data into a clean, usable format at a scale that handles millions or billions of records without exhausting memory or compute.
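A standard pattern for keeping memory bounded is to stream records through composable stages instead of materializing the full dataset. A minimal sketch with Python generators standing in for a distributed preprocessing framework (stage names and the CSV-like format are illustrative):

```python
# Minimal sketch of a streaming preprocessing pipeline (illustrative).
def parse(lines):
    """Split raw text lines into field lists."""
    for line in lines:
        yield line.strip().split(",")

def clean(rows):
    """Drop rows with any empty field."""
    for row in rows:
        if all(field for field in row):
            yield row

def featurize(rows):
    """Convert (name, value) rows into typed feature tuples."""
    for name, value in rows:
        yield (name, float(value))

def pipeline(lines):
    # Memory use is bounded by the records in flight, not the dataset size.
    return featurize(clean(parse(lines)))
```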