Our main research

Efficient LLM Serving Systems

Collaborators: ETRI, Microsoft Research Redmond, Samsung Research, KAIST, Samsung DS

  • SLO-aware, adaptive KV cache management
  • Generalized KV cache reuse
  • Accuracy-preserving long-context pruning
  • SLO-aware failure recovery for LLMs
  • Systems for agentic AI

Continual and On-Device Learning

Collaborators: Palantir, SNU, UIUC

  • Low-latency, high-accuracy on-device learning
  • Adaptive, resource-efficient continual learning

Large-Scale Distributed Training

Collaborators: Samsung Research, KAIST, UC Merced, USC

  • Learning-based planning for heterogeneous, geo-distributed training
  • Fast distributed training on heterogeneous accelerators

Fast and Scalable Big Data Analytics

Collaborators: Amazon, Samsung Electronics, SNU

  • Efficient caching for iterative analytics
  • Data preprocessing for scalable ML pipelines

Research details

SLO-aware, adaptive KV cache management

SLO-aware, adaptive KV cache management is a system-level optimization for Large Language Model (LLM) serving that dynamically balances memory usage and performance to meet specific Service Level Objectives (SLOs), such as Time-To-First-Token (TTFT) or Time-Between-Tokens (TBT).
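
A minimal sketch of the idea, using a hypothetical eviction helper (all names and the slack-based policy are illustrative, not our actual system): when KV memory runs out, preempt the requests with the most SLO headroom, since they can best absorb a recompute delay.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    kv_blocks: int      # KV cache blocks this request currently holds
    slack_ms: float     # estimated headroom before its SLO (e.g. TBT) is violated

def evict_for(new_blocks: int, active: list, capacity: int) -> list:
    """Free enough KV blocks for an incoming request by preempting
    the active requests with the largest SLO slack first.
    Returns the ids of the evicted requests."""
    used = sum(r.kv_blocks for r in active)
    need = used + new_blocks - capacity
    evicted = []
    for r in sorted(active, key=lambda r: r.slack_ms, reverse=True):
        if need <= 0:
            break
        evicted.append(r.req_id)
        need -= r.kv_blocks
    return evicted
```

A real serving engine would combine this with paging, swapping to host memory, and online SLO estimation; the sketch only shows the SLO-aware admission/eviction decision.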

Generalized KV cache reuse

Generalized KV Cache Reuse is an advanced optimization for LLM serving that allows the system to reuse KV cache segments from any part of a prompt, moving beyond the limitations of traditional prefix caching.
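
One way to illustrate the contrast with prefix caching (a toy sketch; the chunking and hashing scheme here is hypothetical): hash fixed-size chunks of the prompt so cached KV segments can be matched anywhere in the prompt, not only when the prompts share a leading prefix.

```python
import hashlib

def chunk_keys(token_ids: list, chunk: int = 4) -> list:
    """Split a prompt into fixed-size chunks and hash each one,
    so KV segments can be looked up position-independently."""
    keys = []
    for i in range(0, len(token_ids) - chunk + 1, chunk):
        h = hashlib.sha1(str(token_ids[i:i + chunk]).encode()).hexdigest()
        keys.append(h)
    return keys

def reusable_chunks(prompt: list, cache: set, chunk: int = 4) -> int:
    """Count how many chunks of this prompt already have cached KV."""
    return sum(k in cache for k in chunk_keys(prompt, chunk))
```

In the example below, a strict prefix cache would find nothing to reuse because the prompts diverge at the first token, while chunk-level matching still recovers the shared middle segment (real systems must also correct for positional effects, which the sketch omits).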

Accuracy-preserving long-context pruning

Accuracy-preserving long-context pruning is a technique used to compress the KV cache of Large Language Models (LLMs) by removing less important tokens without degrading the model's performance on long-context tasks.
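
A simplified sketch of one common family of policies (a heavy-hitter-style heuristic; the function and its parameters are illustrative): always retain a recent window of tokens, then keep the highest-attention older tokens until the budget is met, evicting the rest from the KV cache.

```python
def prune_kv(attn_scores: list, keep: int, recent: int) -> list:
    """Select token positions to keep in the KV cache: retain the
    `recent` most recent tokens unconditionally, then fill the
    remaining budget with the highest-attention earlier tokens."""
    n = len(attn_scores)
    kept = set(range(max(0, n - recent), n))
    older = sorted(range(max(0, n - recent)),
                   key=lambda i: attn_scores[i], reverse=True)
    for i in older:
        if len(kept) >= keep:
            break
        kept.add(i)
    return sorted(kept)
```

Accuracy-preserving variants go further, e.g. calibrating scores per attention head or per layer so that tokens critical to long-range reasoning are never dropped.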

SLO-aware failure recovery for LLMs

SLO-aware failure recovery for LLMs is a resilience strategy designed to recover from hardware or software failures while strictly minimizing the impact on Service Level Objectives (SLOs) like latency and throughput.
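
The core decision can be sketched as a small cost comparison (a hypothetical planner, with made-up cost inputs): after a failure, choose between recomputing the lost KV state from scratch and restoring a snapshot, picking whichever path still fits the request's remaining latency budget.

```python
def recovery_plan(tokens_done: int, ms_per_token_recompute: float,
                  snapshot_restore_ms: float, slo_budget_ms: float) -> str:
    """Pick the recovery path for an interrupted request: replay the
    prefill/decode from scratch vs. restore a KV snapshot. If neither
    fits the remaining SLO budget, signal a violation so the scheduler
    can shed or degrade the request instead."""
    recompute_ms = tokens_done * ms_per_token_recompute
    if min(recompute_ms, snapshot_restore_ms) > slo_budget_ms:
        return "violate"
    return "recompute" if recompute_ms <= snapshot_restore_ms else "restore"
```

A production system would estimate these costs online and coordinate recovery across replicas; the sketch only shows the per-request SLO-aware choice.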

Systems for agentic AI

Systems for Agentic AI are specialized software architectures designed to support AI agents that don't just "chat," but independently plan, invoke tools, and execute multi-step tasks to achieve a high-level goal.
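
The control flow such systems must serve can be reduced to a plan-act-observe loop (a deliberately minimal sketch; `plan_fn` stands in for an LLM planner and the tool registry is hypothetical):

```python
def run_agent(goal, plan_fn, tools: dict, max_steps: int = 5) -> list:
    """Minimal agent loop: a planner proposes (tool, args) steps, the
    runtime executes each tool and feeds the observation back into the
    history, until the planner returns None to signal completion."""
    history = []
    for _ in range(max_steps):
        step = plan_fn(goal, history)
        if step is None:          # planner is satisfied with the result
            break
        tool, args = step
        observation = tools[tool](*args)
        history.append((tool, args, observation))
    return history
```

From a systems perspective, the interesting problems live around this loop: scheduling many concurrent agents, caching KV state across their repeated LLM calls, and isolating tool execution.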

Low-latency, high-accuracy on-device learning

Low-latency, high-accuracy on-device learning refers to the ability of a local device (like a smartphone, IoT sensor, or medical device) to train and adapt its AI models locally, with low latency and without round-trips to cloud servers.
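
At its simplest, on-device adaptation is an online update on a single local example (a toy sketch with a linear model; real on-device learners use far more sophisticated models and update rules under tight memory budgets):

```python
def sgd_step(w: list, x: list, y: float, lr: float = 0.1) -> list:
    """One online SGD update for a linear model: adapt the weights
    to a single locally observed example, with no server involved."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    err = pred - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]
```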

Adaptive, resource-efficient continual learning

Adaptive, resource-efficient continual learning is an AI paradigm that allows models to learn from a constant stream of new data while operating within the strict physical limits of local hardware (like smartphones or IoT sensors).
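
A standard building block here is rehearsal under a fixed memory budget. As a sketch (reservoir sampling is one classic choice; the class below is illustrative, not our specific method), a bounded replay buffer keeps a uniform sample of the data stream so earlier tasks can be revisited without unbounded storage:

```python
import random

class ReplayBuffer:
    """Fixed-size reservoir buffer: maintains a uniform random sample
    of an unbounded stream within a fixed memory budget, so old data
    can be rehearsed to mitigate catastrophic forgetting."""
    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item
```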

Learning-based planning for heterogeneous, geo-distributed training

Learning-based planning for heterogeneous, geo-distributed training is an intelligent orchestration strategy used to train large AI models across multiple, physically separated data centers with varying hardware capabilities.
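
The planning problem can be illustrated with a toy analytical cost model (all numbers and the two-dimensional plan space are hypothetical simplifications): more pipeline parallelism within a site shrinks per-replica compute time, while more data parallelism inflates gradient synchronization over the slow inter-site WAN, and the planner searches for the best trade-off.

```python
def best_plan(n_gpus: int, step_flops: float, gpu_tflops: float,
              grad_gb: float, wan_gbps: float) -> dict:
    """Enumerate (data-parallel, pipeline-parallel) degrees and pick
    the one a toy cost model predicts is fastest per training step."""
    best = None
    for dp in range(1, n_gpus + 1):
        if n_gpus % dp:
            continue
        pp = n_gpus // dp
        compute_s = step_flops / (pp * gpu_tflops * 1e12)
        # ring all-reduce of gradients across dp replicas over the WAN
        sync_s = (dp - 1) / dp * grad_gb * 8 / wan_gbps
        candidate = (compute_s + sync_s, dp, pp)
        if best is None or candidate < best:
            best = candidate
    return {"dp": best[1], "pp": best[2], "step_s": best[0]}
```

A learning-based planner replaces the closed-form cost model with one fitted to profiled measurements, which matters when hardware and links are heterogeneous and hard to model analytically.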

Fast distributed training on heterogeneous accelerators

Fast distributed training on heterogeneous accelerators is a technique to speed up AI model training by efficiently combining different types of hardware (e.g., NVIDIA GPUs, AMD GPUs, and specialized TPUs) into a single, cohesive system.
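
One ingredient is throughput-proportional load balancing. As a minimal sketch (the proportional policy is a common baseline, not our full method): split each global batch across devices in proportion to their measured speed, so fast and slow accelerators finish a step at roughly the same time instead of the fastest device idling.

```python
def split_batch(batch_size: int, tflops: list) -> list:
    """Split a global batch across heterogeneous accelerators in
    proportion to their measured throughput; any rounding leftover
    goes to the fastest device."""
    total = sum(tflops)
    shares = [int(batch_size * t / total) for t in tflops]
    shares[tflops.index(max(tflops))] += batch_size - sum(shares)
    return shares
```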

Efficient caching for iterative analytics

Efficient caching for iterative analytics is a strategy designed to speed up data-heavy tasks—like machine learning training or big data queries—by intelligently storing and reusing data that is accessed repeatedly.
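
The key departure from plain LRU can be sketched in a few lines (a toy benefit-density policy; the field names are illustrative): rank cached datasets by how much recomputation they save per byte, weighting by expected future reuse, and evict the lowest-value entries first.

```python
def evict_order(entries: dict) -> list:
    """Rank cached datasets for eviction by benefit density:
    (recompute cost x expected future reuses) per megabyte.
    Entries that are cheap to rebuild, rarely reused, or very
    large are evicted first when memory runs short."""
    def value(e):
        return e["cost_s"] * e["future_refs"] / e["size_mb"]
    return sorted(entries, key=lambda k: value(entries[k]))
```

Note how a dataset that will never be referenced again is the first to go regardless of how recently it was used, which is exactly where recency-based policies fall short for iterative workloads.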

Data preprocessing for scalable ML pipelines

Data preprocessing for scalable ML pipelines is the process of transforming raw data into a clean, usable format at a scale that can handle millions or billions of records without exhausting memory or stalling the training pipeline.
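
A minimal sketch of the streaming style this requires (the cleaning and encoding steps are placeholders): compose the stages as lazy generators so records flow through cleaning, filtering, and encoding one at a time, and the pipeline never materializes the full dataset in memory.

```python
def preprocess(rows, vocab: dict):
    """Streaming preprocessing pipeline: normalize, filter, and
    integer-encode records lazily, so it scales to inputs far
    larger than memory."""
    cleaned = (r.strip().lower() for r in rows)   # normalize each record
    kept = (r for r in cleaned if r)              # drop empty records
    for r in kept:
        # assign each unseen token the next integer id
        yield [vocab.setdefault(tok, len(vocab)) for tok in r.split()]
```

Production pipelines add sharding and parallel workers on top, but the lazy, record-at-a-time structure is what keeps preprocessing from becoming the bottleneck ahead of the trainers.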