Garen: Reliable Cluster Management with Atomic State Reconciliation

Abstract: Modern cluster managers orchestrate large-scale services and resources through a set of controllers, each managing a specific part of the cluster by iteratively reconciling the current cluster state toward the desired state. However, controllers are prone to various state inconsistencies stemming from asynchrony, concurrency, and failures, which pose significant challenges to reliable cluster operation. Our analysis reveals that many consistency bugs remain unresolved, and that even proposed fixes are often rejected because of side effects such as backward incompatibility or performance loss.
We present Garen, a system that implements atomic state reconciliation (ASR), a concept that ensures the atomicity and consistency of reconciliation to protect the cluster against state inconsistencies. Designing Garen requires overcoming several challenges. First, to ensure high scalability, Garen must detect conflicts within the minimal set of cluster states directly involved in the reconciliation process. These states, however, are inherently dynamic because reconciliation logic executes conditionally. Garen addresses this by decomposing the reconciliation logic into smaller, independent blocks that serve as units of conditional execution. Each block is evaluated to determine whether it contributes to state transitions, which confirms its eligibility for inclusion in the ASR. Moreover, Garen ensures that all state transitions within an ASR comply with cluster-wide constraints and are committed to the data store without incurring false conflicts. In real-world case studies, Garen resolves 17 previously unresolved consistency bugs without introducing side effects. It also maintains scalability by sustaining the API request rate across various cluster sizes, with a latency overhead of less than 3%.
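
To make the idea in the abstract concrete, the following is a minimal Python sketch of one way to read it, not Garen's actual implementation. Every name here (`Store`, `Txn`, `reconcile`, `scale_block`, `taint_block`, and the state keys) is hypothetical, and the conflict check is plain optimistic version validation, which is an assumption rather than something the abstract specifies. Reconciliation logic is split into independent blocks, only blocks that produce a state transition contribute their reads to the conflict-detection set, and all resulting writes are committed together or not at all.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class ConflictError(Exception):
    """Raised when a state the reconciliation depended on changed before commit."""


@dataclass
class Store:
    """Toy versioned key-value store standing in for the cluster data store."""
    data: Dict[str, Tuple[int, object]] = field(default_factory=dict)

    def get(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self.data.get(key, (0, None))

    def put(self, key, value):
        version, _ = self.get(key)
        self.data[key] = (version + 1, value)

    def commit(self, read_versions, writes):
        # Validate only the states the reconciliation actually depended on.
        for key, seen in read_versions.items():
            if self.get(key)[0] != seen:
                raise ConflictError(key)
        # Validation passed: apply every write (all or nothing).
        for key, value in writes.items():
            self.put(key, value)


class Txn:
    """Records the reads and buffered writes of a single block."""

    def __init__(self, store):
        self.store = store
        self.reads = {}
        self.writes = {}

    def read(self, key):
        version, value = self.store.get(key)
        self.reads.setdefault(key, version)
        return value

    def write(self, key, value):
        self.writes[key] = value


def reconcile(store, blocks):
    """Run each block; include it in the atomic unit only if it writes."""
    reads, writes = {}, {}
    for block in blocks:
        txn = Txn(store)
        block(txn)
        if txn.writes:  # the block contributes a state transition
            reads.update(txn.reads)
            writes.update(txn.writes)
    store.commit(reads, writes)
    return reads, writes


def scale_block(txn):
    # Hypothetical block: bring the actual replica count to the desired one.
    desired = txn.read("deploy/web.replicas")
    if txn.read("rs/web.replicas") != desired:
        txn.write("rs/web.replicas", desired)


def taint_block(txn):
    # Hypothetical block: acts only when the node is not ready.
    if txn.read("node/n1.ready") is False:
        txn.write("node/n1.taint", "unreachable")


store = Store()
store.put("deploy/web.replicas", 3)
store.put("rs/web.replicas", 1)
store.put("node/n1.ready", True)
reads, writes = reconcile(store, [scale_block, taint_block])
```

In this run `taint_block` reads `node/n1.ready` but writes nothing, so that key stays out of the validated read set, and a concurrent update to the node's readiness would not abort the replica change. This is the sketch's analogue of avoiding false conflicts by keeping the checked state set minimal.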

  • Authors: Mingi Kim, Ahnjae Shin, Jaewoo Maeng, Myeongjae Jeon, Byung-Gon Chun
  • Submission: EuroSys, Apr. 2026