Kim, C. (CSE) - Toward Adaptive Graph Processing and Fault-Tolerant Agentic Inference on Heterogeneous Distributed Systems

Edge computing and distributed AI systems increasingly operate under heterogeneous resources, dynamic workloads, and frequent failures, requiring both adaptivity and fault tolerance for efficient execution. In heterogeneous edge clusters, nodes differ significantly in CPU throughput, memory capacity, and network bandwidth, while modern distributed GPU clusters supporting agentic LLM inference must recover large amounts of runtime state under routine failures. This dissertation addresses these challenges through two systems: Zsiga, an adaptive distributed graph processing system for heterogeneous edge clusters, and Forte, a fault-tolerant KV cache recovery system for distributed agentic LLM inference.
Zsiga improves connected component computation through capacity-aware graph partitioning and runtime-adaptive boundary migration, reducing execution time by up to 90.9% while eliminating out-of-memory failures under heterogeneous resource constraints. Forte addresses KV cache recovery for long-running agentic inference workloads, where failures can erase accumulated reasoning trajectories and tool interaction histories. Forte exploits the observation that not all KV blocks are equally critical, introducing criticality-aware erasure coding, domain-diverse placement, and prioritized foreground recovery to enable efficient recovery under correlated failures. Experimental results show that Forte is the only evaluated scheme that successfully resumes execution under correlated domain failures, reducing foreground stall by 89.7% and end-to-end recovery latency by 50.6–58.9% at 2.0× memory overhead. Together, these systems demonstrate how adaptivity and fault tolerance can improve the efficiency and resilience of distributed systems in heterogeneous and failure-prone environments.
Event Host: Chaeeun Kim, Ph.D. Student, Computer Science & Engineering
Advisors: Chen Qian & Liting Hu
Zoom: https://ucsc.zoom.us/j/9863615188?pwd=kTka0aZXJ070tor1EKvrt3X6AveBRp.1
Passcode: cG5SL8