Kim, C. (CSE) - Toward Adaptive Graph Processing and Fault-Tolerant Agentic Inference on Heterogeneous Distributed Systems
Edge computing and distributed AI systems increasingly operate under heterogeneous resources, dynamic workloads, and frequent failures, requiring both adaptivity and fault tolerance for efficient execution. In heterogeneous edge clusters, nodes differ significantly in CPU throughput, memory capacity, and network bandwidth, while modern distributed GPU clusters supporting agentic LLM inference must recover large amounts of runtime […]