Hybrid Event

Oh, S. (CSE) – Efficient Instruction Supply for Datacenter Processors

May 28 @ 11:00 am - 12:00 pm

Modern datacenter CPUs lose 25–66% of execution cycles to instruction-delivery stalls. This bottleneck persists despite the recent trend towards accelerators and GPUs, because demand continues from applications that execute only on CPUs. Two workload classes dominate today’s datacenter execution cycles: hyperscale server software (databases, build systems, and content stores), whose large instruction footprints create severe frontend pathologies; and agentic AI systems, in which large-language-model agents plan, dispatch tools, and maintain growing conversational contexts, causing CPUs to account for up to 88% of end-to-end agent latency. Reflecting this shift, major CPU vendors have publicly repositioned the CPU as the orchestration layer of the AI stack and have begun shipping processors optimized for agent-centric workloads.

This dissertation argues that instruction delivery is the dominant CPU bottleneck across both workload classes and that the recent trend towards agentic AI exacerbates this challenge. In hyperscale server binaries, the primary pathologies are wrong-path prefetch pollution and post-recovery instruction-delivery gaps across large, irregular call graphs. In agentic AI systems, the bottleneck shifts to an orchestration substrate, composed of protocol stacks, dynamic-runtime dispatch, and agent-specific extensions, that is even more frontend-bound than traditional warehouse-scale workloads.

To address these bottlenecks, this dissertation presents three technical contributions, together with a companion infrastructure contribution. First, Utility-Driven Prefetching (UDP) extends fetch-directed instruction prefetching (FDIP) with a learned per-prefetch utility model that admits candidates based on their historical contribution to demand-fetch hits, including those reached along wrong-path execution. Second, Junction-based Unified Miss-point Prefetching (JUMP) addresses the post-recovery instruction-delivery gap that UDP and prior FDIP optimizations cannot reach by launching a lightweight secondary FDIP thread at a learned miss point following each branch-prediction failure. Across a suite of datacenter workloads, UDP improves IPC by 3.6% on average (up to 16.1%) over a state-of-the-art FDIP baseline, while JUMP improves IPC by 2.0% on average (up to 14.9%). Combined, the two mechanisms substantially close the gap between FDIP and a perfect L1 instruction cache at a storage cost of only a few tens of kilobytes.

Third, this dissertation introduces the Agentic Tax, the first CPU characterization study of agentic AI workloads across three runtime families. The study is packaged as a deterministic-replay benchmark infrastructure that enables repeatable, cycle-level evaluation under controlled conditions. The characterization shows that the orchestration substrate of agentic AI workloads is significantly more frontend-bound than the hyperscale datacenter workloads examined in prior work, and that it introduces new dominant function families with no analog in traditional warehouse-scale systems. These findings motivate two architectural directions proposed as future work: extending UDP and JUMP to optimize the orchestration substrate itself, and designing heterogeneous CPU cores that allocate frontend resources according to the execution phase.

Event Host: Surim Oh, Ph.D. Candidate, Computer Science & Engineering 

Advisor: Heiner Litz

Zoom: https://ucsc.zoom.us/j/94753352649?pwd=7vQxlnSJkUb0KfG3t6STo639LhRv7j.1

Passcode: 205162

Details

Room Number
E2-399

Venue