How Giant AI Workloads and the Looming “Bandwidth Wall” Are Impacting System Architectures

When it comes to how fast artificial intelligence (AI) models can continue to grow, the sky is not the limit; system design is. As researchers continue to push boundaries with conversational AI, computer vision, recommender systems, and other workloads, AI models with hundreds of trillions of parameters may soon be commonplace.

Without significant architectural advancements, however, current system designs will not be able to keep up. Part of the problem is that the largest AI workloads have reached the point where they are pushing the physical limits of standard electrical interconnects. Meanwhile, memory capacity limitations, bottlenecks, and stranded resources in today’s systems are amplifying performance and efficiency losses at scale, spurring the need for rapid and significant technology innovation in scale-out systems.

The Runaway Growth in AI Workloads

Over the past few years, we’ve seen breakneck advancements in AI. Just consider that in 2019, Transformer, the biggest natural language processing (NLP) model, had 465 million parameters, or fewer synapses than a honeybee. By mid-2020, GShard MoE included more than a trillion parameters, or roughly the same number of synapses as a mouse. And NVIDIA has projected that we could see models with 100 trillion or more connections (roughly as many synapses as a macaque) in 2023. If progress continues at its current rate, models with human-level synapse counts won’t be far off. But only if our computing infrastructure is ready.

Keeping up with rapid AI model growth will require significant increases in computational throughput. That means either adding nodes or increasing the speed of communication between nodes. The problem is that even in today’s most powerful systems, cross-fabric bandwidth is relatively low, at roughly hundreds of gigabits per second (Gbps). And unless bandwidth improves, returns will diminish rapidly with further scale-out.
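
To put “hundreds of Gbps” in perspective, here’s a rough back-of-envelope sketch. The model size, gradient precision, and link speed below are illustrative assumptions, not figures from any specific system:

    # Rough back-of-envelope: time to move one copy of a large model's
    # gradients across a single "hundreds of Gbps" link each training step.
    # All numbers are illustrative assumptions.
    params = 1e12                  # hypothetical 1-trillion-parameter model
    bytes_per_param = 2            # fp16 gradients
    link_gbps = 400                # assumed per-node fabric bandwidth
    seconds = params * bytes_per_param / (link_gbps * 1e9 / 8)
    print(f"~{seconds:.0f} s per step just to move gradients once")  # ~40 s

Even at 400 Gbps, a single exchange of trillion-parameter gradients takes tens of seconds, which is why fabric bandwidth, rather than raw compute, increasingly sets the pace.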

Interconnect bottlenecks only become more problematic as researchers push machines to run more experiments and more all-to-all connectivity is needed. Although compute requirements remain fairly constant, information exchange as a proportion of overall runtime increases dramatically, pushing the limits of fabric capacity and making further scaling impractical. It’s not unlike replicating human brains to solve a problem: two brains may be better than one, but they won’t be more efficient at solving a twice-as-complex problem without a proper interconnect between them. The same issue arises when scaling out nodes with copper-based components, which are approaching their limits not only in bandwidth but also in cost, power, density, weight, and configurability.
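
A toy weak-scaling model makes the trend concrete. In the sketch below, per-node compute time stays fixed while each node exchanges a fixed volume of data with every other node per step; the compute time, exchange volume, and link speed are all assumptions chosen only to illustrate the shape of the curve:

    # Toy weak-scaling model: per-node compute stays constant while each
    # node exchanges a fixed volume with every other node per step, so
    # communication's share of runtime climbs with node count.
    # All parameters are illustrative assumptions.
    def comm_fraction(nodes, compute_s=1.0, bytes_per_pair=1e8, link_gbps=400):
        link_bytes_per_s = link_gbps * 1e9 / 8
        comm_s = (nodes - 1) * bytes_per_pair / link_bytes_per_s
        return comm_s / (compute_s + comm_s)

    for n in (8, 64, 512, 4096):
        print(f"{n:>5} nodes: communication is {comm_fraction(n):.0%} of each step")

Under these assumptions, communication grows from about 1% of each step at 8 nodes to roughly 90% at 4,096 nodes, even though no single node is doing more compute.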

It’s not just copper constraints that pose obstacles in today’s architectures; there are also tight coupling requirements stemming from AI’s reliance on GPU-HBM (high-bandwidth memory) and GPU-GPU communication paradigms. In today’s systems, GPUs must go through the CPU to access DRAM, increasing latency and severely impairing the utility of DRAM capacity.

Educated guesses about the future machine sizes that will be required to support artificial general intelligence (AGI), human-level translation, and beyond put throughput challenges in stark relief. A 10⁵x (100,000x) improvement, for example, would require 1,000x more nodes that are 100x faster at a minimum. The alternative of 10,000x more nodes that are 10x faster is simply impractical due to the enormous space and cost implications.
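
Framed as arithmetic, the aggregate gain is simply node count times per-node speed, so the two paths to 10⁵x trade off very differently. A minimal sketch of the trade-off described above:

    # The two scale-out paths to a 10^5x aggregate throughput gain:
    # fewer-but-faster nodes vs. many-but-slower nodes.
    target = 1e5
    paths = [(1_000, 100),    # 1,000x more nodes, each 100x faster
             (10_000, 10)]    # 10,000x more nodes, each 10x faster
    for node_factor, speed_factor in paths:
        gain = node_factor * speed_factor
        print(f"{node_factor:>6,}x nodes * {speed_factor:>3}x speed = {gain:,.0f}x"
              f" ({'meets' if gain >= target else 'misses'} the 10^5x target)")

Both paths reach the same 10⁵x total, but the second multiplies floor space, power, and cost by an order of magnitude more, which is what makes faster nodes, and faster links between them, the only practical route.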

Pivoting to Optical I/O and New Architectures

Moving forward, vertically integrated vendors and hyperscalers see a transition to photonics, or optical I/O, as the best way to overcome interconnect bottlenecks and maximize the potential of scale-out systems. Optical I/O not only promises higher bandwidth and lower latency; it can also underpin the platform flexibility needed for new architectures. For example, in-package optical interconnect solutions and disaggregated architectures that decouple memory from accelerators and processors could free the stranded resources and ease the tight coupling described above.

In the upcoming webinar, Meeting the Bandwidth Demands of Next-Gen HPC & AI System Architectures, on November 9 at 9 am PST, leading AI/HPC system design and solution experts will discuss the way forward.

Join us as our panel explains how performance and efficiency losses from memory capacity limitations, network bottlenecks, and stranded resources are being amplified at scale, creating a need for new system architectures. And hear about what they’re discovering as they define and develop architectures capable of supporting next-generation AI models.

This discussion will feature experts from Ayar Labs, NVIDIA, the Department of Energy/National Nuclear Security Administration, and Liqid.