Carnegie Mellon University

May 14, 2026

Test of Time Award

By Krista Burns

Five Parallel Data Lab researchers were honored with the prestigious Test of Time Award during the 2026 EuroSys Conference: Henggang Cui, engineering manager at Latitude AI and a Carnegie Mellon University electrical and computer engineering alumnus; Hao Zhang, a postdoctoral researcher at Carnegie Mellon University; Gregory Ganger, the Jatras Professor of Electrical and Computer Engineering; Phillip Gibbons, professor of electrical and computer engineering and computer science; and Eric Xing, professor of computer science.

Held in Edinburgh on April 27-30, the EuroSys conference is a premier forum for systems software research and development, including its implications for hardware and applications. The conference brought together professionals from academia and industry. Its focus spans operating systems, database systems, real-time systems, networked systems, storage systems, and middleware, as well as distributed, parallel, and embedded computing systems.

The team received the Test of Time Award for their paper, “GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server.”

The paper addresses the challenges of scaling deep learning training across distributed GPU systems. While GPUs provide significant computational advantages over CPUs, existing multi-GPU and multi-machine setups suffer from inefficiencies such as high data movement overhead and limited memory capacity. To overcome these limitations, the authors proposed GeePS, a distributed parameter server architecture designed to efficiently coordinate data and model updates across GPUs in multiple machines, enabling scalable training of large neural networks with billions of parameters.

Experimental results demonstrated that GeePS significantly improved training throughput and scalability. The system achieved up to a 13x increase in processed training images per second when scaling from a single optimized GPU implementation to 16 machines. Additionally, GeePS enabled a small cluster of four GPU-equipped machines to surpass the performance of a state-of-the-art CPU-only system running on 108 machines. These findings highlight the effectiveness of GeePS in reducing communication bottlenecks and improving resource utilization in distributed deep learning.

“We set out to remove the bottlenecks that made large-scale GPU training inefficient at the time, and made distributed deep learning truly practical,” says the team. “It’s incredibly rewarding to see that vision recognized with the Test of Time Award and reflected in today’s AI systems.”