PhD Seminar • Systems and Networking • Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters

Friday, July 25, 2025 1:00 pm - 2:00 pm EDT (GMT -04:00)

Please note: This PhD seminar will take place in DC 1304.

Runsheng (Benson) Guo, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Khuzaima Daudjee

Training LLMs is extremely resource-intensive, requiring massive GPU compute. While many organizations can’t afford large, homogeneous GPU clusters, they often have access to a mix of older and newer GPUs across datacenters. Leveraging this heterogeneous compute is promising but comes with many challenges. Efficient utilization requires balancing computational workloads across different GPUs, handling uneven memory capacities, and optimizing communication over complex network topologies.

In this talk, I’ll present Zorse, a system designed to efficiently scale LLM training on heterogeneous GPU clusters. Zorse integrates pipeline and data parallelism in a way that is both communication- and memory-efficient, while flexibly supporting configurations with asymmetric pipeline stages and a mix of GPU types within data-parallel groups. It also includes a planner that automatically selects an efficient training configuration based on the cluster and the training workload. I’ll share the key design insights behind Zorse and show how it outperforms existing systems across diverse heterogeneous training scenarios.
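To give a flavour of the kind of load balancing such a planner must perform, the sketch below assigns transformer layers to pipeline stages in proportion to each GPU's relative throughput, so faster GPUs host more layers. This is a minimal, illustrative example only: the function, GPU names, and throughput numbers are hypothetical and do not reflect Zorse's actual planning algorithm, which also accounts for memory capacities and network topology.

    # Illustrative only: naive proportional layer assignment for a heterogeneous
    # pipeline. GPU names and throughput numbers are made up for this example;
    # this is NOT Zorse's planner.

    def assign_layers(num_layers, gpu_throughputs):
        """Split num_layers transformer layers into pipeline stages, giving each
        GPU a share proportional to its relative throughput (largest-remainder
        rounding so the stage sizes sum exactly to num_layers)."""
        total = sum(gpu_throughputs.values())
        raw = {g: num_layers * t / total for g, t in gpu_throughputs.items()}
        stages = {g: int(share) for g, share in raw.items()}
        # Hand leftover layers to the GPUs with the largest fractional remainders.
        leftover = num_layers - sum(stages.values())
        for g in sorted(raw, key=lambda g: raw[g] - stages[g], reverse=True)[:leftover]:
            stages[g] += 1
        return stages

    if __name__ == "__main__":
        # Hypothetical mixed cluster: one newer GPU and two older ones.
        cluster = {"A100": 3.0, "V100-a": 1.5, "V100-b": 1.5}
        print(assign_layers(num_layers=48, gpu_throughputs=cluster))
        # -> {'A100': 24, 'V100-a': 12, 'V100-b': 12}

Even this toy version shows why heterogeneous clusters need asymmetric pipeline stages: giving every GPU the same number of layers would leave the faster GPUs idle while the slower ones become the bottleneck.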