Please note: This PhD seminar will take place in DC 1304.
Runsheng (Benson) Guo, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Khuzaima Daudjee
Training LLMs is extremely resource-intensive, requiring massive GPU compute. While many organizations can't afford large, homogeneous GPU clusters, they often have access to a mix of older and newer GPUs across datacenters. Leveraging this heterogeneous compute is promising but challenging: efficient utilization requires balancing computational workloads across GPUs of differing speeds, handling uneven memory capacities, and optimizing communication over complex network topologies.
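To illustrate the workload-balancing challenge, here is a minimal sketch (not taken from the talk) of throughput-proportional layer assignment, assuming each GPU's relative speed is known; the GPU names and throughput figures below are purely illustrative.

```python
# Hypothetical sketch: assign transformer layers to GPUs in proportion to
# assumed relative throughputs (numbers are illustrative, not measured).

def partition_layers(num_layers: int, throughputs: dict[str, float]) -> dict[str, int]:
    """Give each GPU a share of layers proportional to its throughput."""
    total = sum(throughputs.values())
    # Initial proportional allocation, rounded down.
    shares = {gpu: int(num_layers * t / total) for gpu, t in throughputs.items()}
    # Hand any leftover layers to the fastest GPUs first.
    leftover = num_layers - sum(shares.values())
    for gpu in sorted(throughputs, key=throughputs.get, reverse=True)[:leftover]:
        shares[gpu] += 1
    return shares

# Example: assume an A100 processes roughly 3x the layers per second of a V100.
print(partition_layers(48, {"A100-0": 3.0, "A100-1": 3.0, "V100-0": 1.0, "V100-1": 1.0}))
# -> {'A100-0': 18, 'A100-1': 18, 'V100-0': 6, 'V100-1': 6}
```

Even this simple balancing ignores memory limits and interconnect bandwidth, which is why the problem quickly becomes a joint planning task.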
In this talk, I’ll present Zorse, a system designed to scale LLM training efficiently on heterogeneous GPU clusters. Zorse integrates pipeline and data parallelism in a way that is both communication- and memory-efficient, while flexibly supporting configurations with asymmetric pipeline stages and a mix of GPU types within data-parallel groups. It also includes a planner that automatically selects an efficient training configuration for a given cluster and training workload. I’ll share the key design insights behind Zorse and show how it outperforms existing systems across diverse heterogeneous training scenarios.
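As a rough intuition for what such a planner optimizes (this is a hedged sketch under simplified assumptions, not Zorse's actual algorithm or cost model), one can enumerate ways of grouping GPUs into pipeline stages and keep the configuration whose slowest stage is fastest:

```python
# Hypothetical planner sketch: brute-force search over ways to split a set of
# GPUs (given assumed relative throughputs) into pipeline stages, scoring each
# split with a placeholder cost model. Real planners prune this search space
# and model memory and communication as well.

from itertools import permutations

def estimate_throughput(stages: list[list[float]]) -> float:
    """A pipeline is bounded by its slowest stage; a stage's data-parallel
    group is scored as the sum of its members' assumed throughputs."""
    return min(sum(group) for group in stages)

def plan(gpu_throughputs: list[float], num_stages: int) -> list[list[float]]:
    """Return the contiguous split (over some GPU ordering) with the best score."""
    best, best_rate = None, 0.0
    for order in set(permutations(gpu_throughputs)):
        for split in _splits(list(order), num_stages):
            rate = estimate_throughput(split)
            if rate > best_rate:
                best, best_rate = split, rate
    return best

def _splits(items: list[float], parts: int) -> list[list[list[float]]]:
    """All ways to cut a list into the given number of contiguous groups."""
    if parts == 1:
        return [[items]]
    out = []
    for i in range(1, len(items) - parts + 2):
        out.extend([[items[:i]] + rest for rest in _splits(items[i:], parts - 1)])
    return out

# Example: four GPUs with assumed relative speeds, split into two pipeline stages;
# the best plan pairs a fast and a slow GPU in each stage, e.g. [[3.0, 1.0], [3.0, 1.0]].
print(plan([3.0, 3.0, 1.0, 1.0], num_stages=2))
```

The talk describes how Zorse makes this kind of decision efficiently and at scale, rather than by exhaustive search as in the toy example above.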