Please note: This PhD seminar will take place in DC 1304.
Runsheng (Benson) Guo, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Khuzaima Daudjee
Training LLMs is extremely resource-intensive, requiring massive GPU compute. While many organizations can't afford large, homogeneous GPU clusters, they often have access to a mix of older and newer GPUs across datacenters. Leveraging this heterogeneous compute is promising but challenging: efficient utilization requires balancing computational workloads across GPUs of differing speeds, handling uneven memory capacities, and optimizing communication over complex network topologies.
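To illustrate the workload-balancing challenge, here is a minimal sketch (not taken from the talk) of throughput-proportional layer assignment, assuming each GPU's relative speed is known; the GPU names and throughput figures below are purely illustrative.

```python
# Hypothetical sketch: assign transformer layers to GPUs in proportion to
# assumed relative throughputs (numbers are illustrative, not measured).

def partition_layers(num_layers: int, throughputs: dict[str, float]) -> dict[str, int]:
    """Give each GPU a share of layers proportional to its throughput."""
    total = sum(throughputs.values())
    # Initial proportional allocation, rounded down.
    shares = {gpu: int(num_layers * t / total) for gpu, t in throughputs.items()}
    # Hand any leftover layers to the fastest GPUs first.
    leftover = num_layers - sum(shares.values())
    for gpu in sorted(throughputs, key=throughputs.get, reverse=True)[:leftover]:
        shares[gpu] += 1
    return shares

# Example: assume an A100 processes roughly 3x the layers per second of a V100.
print(partition_layers(48, {"A100-0": 3.0, "A100-1": 3.0, "V100-0": 1.0, "V100-1": 1.0}))
# -> {'A100-0': 18, 'A100-1': 18, 'V100-0': 6, 'V100-1': 6}
```

Even this simple balancing ignores memory limits and interconnect bandwidth, which is why the problem quickly becomes a joint planning task.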
In this talk, I’ll present Zorse, a system designed to scale LLM training efficiently on heterogeneous GPU clusters. Zorse integrates pipeline and data parallelism in a way that is both communication- and memory-efficient, while flexibly supporting configurations with asymmetric pipeline stages and a mix of GPU types within data-parallel groups. It also includes a planner that automatically selects an efficient training configuration for a given cluster and training workload. I’ll share the key design insights behind Zorse and show how it outperforms existing systems across diverse heterogeneous training scenarios.
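As a rough intuition for what such a planner optimizes (this is a hedged sketch under simplified assumptions, not Zorse's actual algorithm or cost model), one can enumerate ways of grouping GPUs into pipeline stages and keep the configuration whose slowest stage is fastest:

```python
# Hypothetical planner sketch: brute-force search over ways to split a set of
# GPUs (given assumed relative throughputs) into pipeline stages, scoring each
# split with a placeholder cost model. Real planners prune this search space
# and model memory and communication as well.

from itertools import permutations

def estimate_throughput(stages: list[list[float]]) -> float:
    """A pipeline is bounded by its slowest stage; a stage's data-parallel
    group is scored as the sum of its members' assumed throughputs."""
    return min(sum(group) for group in stages)

def plan(gpu_throughputs: list[float], num_stages: int) -> list[list[float]]:
    """Return the contiguous split (over some GPU ordering) with the best score."""
    best, best_rate = None, 0.0
    for order in set(permutations(gpu_throughputs)):
        for split in _splits(list(order), num_stages):
            rate = estimate_throughput(split)
            if rate > best_rate:
                best, best_rate = split, rate
    return best

def _splits(items: list[float], parts: int) -> list[list[list[float]]]:
    """All ways to cut a list into the given number of contiguous groups."""
    if parts == 1:
        return [[items]]
    out = []
    for i in range(1, len(items) - parts + 2):
        out.extend([[items[:i]] + rest for rest in _splits(items[i:], parts - 1)])
    return out

# Example: four GPUs with assumed relative speeds, split into two pipeline stages;
# the best plan pairs a fast and a slow GPU in each stage, e.g. [[3.0, 1.0], [3.0, 1.0]].
print(plan([3.0, 3.0, 1.0, 1.0], num_stages=2))
```

The talk describes how Zorse makes this kind of decision efficiently and at scale, rather than by exhaustive search as in the toy example above.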