Please note: This master’s thesis presentation will take place online.
Petar Basta, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Khuzaima Daudjee
While it is known that quantization of large language models (LLMs) reduces memory usage via lower-bit weights, studies quantifying the resulting impact on energy usage are scarce. We present the first holistic comparison of weight-only quantization strategies for LLM inference across varying input/output lengths and GPUs, highlighting differences in energy efficiency in addition to accuracy degradation and runtime. We study quantized inference on Llama 2 7B, Phi 3.5 Mini Instruct, and Qwen 2.5 7B across GLUE MNLI, MMLU, HumanEval, and GSM8K, evaluating 10 post-training quantization (PTQ) strategies on NVIDIA H100, A6000, A100, and L4 GPUs.
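To make the setup concrete, the following is a minimal sketch (not the thesis's actual evaluation harness) of loading one of the studied models with one of the named weight-only PTQ strategies, Bitsandbytes int8, through the Hugging Face Transformers API. The model ID, prompt, and generation settings are illustrative assumptions only.

```python
# Minimal sketch: 8-bit weight-only quantized inference via bitsandbytes.
# Model ID, prompt, and generation settings are illustrative assumptions,
# not the configuration used in the thesis.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # one of the three evaluated models

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # weight-only int8 PTQ

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Explain post-training quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```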
We highlight four key findings. First, quantization techniques tend to exhibit peak energy efficiency relative to full-precision baselines when inputs are sufficiently long and outputs are short. Second, their energy efficiency relative to full-precision models improves only modestly as batch sizes increase. Third, fused-kernel implementations such as EETQ int8 and Bitsandbytes int8 offer the highest energy savings, up to 4× compared to FP32 on short text generation, with negligible accuracy loss. Lastly, energy usage closely tracks runtime on our evaluated benchmarks, indicating that, in practice, latency optimization can serve as a reliable proxy for energy efficiency. We report on how quantization impacts energy consumption and suggest directions for selecting energy-conscious quantization strategies.
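As a hedged illustration of the energy-versus-runtime comparison, the sketch below measures GPU energy and latency around a single inference call using NVML's cumulative energy counter via pynvml. This is one common measurement approach assumed here for illustration; the thesis does not specify its instrumentation in this abstract, and `generate_batch` is a hypothetical placeholder for an inference call.

```python
# Hedged sketch: joint energy/latency measurement around an inference call,
# using NVML's cumulative energy counter (reported in millijoules).
# This is an assumed measurement approach, not the thesis's instrumentation.
import time
import pynvml

def measure(fn, device_index=0):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    try:
        e0 = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)  # mJ since driver load
        t0 = time.perf_counter()
        result = fn()
        latency_s = time.perf_counter() - t0
        energy_j = (pynvml.nvmlDeviceGetTotalEnergyConsumption(handle) - e0) / 1000.0
    finally:
        pynvml.nvmlShutdown()
    return result, latency_s, energy_j

# Example usage with a hypothetical generation function:
# _, latency, joules = measure(lambda: generate_batch(prompts))
# print(f"{latency:.2f} s, {joules:.1f} J, avg power {joules / latency:.0f} W")
```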