Yilun Luo, HuaQing Zheng, Haoqian Meng, Wenyuan Liu, Peng Zhang
The paper presents a low-bit quantization framework for efficient deployment of openPangu models on Ascend NPUs, achieving significant memory and speed improvements while maintaining accuracy.
This research focuses on making large language models, specifically Huawei's openPangu models, more efficient for practical deployment. These models are built for strong reasoning but carry high memory and compute costs. By representing the models' numerical values with fewer bits per value, a technique known as low-bit quantization, the researchers reduced those demands. The quantized models run faster and use less memory without significantly compromising accuracy.
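The paper's actual quantization pipeline for Ascend NPUs is not reproduced here; as a minimal illustrative sketch of the core idea, the snippet below applies symmetric per-tensor INT8 quantization (one common low-bit scheme, assumed for illustration rather than taken from the paper) to a random weight matrix, showing the 4x storage reduction relative to float32 and the bounded reconstruction error.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct an approximate float32 tensor from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32 for the same tensor shape
print(w.nbytes // q.nbytes)  # 4
# rounding error per element is at most half a quantization step
print(bool(np.abs(dequantize(q, scale) - w).max() <= 0.5 * scale))  # True
```

Lower bit widths (e.g. INT4) shrink memory further but make the rounding error, and hence the accuracy trade-off, larger, which is why careful calibration matters in practice.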