Huawei Cloud unveiled a breakthrough computing technology at its SME AI Solutions Conference: the FlexNPU Flexible Intelligent Computing Operating System, designed to rein in soaring token consumption and deliver optimal cost-performance for enterprise AI agents in the Agentic era.
At the AI infrastructure layer, Huawei Cloud offers Ascend series products and its self-developed AI Infra OS. FlexNPU’s flexible computing technology meets small-model training needs for SMEs while boosting resource utilization through elastic scheduling.
At the model service layer, Huawei Cloud supports mainstream open-source models, enabling businesses to select models tailored to their needs or fine-tune proprietary models at low cost. At the agent platform layer, Huawei Cloud provides efficient development environments to help SMEs build enterprise-grade AI agents. At the application layer, Huawei Cloud collaborates with partners across analytics, marketing, collaboration, DevOps, and content creation.
Huawei Cloud Fellow and Chief Architect Gu JiongJiong highlighted a critical pain point: average inference pool utilization is below 30%, leaving much of the costly AI hardware idle. He stressed that in the Agent era, autonomous planning, multi-round iterations, and long contexts drive exponential token growth, making cost reduction the most urgent challenge.
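The super-linear growth Gu describes follows from simple arithmetic: each agent round re-reads the whole accumulated context before producing new output. A minimal sketch (the token counts are illustrative assumptions, not Huawei Cloud data):

```python
# Illustrative arithmetic: why multi-round agent loops with growing
# context drive super-linear token consumption. Numbers are assumed.
def total_tokens(rounds, base_context=2_000, step_output=1_500):
    """Each round re-reads the accumulated context, then appends its output."""
    total = 0
    context = base_context
    for _ in range(rounds):
        total += context + step_output   # prompt tokens + generated tokens
        context += step_output           # output feeds back into the next prompt
    return total

print(total_tokens(1))    # single-shot query: 3500 tokens
print(total_tokens(10))   # 10-round agent loop: 102500 tokens, ~29x, not 10x
```

Because the context term grows every round, total consumption scales quadratically with the number of rounds rather than linearly.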

FlexNPU addresses this by enabling flexible, liquid-like allocation of NPU/GPU resources, dynamically adapting to business needs. Key attributes include:
- Extreme Sharing & Elasticity: Innovative PD dynamic scheduling and mixed offline inference reduce idle compute, boosting token cost-efficiency by at least 40%.
- Small-Model Optimization: Fine-grained resource slicing down to 1% of an NPU card and 128 MB of memory lowers average compute costs by 2–3x.
- High Availability: Token-level KV cache snapshots enable rapid recovery from hardware faults, minimizing recomputation and improving inference continuity.
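The fine-grained slicing described above can be pictured as quantized resource requests: a small-model job asks for a fraction of a card rather than a whole one. A minimal sketch, assuming the 1%-of-card and 128 MB granularities from the article (class and field names are hypothetical, not the FlexNPU API):

```python
from dataclasses import dataclass

# Granularities taken from the article; everything else is illustrative.
NPU_GRANULARITY = 0.01        # 1% of one NPU card
MEM_GRANULARITY_MB = 128      # 128 MB memory slices

@dataclass
class NpuSlice:
    compute_fraction: float   # fraction of one card, e.g. 0.05 = 5%
    memory_mb: int

    def __post_init__(self):
        if not (NPU_GRANULARITY <= self.compute_fraction <= 1.0):
            raise ValueError("compute fraction must be between 1% and a full card")
        if self.memory_mb % MEM_GRANULARITY_MB != 0:
            raise ValueError("memory must be a multiple of 128 MB")

# A small-model serving job might request 5% of a card and 512 MB,
# letting one physical card host many such jobs instead of sitting idle.
job_slice = NpuSlice(compute_fraction=0.05, memory_mb=512)
```

The cost saving comes from packing: at 5% granularity, up to 20 such jobs share a card that would otherwise be monopolized by one.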
Gu likened FlexNPU to the “Ruyi Jingu Bang” (magical staff), emphasizing its ability to expand or contract compute power at will, ultimately maximizing AI infrastructure efficiency and optimizing token economics for enterprises.
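The token-level KV cache snapshot mechanism from the high-availability bullet can be sketched as periodic checkpointing of the cache, so a hardware fault forces recomputation only back to the last snapshot rather than to token zero. This is an illustrative sketch under that interpretation, not the FlexNPU implementation:

```python
# Hypothetical sketch: checkpoint the KV cache every N tokens so fault
# recovery resumes from the latest snapshot instead of recomputing all tokens.
class KvCacheWithSnapshots:
    def __init__(self, snapshot_every=64):
        self.snapshot_every = snapshot_every
        self.cache = []          # one (key, value) entry per generated token
        self.snapshots = {}      # token position -> copy of the cache

    def append(self, kv_entry):
        self.cache.append(kv_entry)
        if len(self.cache) % self.snapshot_every == 0:
            # In practice snapshots would go to host memory or remote storage.
            self.snapshots[len(self.cache)] = list(self.cache)

    def recover(self):
        """After a fault, restore the latest snapshot; returns tokens saved."""
        if not self.snapshots:
            self.cache = []
            return 0
        pos = max(self.snapshots)
        self.cache = list(self.snapshots[pos])
        return pos               # these tokens do NOT need recomputing

cache = KvCacheWithSnapshots(snapshot_every=64)
for t in range(200):
    cache.append((f"k{t}", f"v{t}"))
resume_at = cache.recover()      # 192: only tokens 192-199 are recomputed
```

With a snapshot every 64 tokens, worst-case recomputation after a fault is bounded at 63 tokens regardless of how long the sequence has grown, which is what makes long-context inference resilient to hardware faults.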




