Published 1 August 2025
Written by Tim Weekes, Senior Consultant
Tim trained as a journalist and wrote for professional B2B publications before joining TKO in 1998. In his time at TKO, Tim has worked in various client service roles, helping electronics companies to achieve success in PR, advertising, lead generation and digital marketing campaigns. He now supports clients with strategic messaging, the writing of technical and marketing promotional materials, and the creation of videos and podcasts. Tim has a BA (Hons) degree and a Diploma in Direct Marketing.
Why power efficiency matters in edge AI systems
For ordinary consumers, the most eye-catching example of artificial intelligence (AI) is apps such as ChatGPT or Gemini – natural-language software tools which can answer any question the user chooses to pose. These AI query engines run in the cloud, and are hosted in huge, power-hungry data centers.
The AI applications embedded in electronics devices attract much less public attention, but perform huge amounts of work in the real world.
In industrial equipment, AI enables autonomous last-mile delivery robots to recognize objects in their environment and to navigate safely in public areas.
In consumer devices, AI enables activity tracking wristbands to assess the wearer’s health status by observing patterns in vital signs such as heart rate, blood pressure and blood oxygen saturation. AI is also the technology in smart glasses which detects objects of interest in the field of view and provides context-aware information about them.
In many cases, the device manufacturer will prefer to implement these AI functions on-device, at the edge of the network, rather than in the cloud. Edge implementation offers multiple benefits over AI processing in the cloud:
- Lower latency
- Reduced data transport and storage costs
- More privacy and security protection
- Lower power consumption – reducing or eliminating wireless data transmissions saves energy

Cloud-based models: too big for edge devices?
The accuracy and insight achieved by cloud-based AI models have been rapidly improving – and most of the improvement is achieved by increasing their size: in general, training a model on more data, and increasing the number of parameters on which it operates, improve its performance. Bigger is better.
In the cloud, there is almost no constraint on the size of an AI model or the amount of power required to run inferencing operations.
The opposite applies at the edge. Embedded devices have severely constrained hardware resources:
- Less processing power
- Less memory capacity
- Less power, particularly if supplied by a battery. In any case, limited thermal dissipation capability often constrains the system’s ability to consume power.
This means that, at the edge, devices cannot simply run the same models as are deployed in the cloud – they need streamlined models which fit the hardware constraints of devices at the edge, and which support low-power inferencing.
The purpose of this article is to outline practicable edge AI model optimization strategies, such as quantization and neural network pruning, to enable edge AI devices to perform low-power inference. Using these techniques, edge devices can take advantage of AI while respecting the limits of the compute and energy resources in typical embedded computing applications.
Model selection for efficient inference
The large language models (LLMs) which provide the basis for generative AI apps such as ChatGPT, Gemini and DeepSeek are intended to be universal knowledge engines: the principle of their operation is to draw on all of humanity’s graphical and textual documented knowledge to be able to generate useful responses to any prompt. This means that these models are incredibly large, occupying multiple gigabytes of memory space to handle inferencing operations.
For edge AI and embedded devices, the requirements are completely different: edge AI devices have a limited set of specific functions, and do not need to deploy universal knowledge – and in any case, their memory capacity is far too small to accommodate an LLM.
This means that design engineers need to choose carefully the model to be deployed to an edge AI device, to ensure that it has:
- A small memory footprint
- Functionality which is suited to the application
AI software developers have responded to this need by introducing various specialist models suitable for AI at the edge. These include:
MobileNet – an efficient Convolutional Neural Network (CNN) for mobile and embedded devices. MobileNet is used to perform object detection and image classification in facial recognition, augmented reality and other vision-based applications.
TinyML – not a single model, but an approach (and a supporting ecosystem of lightweight frameworks) for running machine learning on the most constrained edge and endpoint devices, especially those based on a microcontroller (MCU). Typical use cases for TinyML-based systems include visual wake words and other computer vision tasks, keyword spotting, gesture recognition, and predictive maintenance in industrial machines.
SqueezeNet – this deep neural network performs accurate image classification with far fewer parameters than other specialized classification models. With the implementation of deep compression, the size of the SqueezeNet parameter file can be cut from 5MB to just 500kB while maintaining good accuracy.
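As a simple illustration of how model selection translates into memory savings, the Python sketch below instantiates MobileNetV2 with a reduced width multiplier and input resolution and compares its parameter count with the full-size network. The alpha value, input size and class count are illustrative assumptions rather than recommendations.

```python
import tensorflow as tf

# MobileNetV2 shrunk for an embedded vision task: a width multiplier (alpha)
# of 0.35 and a 96 x 96 input cut the parameter count dramatically compared
# with the full-size network. All values here are illustrative assumptions.
small_model = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3),
    alpha=0.35,          # width multiplier: scales the number of filters per layer
    weights=None,        # train from scratch (or load compatible pretrained weights)
    classes=10,          # hypothetical number of target classes
)

full_model = tf.keras.applications.MobileNetV2(weights=None, classes=10)

print(f"Reduced model parameters:   {small_model.count_params():,}")
print(f"Full-size model parameters: {full_model.count_params():,}")
```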
As described above, in AI, bigger is normally better: in general, the larger the model, the more accurate its inference outputs will be, so there is a trade-off between model size and accuracy. Models designed for edge AI devices, however, apply streamlined methods for core neural network operations, such as feature extraction and filtering, which maintain high accuracy – though not as high as the largest cloud-based models can achieve.
On the other hand, the smaller size of edge AI models requires fewer compute operations, and so reduces latency, often producing inference outputs faster than complex cloud-based models do, even though a cloud system can call on the rich resources of a data center.

Decide whether AI is necessary to the application
In the effort to downsize AI models to fit the constraints of an edge device, it is easy to overlook one answer to AI’s requirement for large memory, compute and power resources: avoid using AI.
While AI offers exciting new capabilities that were not previously available to embedded devices, there are some problems which do not need an AI solution. The application of traditional sensor interfacing and control logic – the functions for which an edge device’s microcontroller is most obviously optimized – can provide predictable, real-time, deterministic outputs. If rules-based logic can meet an application’s requirement, it will result in lower power consumption, faster operation and a smaller memory footprint than any AI-based solution.
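As a sketch of the point, the snippet below implements a purely rules-based over-temperature alarm with hysteresis (the thresholds and the Python form are illustrative assumptions; on an MCU the same logic would typically be written in C). Logic of this kind is deterministic, needs no model file or training data, and runs comfortably on the smallest microcontroller.

```python
# Rules-based monitoring: no model, no training data, deterministic behaviour.
# Threshold values are hypothetical placeholders.
ALARM_ON_C = 85.0    # raise the alarm at or above this temperature
ALARM_OFF_C = 80.0   # clear it only once the reading falls back below this level

def update_alarm(temperature_c: float, alarm_active: bool) -> bool:
    """Return the new alarm state, using hysteresis to avoid chattering."""
    if alarm_active:
        return temperature_c >= ALARM_OFF_C
    return temperature_c >= ALARM_ON_C
```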
Quantization of AI models: reducing both precision and compute requirement
Quantization is a technique for reducing the size and computational requirements of deep learning models. This is achieved by reducing the precision of the model’s parameters – its weights and activations.
Quantization involves the conversion of high-precision data, such as 32-bit floating-point numbers, into less precise values, such as 8-bit integers. This has the effect of making the model more efficient, allowing it to run faster, use less memory, and consume less energy.
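To make the arithmetic concrete, the short NumPy sketch below applies the common affine mapping q = round(x / scale) + zero_point to a handful of made-up weight values, then dequantizes them to show the small rounding error that quantization introduces. The weights, and the scale and zero-point derived from them, are illustrative assumptions.

```python
import numpy as np

def quantize_int8(x, scale, zero_point):
    """Affine quantization: q = round(x / scale) + zero_point, clipped to the int8 range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Approximate recovery of the original value: x ~ (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

# Made-up float32 weights standing in for one tensor of a trained model
weights = np.array([0.42, -1.37, 0.05, 2.10], dtype=np.float32)

# Derive scale and zero-point from the observed value range (asymmetric scheme)
scale = (weights.max() - weights.min()) / 255.0
zero_point = int(round(-128 - weights.min() / scale))

q = quantize_int8(weights, scale, zero_point)
print("int8 values:  ", q)
print("reconstructed:", dequantize(q, scale, zero_point))
```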
The types of quantization
Two types of quantization can be implemented to reduce the size and computational load of a machine learning model:
Post-training quantization – this is performed after the model has been fully trained at standard 32-bit floating-point precision. It is a relatively simple and fast approach, but might require careful calibration to minimize the accuracy loss that results from reducing the precision of the model’s data.
Quantization-aware training – here, quantization is incorporated into the training process, allowing the model to learn how to adapt to lower precision. This often results in higher accuracy than post-training quantization achieves.
Developers who want to perform efficient edge inference can use one of these two quantization techniques to size their model for the compute and memory capacity available to them. To do so, they will need to use a machine learning framework which supports quantization: familiar examples include TensorFlow, PyTorch, ONNX Runtime, and Intel® Neural Compressor.
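As a sketch of what post-training quantization looks like in practice, the snippet below uses the TensorFlow Lite converter to produce a full-integer (int8) model. The saved-model path and the representative_samples calibration iterable are placeholders for the developer's own assets; the other frameworks listed above offer equivalent workflows.

```python
import tensorflow as tf

# Path to an already-trained model: a placeholder for the developer's own asset
SAVED_MODEL_DIR = "path/to/saved_model"

def representative_dataset():
    """Yield a few hundred real input samples so the converter can calibrate
    the int8 ranges of the activations."""
    for sample in representative_samples:   # hypothetical iterable of float32 arrays
        yield [sample]

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8     # fully integer I/O for MCU/NPU targets
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_int8_model)
```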
Pruning and model compression
Pruning of AI models reduces their size and the computational requirement imposed on the host system by removing unnecessary parameters or connections. Performed effectively, pruning can maintain the performance of a model while creating a more efficient AI system which is easier to deploy.
Pruning can be applied to individual weights, removing specific parameters below a specified threshold value. This is referred to as unstructured pruning.
In structured pruning, entire neurons, channels, or layers are removed.
In transformer network-based models, which underpin the operation of generative AI systems, attention heads are pruned.
Unstructured pruning offers more flexibility, but it produces sparse weight matrices and so needs sparsity-aware hardware or software to deliver real speed gains. Structured pruning is easier to implement, and provides greater certainty that inferencing will be faster on standard hardware. For 32-bit microcontrollers, structured pruning is generally the preferred method for achieving performance gains.
Pruning is supported by familiar AI frameworks, including TensorFlow/TensorFlow Lite, PyTorch, and ONNX Runtime.
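The sketch below shows both flavours of pruning using PyTorch's built-in pruning utilities on a small, made-up network; the layer sizes and pruning amounts are arbitrary assumptions chosen only to demonstrate the API.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy network standing in for a trained model (sizes are arbitrary)
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 4))

# Unstructured pruning: zero the 80% of weights with the smallest absolute value
prune.l1_unstructured(model[0], name="weight", amount=0.8)

# Structured pruning: zero out half of the output neurons (entire rows of the
# weight matrix) of the final layer, ranked by their L2 norm
prune.ln_structured(model[2], name="weight", amount=0.5, n=2, dim=0)

# Make the pruning permanent by folding the masks into the weight tensors
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")
```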
How pruning is performed
Magnitude-based pruning is the most common approach – parameters with the smallest absolute values are removed, based on the assumption that they contribute least to model output.
Gradual pruning removes parameters iteratively during training, allowing the model to adapt. This normally works better than removing everything at once.
Fine tuning follows pruning. The purpose is to help the model recover lost performance by adjusting the remaining parameters.
The art of model pruning is to achieve a good enough compression ratio without compromising performance. Developers can expect to reduce the number of parameters by between 70% and 90% with minimal loss of inferencing accuracy.
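The sketch below shows gradual, magnitude-based pruning combined with fine-tuning, using the TensorFlow Model Optimization Toolkit; the Keras model, the training data and the 20%-to-80% sparsity schedule are assumptions chosen for illustration.

```python
import tensorflow_model_optimization as tfmot

# 'model', 'x_train' and 'y_train' are placeholders for the developer's own
# trained Keras model and dataset.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.2,   # start by zeroing 20% of the weights...
    final_sparsity=0.8,     # ...and ramp gradually up to 80% sparsity
    begin_step=0,
    end_step=2000,
)

pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)
pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])

# Fine-tuning while the schedule advances lets the network adapt to the lost weights
pruned.fit(x_train, y_train, epochs=2,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export; only the sparse weights remain
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```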
Model distillation for parameter reduction
An alternative to pruning is model distillation: this enables edge AI developers to create compact, efficient models by transferring knowledge from large ‘teacher’ models to smaller ‘student’ models.
In model distillation, the developer trains a student model to mimic the teacher’s behavior, rather than learning directly from the original training data. The student model learns from the teacher’s soft probability distributions over all classes, not just the hard labels. These soft targets contain richer information about class relationships and uncertainty, providing more nuanced learning signals.
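A minimal sketch of the training objective, written in PyTorch, is shown below. The temperature and weighting values are typical but arbitrary assumptions, and the teacher and student logits would come from the developer's own models.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Blend the soft-target loss (mimicking the teacher) with the ordinary
    hard-label loss. 'temperature' softens the distributions and 'alpha'
    weights the two terms; both values are illustrative assumptions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        soft_targets,
        reduction="batchmean",
    ) * (temperature ** 2)          # rescale so gradient magnitudes stay comparable
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```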
The benefits are substantial: student models can achieve up to 95% of the teacher’s performance, with a model which is between five and ten times smaller and faster.
Runtime optimizations on deployment to embedded hardware
So far, this discussion of ways to achieve low-power edge AI inference has focused on model optimization. But certain system design choices strongly affect the amount of power that an edge AI device consumes.
In general, the faster the device can perform an inference, the more quickly the system can revert to a low-power, quiescent state. One way to accelerate inferencing is to use dedicated AI hardware. Examples include:
- Neural processing unit (NPU), such as the Arm® Ethos™-U55 or Ethos-U85, which can be embedded in edge AI microcontrollers
- Digital signal processor (DSP), a hardware device which is well suited to the mathematical operations of neural networks
The speed and efficiency of an NPU or DSP can be enhanced by the correct design of the memory sub-system: tightly coupled memory (TCM) located close to the processor helps to reduce data load times, accelerating inference operations.
Finally, developers can categorize inference functions as either time-critical or low-priority: low-priority inferences can be scheduled to run in a low-power operating mode, for example at a reduced clock frequency, so that they consume less energy than they would in the full active mode.
Power management strategies
Alongside methods for accelerating inference, embedded device developers can also implement good practice for power management in edge devices. This could involve implementing techniques such as:
- Dynamic clock frequency and voltage scaling
- Using energy profiling tools to pinpoint operational states in which the greatest power savings can be made. The EnergyTrace™ technology, a power analysis tool from Texas Instruments, is a good example of this.
Benchmarking and validation of low-power edge AI inference operations
The purpose of implementing the techniques for edge AI model optimization and efficient edge inference is to achieve a noticeable reduction in overall power consumption. In order to validate that the techniques in use are having the desired effect, the developer needs to measure system performance.
The most important parameters to measure are:
- TOPS/W – the number of tera-operations per second achieved per Watt consumed. This measures how efficiently the processor converts power into AI compute: the higher the figure, the more inferencing work is done for each unit of energy. In an optimized system, there will be headroom in the energy budget for the processor to perform at its maximum inference speed (TOPS).
- Latency – the delay between loading data and producing an inference based on the data. The shorter the latency, the faster the system can revert to a low-power, energy-saving state.
- Energy/inference – this is closely related to TOPS/W: it is the product of average power and latency for a single inference (see the worked example after this list). The results from experimentation with different model optimizations will reveal the best trade-off between accuracy, speed and power consumption.
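The worked example below shows how these three figures relate to one another; all of the measured values (power, latency and operation count) are made-up numbers standing in for readings taken with a power analyzer.

```python
# Hypothetical bench measurements: substitute real readings from a power analyzer
avg_power_mw = 42.0          # average power during inference, in milliwatts
latency_ms = 18.0            # time taken per inference, in milliseconds
ops_per_inference = 3.2e9    # operation count of the deployed model

energy_per_inference_mj = avg_power_mw * latency_ms / 1000.0          # millijoules
effective_tops = ops_per_inference / (latency_ms / 1000.0) / 1e12     # TOPS achieved
tops_per_watt = effective_tops / (avg_power_mw / 1000.0)              # efficiency

print(f"Energy per inference: {energy_per_inference_mj:.2f} mJ")
print(f"Effective throughput: {effective_tops:.3f} TOPS")
print(f"Efficiency:           {tops_per_watt:.2f} TOPS/W")
```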
In measuring energy usage, it is important to take account of the real operating conditions in which the device under test will work – these conditions could be different from laboratory conditions. In the real world, the device might for instance be subject to interruptions from mission- or safety-critical functions which demand immediate compute time, or might operate in extreme temperatures which place extra stress on a battery power supply.

As well as validating performance specifications in the laboratory, embedded device manufacturers will want to perform long-term device testing to uncover potential problems that were outside the scope of the prototype design review. To assist in this, it is good practice to perform logging and profiling of devices on long-term test, or in the field, to pinpoint bottlenecks and other problems which can slow down edge AI inference, leading to an unexpected rise in power consumption.
Emerging trends in low-power edge AI
The development of low-power edge AI promises to bring substantial extra value to products across almost every market segment, from industrial, medical and consumer devices to the automotive sector, military and defense equipment, and security and surveillance.
The opportunities in embedded AI have led to a flowering of products, resources and organizations supporting the trend. Leading providers of AI software such as Google, Microsoft and Meta are best known for cloud-based LLMs and AI agents such as Gemini and Copilot, but they have also devoted substantial resources to enabling efficient edge inference.
This has been supported by the emergence of industry organizations such as the Edge AI Foundation, and by service providers such as Edge Impulse. Microcontroller manufacturers, from Infineon and STMicroelectronics to Renesas and Alif Semiconductor, have also introduced specialized products which include an NPU for acceleration of low-power edge inference.
A promising development is the introduction of a new type of AI accelerator: the neuromorphic processor, a hardware device which is modeled physically on the structure of the human brain. Start-up companies such as Blumind and Applied Brain Research have introduced neuromorphic processors.
Intel Labs is also carrying out neuromorphic research, producing experimental chips such as Loihi 2. Intel’s neuromorphic initiative is producing performance breakthroughs by co-designing optimized hardware with next-generation AI software.
The number of embedded AI implementations is likely to grow further with the availability of Automated Machine Learning (AutoML; see www.automl.org), which provides methods and processes to make machine learning accessible to non-ML experts.
Today, embedded AI development depends on human ML experts to perform tasks including:
- Pre-processing and cleaning training data
- Selecting and constructing appropriate features
- Selecting an appropriate model family
- Optimizing model hyperparameters
- Designing the topology of neural networks (if deep learning is used)
- Post-processing machine learning models
- Analyzing the results obtained
As the complexity of these tasks is often beyond non-ML experts, the rapid growth of ML applications has created demand for off-the-shelf ML methods that can be used easily and without expert knowledge. AutoML aims to automate these stages of ML software development.
Balancing intelligence and efficiency at the edge
Low-power edge AI is one of the most important developments in embedded computing in the 2020s. Nearly all embedded device manufacturers will face the challenge of running AI at low power at the edge.
This article has shown the main techniques that embedded system designers can use to implement low-power AI at the edge, including:
- Careful model selection
- Model quantization
- Model pruning
- Acceleration of inferencing
- Dynamic power management
In implementing these techniques, developers will always need to have regard to the specific requirements of their application. In edge AI model optimization, there is always a trade-off between power consumption, memory footprint, latency, and accuracy.
The state of the art in edge AI inference is constantly changing as new hardware and software technologies and products emerge. To stay up to date with this exciting field, developers should keep turning to ipXchange.tech, the home of deep tech news for electronics design engineers.
And embedded system developers can make sure that they never miss the latest news from ipXchange by subscribing to the newsletter – just email info@ipxchange.tech today.