Enhancing Embedded Systems with Generative AI and Local LLMs
The widespread availability of tools like ChatGPT has familiarised many with the capabilities of Generative AI (GenAI), particularly chatbots. While high-end models often reside in the cloud, there is growing interest in exploring the feasibility and benefits of running Generative AI and Large Language Models (LLMs) locally on embedded devices.
This article examines the technical aspects, challenges, and potential applications of deploying advanced AI capabilities on resource-constrained hardware, drawing on the experiences and tests conducted by Michaël Uyttersprot, Senior Manager AI-ML & Vision EMEA at Avnet Silica.
What are Generative AI and Large Language Models?
Generative AI is a type of AI that focuses on creating something new that did not exist before, which distinguishes it from traditional machine learning and deep learning, where the focus is typically on pattern recognition. Examples include generating images or text. In the context of chatbots, the generated content is primarily text.
Large Language Models (LLMs) are central to text-based GenAI; smaller variants are often called Small Language Models (SLMs). These models are characterised by their parameter count: the largest models have hundreds of billions of parameters, whereas models suited to embedded devices typically range from 1 billion to 10 billion parameters.
Benefits of Running Generative AI Models Locally
While cloud-based GenAI models offer high performance, running LLMs locally on an embedded device presents several compelling advantages. Firstly, running locally removes the dependency on a continuous internet connection and the latency it introduces, enabling near real-time interaction and a more responsive user experience.
Working without an internet connection also enhances data privacy. Users often ask chatbots confidential questions, and running the model locally ensures that this sensitive information never leaves the device. Finally, unlike cloud services, which usually require subscriptions, local deployments can use free, open-source models, eliminating ongoing costs.
Challenges of Deploying LLMs on Embedded Devices
Despite the benefits, running LLMs locally on embedded devices is quite complex due to inherent constraints.
Embedded devices offer far less processing power than PCs. They often have limited RAM (4 or 8 GB) and may lack a dedicated GPU, which is a significant constraint when loading and running LLMs.
The sheer size of language models can also pose storage issues. High-end cloud models have hundreds of billions of parameters, whereas embedded systems are typically limited to models with between 1 billion and 10 billion parameters. These models must also be optimised to run in real time; waiting a minute for a response to a text prompt is not practical.
Some embedded systems may not run standard operating systems, such as Linux or Windows, which can complicate integration and deployment. Driver compatibility can also be an issue. For battery-powered devices, the power consumption of running an LLM can be a restriction.

Figure 1: Podcast feasibility study with the host and SME expert avatars discussing generative AI and a live Q&A
Implementing GenAI on an Embedded Device
The idea of implementing GenAI on an embedded device was inspired by Google’s NotebookLM, an AI-powered tool that can generate podcast-style discussions. The setup involved creating two avatars, a host and an expert, designed to discuss GenAI (Figure 1). The flow of the conversation was pre-defined (introduction, question/answer pairs, wrap-up), but the actual text and content generated for the questions and answers were always new due to the generative nature of the AI.
The implementation also allowed for human interaction, enabling a user to interrupt the podcast and ask questions. The avatars were designed to respond to the user’s question and then, if the question was unrelated, steer the conversation back to the original topic of GenAI.
Hardware and Software Used
The implementation used an ASRock box PC based on an AMD Ryzen Embedded 8000 series processor, running Windows IoT. The system includes a GPU and AI engines (though these were not specifically used in this project), RAM, and an SSD.
The software architecture included a front-end web interface for visualisation and interaction with the avatars. A Python backend server managed the core functionality, comprising three key components: a speech-to-text (STT) engine, two different LLMs for each avatar, and a text-to-speech (TTS) engine (Figure 2).
The flow implementation involved the LLM generating text, which was then sent to the TTS engine for audio output. Crucially, the response generated by one avatar’s LLM was fed back as input to the other avatar’s LLM, enabling a conversational interaction.
For human interaction, the microphone captured the user’s speech, the STT engine converted it into text, the text was fed into an LLM to generate a response, and the TTS engine finally converted that response back into audio output.
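A rough sketch of this flow in Python (the language of the backend server) is shown below. The function names and placeholder bodies are hypothetical stand-ins for the two LLMs, the STT engine, and the TTS engine; they illustrate the wiring only and are not the actual implementation.

```python
# Hypothetical sketch of the podcast conversation loop: two avatar LLMs feed
# each other, with a human allowed to interrupt via the STT path.

def generate_reply(avatar: str, prompt: str) -> str:
    # Placeholder: call the LLM assigned to this avatar on the prompt.
    return f"[{avatar} replies to: {prompt}]"

def synthesise(avatar: str, text: str) -> None:
    # Placeholder: send the text to the TTS engine and play the audio.
    print(f"{avatar} says: {text}")

def capture_question() -> str | None:
    # Placeholder: return transcribed speech from the STT engine, or None.
    return None

def run_podcast(opening_line: str, turns: int = 6) -> None:
    speakers = ["host", "expert"]
    message = opening_line
    for turn in range(turns):
        speaker = speakers[turn % 2]
        # The reply of one avatar becomes the input of the other.
        message = generate_reply(speaker, message)
        synthesise(speaker, message)

        # Human interruption: answer the question, then steer back to topic.
        question = capture_question()
        if question is not None:
            answer = generate_reply("expert", question)
            synthesise("expert", answer)
            message = "Returning to generative AI: " + message

run_podcast("Welcome to our podcast on generative AI at the edge.")
```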
Core Technologies and Performance
Figure 2: Generative AI podcast conversation flow implementation
Several specific tools and models were evaluated for their suitability in addressing the core components of the investigation.
For the local LLM, Google’s Gemma 2 model (2 billion parameters) was found to be quite impressive and could fit on systems with as little as 2 GB of RAM. Other models, such as Gemma 3, Phi (Microsoft), and Mistral (Mistral AI), were also examined. Models with 9 billion parameters were tested as well, showing differences in vocabulary complexity compared to the smaller models.
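The article does not say which runtime served the local models, but as one illustration of how a 2-billion-parameter model can fit in a small memory footprint, a 4-bit quantised GGUF build can be loaded on CPU with the open-source llama-cpp-python package. The model filename below is a placeholder, not a detail from the project.

```python
# Illustrative only: running a small quantised LLM on CPU with llama-cpp-python
# (pip install llama-cpp-python). The GGUF filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-2b-it-Q4_K_M.gguf",  # 4-bit quantised weights, roughly 1.7 GB
    n_ctx=2048,     # context window in tokens
    n_threads=4,    # match the CPU core count of the embedded board
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In two sentences, what is generative AI?"}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```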
Cloud-based LLMs were tested for comparison. The Llama (Meta AI), ChatGPT (OpenAI), and Claude 3/3.5 (Anthropic) models generally offer higher quality but involve subscription costs.
Piper (the Rhasspy TTS engine) was used for local text-to-speech. Piper supports around 30 languages and is open-source. While usable, its output was noted as sounding less natural than cloud options, although improvements in CPU-based local TTS quality are anticipated soon. ElevenLabs was used for the cloud-based TTS comparison, providing higher-quality, more natural-sounding audio; as a cloud service, however, it requires a subscription.
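Piper is typically distributed as a command-line tool that reads text on standard input and writes a WAV file, so a Python backend can drive it via a subprocess call, as in the hedged sketch below. The voice model filename is a placeholder, and the exact invocation may differ between Piper releases.

```python
# Sketch: invoking the Piper TTS command-line tool from Python. Assumes the
# piper binary and a downloaded voice model are present; filenames are placeholders.
import subprocess

def speak(text: str, wav_path: str = "reply.wav") -> None:
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", wav_path],
        input=text.encode("utf-8"),
        check=True,
    )

speak("Welcome to the podcast on generative AI at the edge.")
```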
Whisper (OpenAI) was used for the local STT engine. It is very popular and available in different model sizes (tiny, base, small, medium), offering a trade-off between resource usage and speed. A key benefit of using Whisper is that it includes features like noise reduction. A Google speech recognition library was used for the cloud comparison, yielding results similar to those of Whisper.
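The open-source whisper Python package exposes these model sizes directly; transcribing a captured question might look like the sketch below (the audio filename is a placeholder for whatever the microphone-capture step produces).

```python
# Sketch: transcribing a recorded question with OpenAI's open-source Whisper
# (pip install openai-whisper; ffmpeg must be installed). Filename is a placeholder.
import whisper

model = whisper.load_model("base")          # tiny/base/small/medium trade accuracy for speed
result = model.transcribe("question.wav")   # returns a dict containing the recognised text
print(result["text"])
```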
The quality of the generated text from even a 2 billion parameter model like Gemma 2 was considered quite high, containing content comparable to what might be found across the internet and in books. While there is a quality difference compared to high-end cloud models, the results with smaller local models were deemed impressive and usable for embedded applications. Other future podcast-type implementations could include animated avatars with lip-syncing for more engaging interactions.
Figure 3: The evolution of Generative AI – from prediction to full-autonomy
Future Directions and Agentic AI
Beyond basic GenAI, future work includes exploring Agentic AI, where a software tool (such as a chatbot avatar) can make decisions and take actions to execute tasks, such as booking a hotel (Figure 3). This involves connecting the LLM (which primarily generates words) with other tools. Initiatives such as the Model Context Protocol are emerging standards for linking various tools with LLMs. Examples include using voice commands via an avatar to control software tools, such as video editing software.
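The article does not describe a concrete agent implementation; purely as an illustration, a minimal agent step can ask the LLM to emit a structured action (here JSON) and dispatch it to a locally registered tool. Every name in the sketch below is hypothetical.

```python
# Hypothetical agent step: if the LLM output is a JSON tool call, dispatch it
# to a local function; otherwise treat it as a plain conversational reply.
import json

def book_hotel(city: str, nights: int) -> str:
    return f"Booked {nights} night(s) in {city}."   # stand-in for a real booking API

TOOLS = {"book_hotel": book_hotel}

def run_agent_step(llm_output: str) -> str:
    try:
        action = json.loads(llm_output)
        return TOOLS[action["tool"]](**action["args"])
    except (ValueError, KeyError, TypeError):
        return llm_output  # not a tool call, pass the text through

print(run_agent_step('{"tool": "book_hotel", "args": {"city": "Brussels", "nights": 2}}'))
```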
The potential applications for running GenAI and LLMs on embedded devices are broad. For instance, next-generation GenAI-enabled conferencing systems could take meeting notes and translate different languages in real-time. GenAI could also help visually impaired individuals access retail outlets and transportation hubs, providing natural language interaction.
Conclusion
The ability to implement GenAI models on embedded devices is increasingly feasible. Embedded systems impose tighter resource constraints than PCs and require models to be carefully optimised, but the benefits are significant.
Running locally offers data privacy, reduced costs, and lower latency. The quality achieved even with smaller models (1-10 billion parameters) is deemed sufficiently high for many applications, and this quality is expected to improve rapidly.
As LLMs continue to evolve and optimise for smaller footprints, we are likely to see a growing number of innovative GenAI applications emerge on embedded devices across various industries.
Working on an Artificial Intelligence project?
Our experts bring insights that extend beyond the datasheet, availability and price. The combined experience contained within our network covers thousands of projects across different customers, markets, regions and technologies. We will pull together the right team from our collective expertise to focus on your application, providing valuable ideas and recommendations to improve your product and accelerate its journey from the initial concept out into the world.
