Running LLMs Locally
Local execution brings generative AI to embedded platforms
The large language model (LLM) is the foundation for the revolution in generative artificial intelligence (AI) that delivers natural interaction and even reasoning capability. The LLM also forms the basis for a new generation of agentic AI systems that can organise and perform complex tasks autonomously, adapting and responding to changing conditions.
Fundamentals
What is an LLM?
An LLM is a type of AI model that combines scale, with parameter counts typically running from hundreds of millions into the billions, with the transformer neural-network architecture, which identifies connections between elements of the training data through a mechanism called self-attention. Pretrained on a vast corpus of data and then fine-tuned using techniques such as reinforcement learning, LLMs demonstrate a degree of flexibility and reasoning-like capability that sets them apart from earlier types of AI based on artificial neural networks.
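As a concrete illustration of self-attention, the short Python sketch below computes scaled dot-product attention, the core operation of a transformer layer, for a toy sequence. The sizes and random weights are chosen purely for illustration and are not taken from any particular model.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax used to turn attention scores into weights
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Project each token of the input sequence into query, key and value spaces
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    # Every token scores its relevance to every other token...
    weights = softmax(q @ k.T / np.sqrt(d))
    # ...and the output is a relevance-weighted mixture of the values
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # toy sizes; real LLMs use thousands
x = rng.standard_normal((seq_len, d_model))  # stand-in for token embeddings
wq, wk, wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)   # -> (4, 8)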
What are the deployment options for LLMs?
Much of the early effort in LLMs focused on cloud-based deployment to leverage the environment’s ability to run multiple computing engines in parallel. This scale-out paradigm has made it possible to implement LLMs that harness trillions of parameters to improve performance on complex tasks.
LLMs available in the cloud are often based on proprietary technology, accessed using application programming interfaces (APIs). That limits the ability of developers to customise them for specific tasks. The development of open-source offerings such as Meta’s Llama series and others hosted by Hugging Face has made it possible for developers to explore more deployment options and use-cases, including local rather than cloud-based AI.
Why would someone want to run an LLM locally instead of using a cloud API?
Local execution of LLMs provides several advantages. A common concern with cloud-based LLMs is that they might compromise privacy. There may be no way to ensure that data uploaded to the cloud is not reused by the model in interactions with other users. There is also the risk that attackers intercept data in transit.
Cost, communications reliability and latency present additional obstacles. Embedded and control systems often cannot tolerate the additional latency introduced by transmitting data to a remote cloud server. Industrial sites may not be able to guarantee the constant, high-bandwidth connections to the internet that cloud AI demands. And token-processing costs present the user with a potentially large bill for LLM usage. Local LLM operation overcomes these issues, though it may need high-speed processing hardware combined with models optimised for the environment.
What are common use-cases for local LLMs?
There are many situations where local LLMs can improve the capabilities of systems. The nature of the LLM makes it well suited to implementing natural-language interfaces for robots and other systems that need an intuitive user interface. By combining speech recognition with natural-language processing, an LLM lets users interact with systems using spoken commands. LLMs can also handle other kinds of complex information, generating reports and summaries that help users determine the best course of action to take.
In agentic workflows, an LLM can help a system react to changing conditions in a smarter way. That relieves developers of the need to create complex decision chains that try to anticipate every situation. Security and intrusion-analysis systems can harness the same power, spotting patterns and trends in data without the need to rely on existing threat signatures.
Can local LLMs be integrated into existing applications or workflows?
LLMs designed to run locally will often be integrated into existing application architectures. Robotics provides an example. A growing number of implementations use an LLM-based AI to handle high-level planning and coordinate lower-level motion policies. Conventional control loops manage the individual policies.
LLMs can be combined in other ways. They can add an intuitive user interface to a wide variety of embedded systems by translating natural language into sequences of commands and procedures that are preprogrammed into the core control systems. Adding an LLM-based interface can reduce the learning curve for industrial controls and similar systems that otherwise would need to be programmed using specialised languages.
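The pattern can be sketched in a few lines of Python. Everything below is illustrative rather than taken from a real product: the command set, the prompt and the run_local_llm helper are hypothetical, and the model call is stubbed with a canned reply; in a deployed system it would invoke a local inference engine such as llama.cpp.

import json

# Hypothetical command set exposed by the machine's existing control firmware
ALLOWED_COMMANDS = {"move_to", "grip", "release", "home"}

PROMPT_TEMPLATE = (
    "Translate the operator request into a JSON list of commands.\n"
    "Only use these commands: move_to(x, y), grip(), release(), home().\n"
    "Request: {request}\nJSON:"
)

def run_local_llm(prompt: str) -> str:
    # Placeholder: a real system would call a local inference engine here.
    # A canned reply is returned so the surrounding flow can be exercised.
    return '[{"cmd": "move_to", "args": [120, 45]}, {"cmd": "grip", "args": []}]'

def natural_language_to_commands(request: str) -> list:
    reply = run_local_llm(PROMPT_TEMPLATE.format(request=request))
    commands = json.loads(reply)
    # Reject anything outside the preprogrammed command set before it
    # reaches the low-level control loops
    return [c for c in commands if c["cmd"] in ALLOWED_COMMANDS]

print(natural_language_to_commands("Pick up the part at position 120, 45"))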

System requirements
Can LLMs be run on embedded and edge-computing hardware?
Cloud LLMs often have large memory and computing requirements. With tens of billions of parameters, the Llama3-70B model needs 140GB of storage space to hold the full model, assuming all of the weights are encoded using a 16-bit floating-point format (FP16). A typical recommended hardware configuration for real-time response is to use four enterprise-grade graphics processing units (GPUs), each with 40GB of memory. However, some optimisations allow LLMs to run on embedded platforms, such as Qualcomm’s Dragonwing and Snapdragon multicore SoCs, NXP’s i.MX9 SoCs and Raspberry Pi 5.
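The 140GB figure follows directly from the parameter count, as the back-of-the-envelope calculation below shows (treating 1GB as 10^9 bytes and counting only the weights, not runtime working memory):

params = 70e9                  # approximate parameter count of Llama3-70B
bytes_per_weight = 2           # FP16 uses 16 bits, i.e. 2 bytes, per weight
model_gb = params * bytes_per_weight / 1e9
gpu_pool_gb = 4 * 40           # four enterprise GPUs with 40GB each
print(f"FP16 weights: {model_gb:.0f}GB; GPU memory in the pool: {gpu_pool_gb}GB")
# -> FP16 weights: 140GB; GPU memory in the pool: 160GB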
You can learn more about the hardware used for Avnet Silica's own 'Edge GenAI' chatbot and the Generative AI Podcast in the resources at the end of this article.
What optimisations are used to enable local LLM processing?
Several optimisations are available to developers working with open-source models; they have been applied to Meta’s Llama, Alibaba’s Qwen and the SmolLM models developed by Hugging Face. One of the most important optimisations for embedded and edge use resulted from extensive analyses of the trade-offs made when pretraining LLMs. A key insight is that extending pretraining time over larger quantities of high-quality, diverse data improves overall accuracy, and this extended training enables the use of LLMs with smaller parameter counts. Microsoft’s Phi-2, SmolLM3 and TinyLlama, for example, provide high levels of accuracy with around 3 billion parameters or fewer. When fine-tuned for specific applications and, possibly, coupled with techniques such as retrieval-augmented generation (RAG), these smaller models can offer a better fit for embedded applications than larger, general-purpose models.
Quantisation of the weights, sometimes implemented with microscaling formats, provides further reductions in memory footprint and computing overhead. Replacing FP16 weights with 8-bit integers halves the storage needed for Llama2-70B to around 70GB, and the 4-bit weights adopted by some LLMs optimised for embedded use halve it again to roughly 35GB. By combining quantisation with smaller parameter counts, it is possible to bring the memory footprint below 2GB, allowing execution on a range of embedded platforms, including those without AI-focused accelerators, such as the Raspberry Pi 5.
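The same weights-only arithmetic shows how quantisation and smaller parameter counts combine. The sketch below ignores activations, the KV cache and runtime overhead, so real deployments need some headroom on top of these figures.

def weights_gb(params, bits_per_weight):
    # Storage for the weights alone, with 1GB taken as 10^9 bytes
    return params * bits_per_weight / 8 / 1e9

for name, params in [("Llama2-70B", 70e9), ("3B-class small model", 3e9)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: {weights_gb(params, bits):.1f}GB")
# Llama2-70B drops from 140GB (FP16) to about 70GB (8-bit) and 35GB (4-bit);
# a 3B-class model at 4-bit needs roughly 1.5GB, within reach of many embedded boards.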
Implementation
What are the trade-offs between model size, performance and accuracy?
There are trade-offs between model size, accuracy and performance, the latter measured in tokens per second, and they need to be assessed against the target application. Research has shown that increasing training time and data quantity for smaller models improves their accuracy, but at the cost of greater computing resources on the training system. Fine-tuning before applying optimisations such as microscaling should improve accuracy on the target task and helps recover losses that may result from the lower-resolution calculations. RAG can also improve the results delivered by small models fine-tuned for specific tasks, such as generating commands for a robot: it supplies the model with relevant, predefined text fragments at inference time, grounding its outputs and helping to reduce problems such as hallucination.
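The RAG pattern itself is straightforward, as the deliberately simplified sketch below shows. The knowledge fragments are invented for illustration, word overlap stands in for a real embedding-based search, and the assembled prompt would be passed to a locally running model rather than printed.

import re

# Hypothetical knowledge fragments prepared for a machine-control application
FRAGMENTS = [
    "The conveyor accepts the commands start, stop and set_speed(mm_per_s).",
    "The gripper accepts grip() and release(); maximum payload is 2 kg.",
    "Safety: motion commands are ignored while the light curtain is broken.",
]

def tokens(text):
    # Crude tokenisation; a real system would use vector embeddings instead
    return set(re.findall(r"[a-z_]+", text.lower()))

def retrieve(query, fragments, top_k=2):
    # Rank fragments by word overlap with the query and keep the best matches
    q = tokens(query)
    return sorted(fragments, key=lambda f: len(q & tokens(f)), reverse=True)[:top_k]

def build_prompt(query):
    # Grounding the model in retrieved text keeps answers within known facts
    context = "\n".join(retrieve(query, FRAGMENTS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How do I slow down the conveyor?"))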
What tools or libraries are available for local LLM execution?
Several tools and libraries are now available for local LLM execution, including runtimes that target Arm and Intel CPUs as well as GPUs and AI accelerators. The llama.cpp C++ library is one example of an embeddable LLM inference engine. Other tools, such as LM Studio and Ollama, are designed to host LLM inference engines and provide an easy-to-use management layer for the AI. The mlc-llm project provides a compiler that ports LLMs to a range of target hardware platforms.
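As an example of what embedding one of these runtimes looks like, the sketch below uses the llama-cpp-python bindings for llama.cpp. The model file name is a placeholder for whichever quantised GGUF model has been chosen for the target, and the context size and thread count would be tuned to the hardware.

from llama_cpp import Llama   # Python bindings for the llama.cpp inference engine

# The path is a placeholder for any locally stored, 4-bit-quantised GGUF model
llm = Llama(model_path="models/small-chat-model.Q4_K_M.gguf",
            n_ctx=2048,       # context window, in tokens
            n_threads=4)      # match the CPU cores available on the target board

reply = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Summarise today's production log in two sentences."}],
    max_tokens=128,
)
print(reply["choices"][0]["message"]["content"])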
Many pretrained models are available from hubs such as Hugging Face, as well as from processor vendors. Qualcomm, for example, has a set of pre-compiled models hosted on its AI Hub.
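Pre-quantised model files can be fetched programmatically with the huggingface_hub library. The repository and file names below are placeholders rather than recommendations, and would be replaced by whichever hub-hosted model suits the application.

from huggingface_hub import hf_hub_download

# Placeholder repository and file names; substitute any GGUF-format model
# published on the Hugging Face Hub
local_path = hf_hub_download(
    repo_id="example-org/example-small-llm-GGUF",
    filename="example-small-llm.Q4_K_M.gguf",
)
print("Model stored at", local_path)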
Where can I find evaluation hardware and support for applications that need local LLMs?
Avnet Silica has supported several customers with their local LLM projects and undertaken its own Generative AI at the Edge projects in recent years, including the 'Edge GenAI' chatbot, which has been demonstrated at several trade shows across EMEA, and the Generative AI podcast, most recently showcased at Hardware Pioneers Max in 2025.
Avnet Silica offers a range of embedded solutions suitable for generative AI. For example, its TRIA range of system-on-module (SOM) platforms hosts high-performance AI engines such as the Qualcomm Dragonwing. These modules enable businesses to tailor AI solutions for diverse industry applications while maintaining high performance and energy efficiency.
Avnet Silica also offers a comprehensive ecosystem of support services that extends beyond hardware and software. The company has experience in choosing, tuning and deploying LLMs on a range of embedded platforms and can advise customers on ways to optimise their use of local generative AI.
Who is Avnet Silica's AI expert, Michaël Uyttersprot?
Michaël Uyttersprot is Avnet Silica's Market Segment Manager for Artificial Intelligence, Machine Learning and Vision. He has 20 years of experience in the industry, having started his career as a robotics engineer. His current focus is on supporting the development and promotion of embedded vision and deep-learning solutions for customer projects involving AI and machine learning. Michaël is regarded as a thought leader in AI and ML and has presented his work to large audiences at industry events such as Hardware Pioneers Max. He also regularly takes part in webinars in partnership with major AI players in the semiconductor space, including NXP, STMicroelectronics and AMD.
Articles by Michael
Working on an Artificial Intelligence project?
Our experts bring insights that extend beyond the datasheet, availability and price. The combined experience contained within our network covers thousands of projects across different customers, markets, regions and technologies. We will pull together the right team from our collective expertise to focus on your application, providing valuable ideas and recommendations to improve your product and accelerate its journey from the initial concept out into the world.
Enhancing Embedded Systems with Generative AI and Local LLMs
This article examines the technical aspects, challenges and potential applications of deploying advanced AI capabilities on resource-constrained hardware, drawing on the experiences and tests of Michaël Uyttersprot, Senior Manager AI-ML & Vision EMEA at Avnet Silica. Learn more about the benefits of running Generative AI models locally, the challenges of deploying LLMs on embedded devices, the core technologies and the future directions of Generative AI.
Generative AI at the Edge Chatbot Using Local LLMs
The Edge Gen AI Chatbot (demonstrated at Electronica 2024 and embedded world 2025) is a locally operated chatbot that runs directly on embedded devices, delivering fast, low-latency responses while ensuring enhanced privacy. Its modular software design allows flexibility across diverse hardware configurations, making it adaptable to specific application requirements. Additionally, it is compatible with a broad selection of TRIA System-on-Modules (SOMs) to optimise performance. Comprehensive software support is provided, offering robust resources and tools for seamless integration and customisation.
Generative AI Overview
Head over to our Generative AI overview page for more Generative AI articles, applications and resources.
