Get models like Phi-2, Mistral and LLaVA running locally on a Raspberry Pi with Ollama
Have you ever thought about running your own Large Language Models (LLMs) or Vision Language Models (VLMs) on your own device? You probably have, but the thought of setting things up from scratch, having to manage the environment, downloading the right model weights, and the lingering doubt as to whether your device can even handle the model probably made you hesitate.
Let’s go even further. Imagine using your own LLM or VLM on a device no bigger than a credit card: a Raspberry Pi. Impossible? No way. I mean, I’m writing this post after all, so it’s definitely possible.
Possible, yes. But why would you do it?
Edge LLMs seem pretty far-fetched at the moment. But this particular niche use case should mature over time, and we will definitely see interesting edge solutions deployed with a fully local generative AI stack running on the device at the edge.
It’s also about pushing the boundaries to see what’s possible. If it can be done at this extreme end of the computing scale, then it can be done at any level between a Raspberry Pi and a big, powerful server GPU.
Traditionally, edge AI has been closely linked to computer vision. Exploring the deployment of LLMs and VLMs at the edge adds an exciting dimension to this field, which is only just emerging.
More importantly, I just wanted to do something fun with my recently acquired Raspberry Pi 5.
So how do you achieve all this on a Raspberry Pi? Use Ollama!
What is Ollama?
Ollama has become one of the best solutions for running local LLMs on your own personal computer without having to worry about setting up from scratch. With just a few commands, everything can be configured without any problems. Everything is self-contained and works wonderfully in my experience on multiple devices and models. It even exposes a REST API for model inference, so you can let it run on the Raspberry Pi and call it from your other apps and devices if you want.
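For example, once Ollama is running, a single curl call against its REST API is enough to get a completion from any model you have pulled. The endpoint below is the one documented by Ollama; the prompt is just an illustration, and "phi" is the model we install later in this article:

curl http://localhost:11434/api/generate -d '{
  "model": "phi",
  "prompt": "Why is the sky blue?",
  "stream": false
}'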
There is also the Ollama Web UI, a beautiful piece of AI UI/UX that works seamlessly with Ollama for those who shy away from command-line interfaces. It’s essentially a local ChatGPT interface, if you will.
Together, these two open source software products provide what I consider to be the best locally hosted LLM experience today.
Ollama and Ollama Web UI also support VLMs like LLaVA, which opens even more doors for this cutting-edge generative AI use case.
The technical requirements
All you need is the following:
- Raspberry Pi 5 (or 4 for a slower configuration) — Opt for the 8 GB RAM variant to fit the 7B models.
- SD card – minimum 16 GB; the larger it is, the more models you can fit. Load it beforehand with a suitable operating system such as Raspbian Bookworm or Ubuntu
- An Internet connection
As I mentioned earlier, running Ollama on a Raspberry Pi is already at the extreme end of the hardware spectrum. Essentially, any device more powerful than a Raspberry Pi, provided it’s running a Linux distribution and has a similar memory capacity, should theoretically be able to run Ollama and the models discussed in this article.
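If you are trying this on different hardware, a quick way to check how much memory and which CPU architecture you are working with before picking a model size:

free -h
uname -m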
1. Installing Ollama
To install Ollama on a Raspberry Pi, we will avoid using Docker to save resources.
In the terminal, run
curl https://ollama.ai/install.sh | sh
You should see something similar to the image below after running the command above.
As the output says, navigate to 0.0.0.0:11434 to verify that Ollama is running. It is normal to see the message “WARNING: No NVIDIA GPU detected. Ollama will run in CPU-only mode.” since we are using a Raspberry Pi. But if you are following these instructions on something that is supposed to have an NVIDIA GPU, something has gone wrong.
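If you prefer to check from the terminal instead of the browser, the server’s root endpoint simply replies with a short status string:

curl http://0.0.0.0:11434
# Expected output: Ollama is running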
For any issues or updates, refer to the Ollama GitHub repository.
2. Run LLMs via command line
Take a look at the official Ollama model library for a list of models that can be run with Ollama. On an 8 GB Raspberry Pi, anything larger than 7B will not fit. Let’s use Phi-2, a 2.7B-parameter LLM from Microsoft, now released under the MIT license.
In the terminal, run
ollama run phi
Once you see something similar to the result below, you already have an LLM running on the Raspberry Pi! It’s so simple.
You can try other models like Mistral, Llama-2, etc., just make sure there is enough space on the SD card for the model weights.
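For example, switching models is just a matter of pulling them by the names listed in the Ollama library, and removing them again frees up space on the SD card:

ollama pull mistral
ollama run mistral
# List downloaded models and their sizes
ollama list
# Remove a model you no longer need
ollama rm mistral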
Naturally, the larger the model, the slower the output will be. On Phi-2 2.7B I can get around 4 tokens per second. But with a Mistral 7B, the generation speed drops to around 2 tokens per second. A token is roughly equivalent to a single word.
We now have LLMs running on the Raspberry Pi, but we’re not done yet. The terminal is not for everyone. Let’s get Ollama’s web UI working too!
3. Installing and Running Ollama Web UI
We will follow the instructions on the Ollama Web UI official GitHub repository to install it without Docker. It requires Node.js to be at least version 20.10, so we will follow that. It also requires Python to be at least 3.11, which Raspbian Bookworm already provides for us.
First we need to install Node.js. In the terminal, run
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - &&\
sudo apt-get install -y nodejs
For future readers: replace 20.x with a more recent version if necessary.
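A quick sanity check that the installed versions meet the requirements mentioned above:

# Should print v20.x or newer
node -v
npm -v
# Should be 3.11 or newer
python3 --version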
Then run the code block below.
git clone https://github.com/ollama-webui/ollama-webui.git
cd ollama-webui/
# Copying required .env file
cp -RPp example.env .env
# Building Frontend Using Node
npm i
npm run build
# Serving Frontend with the Backend
cd ./backend
pip install -r requirements.txt --break-system-packages
sh start.sh
This is a slight modification of what is provided on GitHub. Note that for simplicity and brevity we do not follow best practices such as using virtual environments, and we use the --break-system-packages flag. If you encounter an error like uvicorn not found, restart the terminal session.
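For reference, a rough sketch of the better-practice route with a virtual environment would look something like this (not used in this article; you would create and activate the venv before installing the requirements and starting the backend):

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
sh start.sh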
If all goes well, you should be able to access the Ollama Web UI on port 8080 via http://0.0.0.0:8080 on the Raspberry Pi itself, or via http://<the Raspberry Pi’s IP address>:8080 from another device on the same network.
Once you’ve created an account and logged in, you should see something similar to the image below.
If you downloaded model weights earlier, you should see them in the drop-down menu as shown below. Otherwise, you can go to Settings to pull a model.
The whole interface is very clean and intuitive, so I won’t explain much about it. It’s really a very well done open source project.
4. Running VLMs via Ollama Web UI
As I mentioned at the beginning of this article, we can also run VLMs. Let’s launch LLaVA, a popular open source VLM that also happens to be supported by Ollama. To do this, download the weights by pulling ‘llava’ through the interface.
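If you prefer the terminal, the same weights can also be pulled with the Ollama CLI before switching back to the web UI, and recent Ollama versions let you reference a local image file directly in the prompt of a multimodal model (the file path below is just a placeholder):

ollama pull llava
ollama run llava "Describe this image: ./test-image.jpg"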
Unfortunately, unlike the text-only LLMs, interpreting an image takes quite a while on the Raspberry Pi. The example below took around 6 minutes to process. Most of that time is likely because the image-processing side is not yet properly optimized, but this will certainly change in the future. The token generation speed is approximately 2 tokens/second.
Wrapping it all up
At this point, we have pretty much achieved the goal of this article. To recap, we have successfully used Ollama and Ollama Web UI to run LLMs and VLMs like Phi-2, Mistral and LLaVA on the Raspberry Pi.
I can certainly imagine quite a few use cases for locally hosted LLMs running on the Raspberry Pi (or another small edge device), especially since 4 tokens/second seems to be an acceptable speed with streaming for some use cases, if we opt for models around the size of Phi-2.
The field of “small” LLMs and VLMs, somewhat paradoxically named given their “large” designation, is an active area of research, with quite a few models released recently. Hopefully this emerging trend continues and more efficient, more compact models keep being released! Definitely something to watch in the coming months.
Disclaimer: I have no affiliation with Ollama or Ollama Web UI. All views and opinions are my own and do not represent any organization.