Running an Open Language Model Locally
Host Ollama in Podman on Windows, benefit from an Nvidia graphics chip, and use the model in Visual Studio Code
In the scope of bbv’s Focus Day 2024 (see LinkedIn post), I started using open large language models (LLMs) locally. While I had tested (and liked!) GitHub Copilot in 2022, it has some drawbacks for me: it costs a monthly subscription fee and, more critically for me, transmits content to the cloud 1. Therefore, I was happy to learn about a free alternative: Ollama.
So, this page is about getting Ollama running locally on a Windows computer using only free tools.
Installing Podman
You could also use Docker, of course. However, I’m not a particular fan of Docker these days due to their licensing, and because of Podman’s advantages: it is rootless and daemonless. So, first, install Podman. I found this page quite helpful for that: Podman Tutorials - Podman for Windows. Just follow the steps until (and including) “Starting Machine”.
Configure Container to Use the Nvidia graphics chip
This step is only needed if you want to benefit from LLM acceleration on an Nvidia graphics chip. See here to find out whether your graphics card is supported by Ollama.
Installing Podman was easy. This step here cost me quite some sweat, but I think it’s worth doing to get full hardware acceleration when running your local LLM. Here’s how I did it:
Install the Nvidia Container Toolkit
First, the Nvidia Container Toolkit needs to be installed on the host running the Podman containers. In other words, this has to be executed in the WSL distribution created for Podman. The instructions can be found on Nvidia - Installing the NVIDIA Container Toolkit. Note that the Podman machine is based on a Red Hat image; hence, follow the instructions using yum!
To enter the Podman machine use the following command (in a Windows command line):
$> wsl -d podman-machine-default
Then install the toolkit:
$> curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
     sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
$> sudo yum install -y nvidia-container-toolkit
That’s all on this page. Don’t follow the “configuration” instructions! Since we’re using Podman (instead of Docker), a different interface is needed to access the graphics chip’s full capabilities in the container.
Namely, the Container Device Interface (CDI) is used when running a Podman container. The instructions can be found on Nvidia - Support for Container Device Interface. First, the installed Nvidia Container Toolkit is used to generate the CDI specification, i.e. the list of graphics card capabilities:
$> sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
As listed on the Nvidia instructions page, the generated devices should now show the installed card:
$> nvidia-ctk cdi list
In my case I had the following output:
If you see some similar output, you should now be ready for the Ollama container in the next section.
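If you want to double-check the CDI setup before wiring up Ollama, Nvidia’s CDI documentation suggests a smoke test with a throwaway container. A sketch (the guard makes it a no-op on machines without Podman; the ubuntu image is just an example):

```shell
# Path where the CDI specification was generated above
CDI_SPEC=/etc/cdi/nvidia.yaml

# Run nvidia-smi in a throwaway container via the CDI device; it should
# print your graphics card model when everything is wired up correctly.
if command -v podman >/dev/null 2>&1; then
  podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable \
    docker.io/ubuntu nvidia-smi -L
fi
```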
Configuring the Ollama Container
As a next step, the Ollama container shall be created, started and configured.
Start the Ollama Container
The container can be created and started using the following command, adapted from Robert’s blog - Local LLMs on linux with ollama (this can now be executed directly on the host, not in the Podman machine):
$> podman run -d --name ollama --replace --restart=always --device nvidia.com/gpu=all \
--security-opt=label=disable -p 11434:11434 -v ollama:/root/.ollama \
--stop-signal=SIGKILL docker.io/ollama/ollama
Compared to the command in the mentioned blog, the following options were added:

--device nvidia.com/gpu=all
This instructs Podman to use the device that was detected by the Nvidia Container Toolkit.

--security-opt=label=disable
This permits the container to share parts of the host OS.
To summarize, this command starts the Podman container:

- detached, i.e. in the background (-d)
- with the name “ollama” (--name ollama)
- replacing any existing container with the very same name (--replace)
- restarting the container if it exits (--restart=always)
- sharing the Nvidia graphics card with the container (--device nvidia.com/gpu=all --security-opt=label=disable)
- making Ollama’s model port 11434 available to the host (-p 11434:11434)
- mounting the named volume ollama in the container at path /root/.ollama (-v ollama:/root/.ollama); this is handy, because otherwise rebuilding the container would immediately delete all the container’s contents, including the downloaded models!
- letting the container be stopped by just the KILL signal (--stop-signal=SIGKILL)
- based on the image at docker.io/ollama/ollama
Well done, the container is now ready. Let’s set it up in the next step.
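Before moving on, a quick health check doesn’t hurt. Ollama’s root endpoint answers with a short greeting when the server is up; run this in a terminal on the host (the URL assumes the default port mapping from above):

```shell
# Ollama's default endpoint, as mapped by -p 11434:11434 above
OLLAMA_URL="http://localhost:11434"

# -s: silent, -f: fail on HTTP errors. When the container is up, this
# prints "Ollama is running"; otherwise we fall back to a hint.
curl -sf "$OLLAMA_URL" || echo "Ollama not reachable - is the container running?"
```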
Download Models
The container as such is empty; it just hosts the Ollama framework. As a next step, we need to download our desired model. For this, we enter the container with a shell:
$> podman exec -it ollama /bin/bash
We’re now in a Bash terminal in the ollama container. Just execute the ollama pull <model>
command to download your desired model. For a complete list of available models, check out ollama.com - Models. I have been using starcoder2 recently. It’s hard to say which model performs best; that’s up to you to find out!
Since I have support for my Nvidia graphics card, I’ve chosen to use a larger model. So, to download the starcoder2:7b model, use
$> ollama pull starcoder2:7b
This takes some time. To now test your model interactively, just stay in the container’s shell and run
$> ollama run starcoder2:7b
You can now chat with your local model, as you’re used to chat with ChatGPT.
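Besides the interactive chat, the model can also be queried over Ollama’s REST API, the same endpoint Continue will use later. A minimal sketch using the /api/generate endpoint from Ollama’s API documentation (the prompt is just an example):

```shell
# Ollama's generate endpoint (default port from the container setup above)
API_URL="http://localhost:11434/api/generate"

# Ask the model for a one-shot completion; "stream": false returns a single
# JSON object instead of a stream of chunks.
curl -s "$API_URL" -d '{
  "model": "starcoder2:7b",
  "prompt": "Write a function that reverses a string in Python.",
  "stream": false
}' || echo "server not reachable"
```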
Note that you can download as many models as you like! Whenever you start a model, you choose the desired one. Only hard disk space limits you.
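To keep an eye on that disk space, the ollama CLI inside the container offers some housekeeping commands. A sketch (run inside the container’s shell; the guard makes it a no-op where the CLI is not installed):

```shell
# Example model tag to clean up again
MODEL="starcoder2:7b"

if command -v ollama >/dev/null 2>&1; then
  ollama list        # show all downloaded models and their sizes
  ollama rm "$MODEL" # remove a model again to free hard disk space
fi
```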
To verify that your container is actually using the Nvidia graphics card acceleration, the container’s log provides some handy output. It can be accessed from your host computer’s terminal using
$> podman logs ollama
Once the setup was working fine, it showed the following on my computer:
Configuring Visual Studio Code
As a final step, this locally running LLM can now be used in Visual Studio Code via the Continue extension. After the extension’s installation (local installation, not in any container, if you’re developing remotely!), the Continue right bar can be opened and the settings modified using the gear at the bottom right:
Under models, just add an entry for the model you’ve downloaded (you can even add multiple entries, one for each model you’ve pulled):
{
  // this needs to be the exact same name of the model you've pulled
  "model": "starcoder2:7b",
  // this is the name which will show up in the drop-down list in
  // Continue's right bar
  "title": "StarCoder2:7b",
  "completionOptions": {},
  "apiBase": "http://localhost:11434",
  "provider": "ollama"
}
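For orientation, this entry goes into the models array of Continue’s config.json. A minimal sketch of the surrounding file (all other settings omitted; your actual file will contain more):

```json
{
  "models": [
    {
      "model": "starcoder2:7b",
      "title": "StarCoder2:7b",
      "completionOptions": {},
      "apiBase": "http://localhost:11434",
      "provider": "ollama"
    }
  ]
}
```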
You could also add a model using the “Plus” button at the bottom of Continue’s right bar.
Note:
- when adding models, “Autodetect” didn’t really work for me. Therefore, I found editing the JSON directly quite handy
- the model runs locally, therefore this setting is not synced. The Continue extension’s configuration (config.json) is located in your (Windows) user’s profile folder in the subfolder .continue
What are your experiences with Ollama models?
1. this could be avoided by paying a bit more ↩︎