Running an Open Language Model Locally

Host Ollama in Podman on Windows, benefit from an Nvidia graphics chip, and use the model in Visual Studio Code

As part of bbv’s Focus Day 2024 (see LinkedIn post), I started using open large language models (LLMs) locally. While I had tested (and liked!) GitHub Copilot in 2022, it has some drawbacks for me: it costs a monthly subscription fee and - more critically for me - transmits content to the cloud [1]. Therefore, I was happy to learn about a free alternative: Ollama.

So, this page is about getting Ollama running locally on a Windows computer using only free tools.

You could also use Docker, of course. However, I’m not a particular fan of Docker these days due to its licensing and because of Podman’s advantages: it is rootless and daemonless. So, first, install Podman. I found this page quite helpful for that: Podman Tutorials - Podman for Windows. Just follow the steps until (and including) “Starting Machine”.
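
For reference, once Podman itself is installed, creating and starting the WSL-backed Podman machine boils down to two commands in a Windows command line (a minimal sketch; the tutorial explains the details):

$> podman machine init
$> podman machine start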

This step is only needed in case you want to benefit from LLM acceleration on an Nvidia graphics chip. See here if you want to find out whether your card model is supported by Ollama.
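
If you are unsure which Nvidia card and driver version you have, nvidia-smi (installed together with the Nvidia driver) prints both and can be run directly from a Windows command line:

$> nvidia-smi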

Installing Podman was easy. This step cost me quite some sweat, but I think it’s worth doing to get full hardware acceleration when running your local LLM. Here’s how I did it:

First, the Nvidia Container Toolkit needs to be installed on the host running the Podman containers. In other words, this has to be executed in the WSL distribution created for Podman. The instructions can be found at Nvidia - Installing the NVIDIA Container Toolkit. Note that the Podman machine is based on a Red Hat image, hence follow the instructions using yum!

To enter the Podman machine use the following command (in a Windows command line):

$> wsl -d podman-machine-default

Then install the toolkit:

$> curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
      sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
$> sudo yum install -y nvidia-container-toolkit
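
A quick way to confirm the toolkit landed in the Podman machine is to ask nvidia-ctk for its version:

$> nvidia-ctk --version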

That’s all on this page. Don’t follow the “configuration” instructions! Since we’re using Podman (instead of Docker), a different interface is needed to access the graphics chip’s full capabilities in the container.

Namely, the Container Device Interface (CDI) is used when running a Podman container. The instructions can be found on Nvidia - Support for Container Device Interface. First, the installed Nvidia Container Toolkit is used to generate the CDI specifications, i.e. the list of graphics card capabilities, using:

$> sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

As listed on the Nvidia instructions page, the generated devices should now show the installed card:

$> nvidia-ctk cdi list

In my case, I got the following output:

(Screenshot: output of nvidia-ctk cdi list showing the detected Nvidia card)

If you see similar output, you should now be ready for the Ollama container in the next section.

As a next step, the Ollama container will be created, started, and configured.

The container can be created and started using the following command, adapted from Robert’s blog - Local LLMs on linux with ollama (this is now executed directly on the host, not in the Podman machine):

$> podman run -d --name ollama --replace --restart=always --device nvidia.com/gpu=all \
      --security-opt=label=disable -p 11434:11434 -v ollama:/root/.ollama \
      --stop-signal=SIGKILL docker.io/ollama/ollama

Compared to the command in the mentioned blog, the following options were added:

  • --device nvidia.com/gpu=all
    This instructs Podman to pass the device detected by the Nvidia Container Toolkit through to the container.
  • --security-opt=label=disable
    This disables SELinux label confinement, which permits the container to share the required parts of the host OS (such as the graphics device).

To summarize, this command starts the Podman container

  • detached, i.e. in the background (-d)
  • with the name “ollama” (--name ollama)
  • replacing any existing container with the very same name (--replace)
  • restarting the container if it exits (--restart=always)
  • sharing the Nvidia graphics card with the container (--device nvidia.com/gpu=all --security-opt=label=disable)
  • making Ollama’s model port 11434 available to the host (-p 11434:11434)
  • mounting the named volume ollama into the container at path /root/.ollama (-v ollama:/root/.ollama)
    (this is handy, because otherwise recreating the container would immediately delete all the container’s contents, including the downloaded models!)
  • letting the container be stopped by just the “KILL” signal (--stop-signal=SIGKILL)
  • creating the container based on the image at docker.io/ollama/ollama

Well done, the container is now up and running. Let’s set it up in the next step.
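
As an optional sanity check, you can verify from a Windows terminal that the container is up and that Ollama answers on the published port; the root endpoint of a running Ollama server typically replies with “Ollama is running”:

$> podman ps --filter name=ollama
$> curl http://localhost:11434/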

The container as such is empty; it just hosts the Ollama framework. As a next step, we need to download our desired model. For this, we enter the container with a shell:

$> podman exec -it ollama /bin/bash

We’re now in a Bash terminal in the ollama container. Just execute the ollama pull <model> command to download your desired model. For a complete list of available models, check out ollama.com - Models. I have been using starcoder2 recently. It’s hard to say which model performs best - that’s up to you to find out!

Since I have support for my Nvidia graphics card, I’ve chosen to use a larger model. So, to download the starcoder2:7b model, use

$> ollama pull starcoder2:7b

This takes some time. To test your model interactively, just stay in the container’s shell and run

$> ollama run starcoder2:7b

You can now chat with your local model, just like you’re used to with ChatGPT.
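
To leave the interactive session again, type /bye at the model’s >>> prompt; exit then brings you back out of the container’s shell:

>>> /bye
$> exit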

Note that you can download as many models as you like! Whenever you run a model, you just need to choose the desired one. Only hard disk space limits you.
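
To keep an overview of what is already on disk (and to reclaim space again), Ollama’s CLI offers list and rm; shown here with the model from above as an example (run inside the container’s shell):

$> ollama list
$> ollama rm starcoder2:7b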

To verify that your container is actually using the Nvidia graphics card acceleration, the container’s log provides some handy output. It can be accessed from your host computer’s terminal using

$> podman logs ollama

Once the setup was working fine, it showed the following on my computer:

(Screenshot: output of podman logs ollama)
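
Another possible check - assuming the CDI setup also makes the nvidia-smi tool available inside the container, which is my assumption here - is to run it in the container and look for allocated GPU memory while a model is answering a prompt:

$> podman exec -it ollama nvidia-smi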

As a final step, this locally running LLM can now be used in Visual Studio Code via the Continue extension. After installing the extension (locally, not in any container, if you’re developing remotely!), the Continue right bar can be opened and the settings modified using the gear at the bottom right:

(Screenshot: Continue’s right bar with the settings gear at the bottom right)

Under models, just add an entry for the model you’ve downloaded (you can even add multiple entries, one for each model you’ve pulled):

    {
      // this needs to be the exact same name of the model you've pulled
      "model": "starcoder2:7b",
      // this is the name which will show up in the drop-down list in
      // Continue's right bar
      "title": "StarCoder2:7b",
      "completionOptions": {},
      "apiBase": "http://localhost:11434",
      "provider": "ollama"
    }

You could also add a model using the “Plus” button at the bottom of Continue’s right bar.
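
If Continue can’t reach the model, it can help to test the apiBase endpoint directly. Ollama’s /api/tags endpoint lists the locally available models, so a quick check from a Windows terminal (assuming the container from above is running) looks like this:

$> curl http://localhost:11434/api/tags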

Note:

  • when adding models, the “Autodetect” option didn’t really work for me. Therefore, I found editing the JSON directly quite handy
  • the model runs locally, therefore this setting is not synced. The Continue extension’s configuration (config.json) is located in your (Windows) user’s profile folder in the .continue subfolder (it can be opened directly as shown below)
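
For example, on Windows the file can be opened straight from a command line (path based on the .continue location mentioned above):

$> notepad %USERPROFILE%\.continue\config.json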

What are your experiences with Ollama models?


  1. this could be avoided by paying a bit more