I had an old Radeon RX 5500 sitting around and wanted to see if I could get ollama to use it. It is not officially supported, but there is a way to get it working.
Setup
This was set up on a clean install of Fedora 41. I'm using Open WebUI to access the models, but I won't go into detail on how that is set up or how to install Fedora. Instead, I'll explain how I got ollama using an officially unsupported Radeon RX 5500. This GPU only has 4GiB of RAM, but it works pretty well with smaller models like deepseek-r1:1.5b. Also, I'd like to point out that this setup is not containerized. I've added the versions of all software involved below in the Details section.
Installing Fedora Packages
The first step is to install the ROCm Fedora packages. One reason I chose Fedora is that it generally has up-to-date packages available for install. The ROCm version on Fedora 41 is 6.2, which supports the RX 5500.
larntz@fedora:~$ sudo dnf install rocminfo rocm-clinfo rocm-opencl rocm-hip
The versions that were installed are:
rocminfo-6.2.1-1.fc41.x86_64
rocm-clinfo-6.2.1-5.fc41.x86_64
rocm-opencl-6.2.1-5.fc41.x86_64
rocm-hip-6.2.1-5.fc41.x86_64
We're going to need some info from the output of rocminfo. The system I ran this on has an AMD CPU with integrated graphics, so I needed to pull information from rocminfo in order to configure ollama to use the discrete GPU.
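If you don't want to read the full report, a quick grep should surface just the fields we care about (a rough sketch; the labels match the output on my system but could vary between ROCm versions):
larntz@fedora:~$ rocminfo | grep -E 'Agent|Name:|Node:'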
On my system the RX 5500 was Agent 2:
*******
Agent 2
*******
Name: gfx1010
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 5500
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
<snip>
The important bits of info here are the name gfx1010 and the node 1. We need the name in order to select the correct driver, and the node if we want to force ollama to use this GPU.
Installing Ollama
The next thing I did was install ollama. The easiest way to do this is to run the command on the ollama download page. This is a curl command piped to a shell, but there are also manual install instructions. For testing purposes I chose to run the curl command.
larntz@fedora:~$ curl -fsSL https://ollama.com/install.sh | sh
Installing ollama this way will set it up to start automatically via systemd. Looking at the ollama logs (sudo journalctl -u ollama) at this point will show the following warning:
level=WARN source=amd_linux.go:378 msg="amdgpu is not supported (supported types:[gfx1030 gfx1100 gfx1101 gfx1102 gfx900 gfx906 gfx908 gfx90a gfx940 gfx941 gfx942])"
Not what we want. To fix this we'll need to set a couple of environment variables in the systemd unit and create a symlink so ollama's bundled rocBLAS falls back to a compatible library (gfx1030) for our gfx1010 card.
larntz@fedora:~$ sudo ln -s /usr/local/lib/ollama/rocblas/library/TensileLibrary_lazy_gfx{1030,1010}.dat
larntz@fedora:~$ ls -l /usr/local/lib/ollama/rocblas/library/TensileLibrary_lazy_gfx1010.dat
lrwxrwxrwx. 1 root root 69 Jan 23 13:47 /usr/local/lib/ollama/rocblas/library/TensileLibrary_lazy_gfx1010.dat -> /usr/local/lib/ollama/rocblas/library/TensileLibrary_lazy_gfx1030.dat
Configuring Ollama
Next we need to edit the systemd unit for ollama. We need to tell it which driver to use and, in my case, which GPU to use. The systemd unit file is located at /etc/systemd/system/ollama.service
. To specify the driver to use set this environment variable HSA_OVERRIDE_GFX_VERSION=10.1.0
. In my specific case I am also setting ROCR_VISIBLE_DEVICES=1
to tell ollama which GPU to use. I’ve also added OLLAMA_HOST=0.0.0.0
so the ollama API is available over the network.
This is my ollama.service systemd unit:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/larntz/.local/bin:/home/larntz/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin"
Environment="HSA_OVERRIDE_GFX_VERSION=10.1.0"
Environment="ROCR_VISIBLE_DEVICES=1"
Environment="OLLAMA_HOST=0.0.0.0"
[Install]
WantedBy=default.target
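As an aside, rather than editing the unit file directly (an ollama upgrade may rewrite it), the same variables can live in a systemd drop-in. This is a standard systemd alternative, not something I tested for this post:
larntz@fedora:~$ sudo systemctl edit ollama
Then add the environment variables to the override file it opens:
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.1.0"
Environment="ROCR_VISIBLE_DEVICES=1"
Environment="OLLAMA_HOST=0.0.0.0"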
Now, after running sudo systemctl daemon-reload and sudo systemctl restart ollama, you should find the following line in the ollama logs (sudo journalctl -u ollama).
level=INFO source=types.go:131 msg="inference compute" id=0 library=rocm variant="" compute=gfx1012 driver=0.0 name=1002:7340 total="4.0 GiB" available="3.6 GiB"
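If you want to double-check that inference actually lands on the card, one option is to follow the logs while a model loads and watch for ROCm buffer allocations (the grep pattern is my own suggestion, not anything from the ollama docs):
larntz@fedora:~$ sudo journalctl -u ollama -f | grep -i rocm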
Details
- GPU: Radeon RX 5500 (4GiB)
- CPU: AMD Ryzen 5 5600G with Radeon Graphics (6 core, 12 threads)
- Distro: Fedora 41 Workstation
- Installed Fedora packages: rocminfo, rocm-hip, rocm-clinfo, rocm-opencl
- Ollama: 0.5.7
- Open WebUI: 0.5.6
Running Models
Given the limited memory on this card it's best to stick with small models. When reviewing the available models you'll notice a drop-down that lists the available model sizes. You'll want to choose smaller models that can fit in the 3.6GiB of memory available on the card. You can use larger models, but ollama will only load some of the layers onto the GPU, which will result in slower responses.
I did a quick test using three models, giving each one the same prompt: "write a fizz buzz function in go".
llama3.2:latest
This is a 3b-parameter model that is about 2GiB in size.
write a fizz buzz function in go
The response was generated at 44 tokens per second and took about 4 seconds.
Using the CPU the response was 19 tokens per second in 13 seconds.
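These rates come from ollama's own stats; to reproduce them, passing --verbose to ollama run should print timing details (including the eval rate in tokens per second) after each response:
larntz@fedora:~$ ollama run llama3.2:latest --verbose 'write a fizz buzz function in go'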
The result was:
package main

import "fmt"

// Function to print the FizzBuzz sequence up to n
func fizzbuzz(n int) {
    for i := 1; i <= n; i++ {
        if i % 15 == 0 {
            fmt.Printf("FizzBuzz\n")
        } else if i % 3 == 0 {
            fmt.Printf("Fizz\n")
        } else if i % 5 == 0 {
            fmt.Printf("Buzz\n")
        } else {
            fmt.Printf("%d\n", i)
        }
    }
}

func main() {
    // Test the FizzBuzz function
    fizzbuzz(100)
}
deepseek-r1:1.5b
This model has 1.5b parameters and is only 1.1GiB in size.
write a fizz buzz function in go
The response from this model was 57 tokens per second and took about 20 seconds, but it included a lot of explanation about what fizz buzz is and how to approach solving the problem. The deepseek-r1 models seem to do a lot of "thinking" before responding (15-20s per response).
Using the CPU the response was 36 tokens per second in 47 seconds.
package main

import "fmt"

func fizzbuzz(n int) {
    for i := 1; i <= n; i++ {
        if i % 15 == 0 {
            fmt.Printf("FizzBuzz")
        } else if i % 3 == 0 {
            fmt.Printf("Fizz")
        } else if i % 5 == 0 {
            fmt.Printf("Buzz")
        } else {
            fmt.Printf("%d", i)
        }
    }
}

func main() {
    fizzbuzz(100)
}
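For what it's worth, one way to force a CPU-only run for this kind of comparison (my suggestion, not necessarily how the numbers above were produced) is to zero out the num_gpu parameter in an interactive session:
larntz@fedora:~$ ollama run deepseek-r1:1.5b
>>> /set parameter num_gpu 0
>>> write a fizz buzz function in go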
deepseek-r1:7b
This model is too large to fit in the GPU RAM at 4.7GiB. When loading the model, ollama will output logs showing that it is not fully loaded onto the GPU.
Jan 23 16:09:32 fedora ollama[59636]: llm_load_tensors: offloading 17 repeating layers to GPU
Jan 23 16:09:32 fedora ollama[59636]: llm_load_tensors: offloaded 17/29 layers to GPU
Jan 23 16:09:32 fedora ollama[59636]: llm_load_tensors: CPU_Mapped model buffer size = 2180.12 MiB
Jan 23 16:09:32 fedora ollama[59636]: llm_load_tensors: ROCm0 model buffer size = 2280.33 MiB
write a fizz buzz function in go
The response from this model was quite a bit slower at 11 tokens per second, and it took almost 3 minutes to finish its response.
Using the CPU this model responded at 9 tokens per second and also took about 3 minutes to complete the response.
package main

func FizzBuzz(n int) {
    for i := 1; i <= n; i++ {
        if i%15 == 0 {
            print("FizzBuzz")
        } else if i%3 == 0 {
            print("Fizz")
        } else if i%5 == 0 {
            print("Buzz")
        } else {
            print(i)
        }
    }
}

func main() {
    FizzBuzz(15)
}
Conclusion
Using this GPU for small models is roughly twice as fast as the CPU alone. If you have a use case for small models this setup would probably work relatively well, and there's a decent speed increase for a relatively old GPU. Larger models that do not fit in the GPU RAM will likely not see much of a speed boost.