gfx1151 (Radeon 8060S / Strix Halo): Vulkan capability/perf mismatch vs RADV in llama.cpp workloads

### Problem description

On an AMD Ryzen AI Max+ 395 (gfx1151, Radeon 8060S), the AMDVLK path appears to expose a lower Vulkan capability profile to llama.cpp than RADV does, and this coincides with a major prompt-processing (pp) regression.

In equivalent llama.cpp Vulkan runs on the same machine:

- The AMD open-source driver (AMDVLK) path reports:
  - `driverID = DRIVER_ID_AMD_OPEN_SOURCE`
  - `driverInfo = 2025.Q2.1 (LLPC)`
  - `shared memory: 32768`
- RADV path reports:
  - `driverID = DRIVER_ID_MESA_RADV`
  - `driverName = radv`
  - `shared memory: 65536`
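
The same fields can be cross-checked outside llama.cpp. The following is a minimal sketch, assuming `vulkaninfo` (from vulkan-tools) is installed and that the `shared memory` value printed by llama.cpp corresponds to the Vulkan limit `maxComputeSharedMemorySize`; `vk_caps` is a hypothetical helper name, not part of any tool.

```bash
# Filter vulkaninfo text output down to the driver identity and the
# compute shared-memory limit. Reads stdin, so it also works on a saved dump.
vk_caps() {
  grep -E 'deviceName|driverID|driverName|driverInfo|maxComputeSharedMemorySize' \
    | sed -E 's/^[[:space:]]+//'
}

# Usage (not executed here):
#   vulkaninfo | vk_caps
#   VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json vulkaninfo | vk_caps
```

Running it once per ICD gives a driver-level comparison that does not depend on how llama.cpp formats its device line.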

This correlates with large prompt-processing deltas (same model/flags):

- Qwen3-Coder-Next 80B-A3B Q4_K_M
  - AMD open-source path: ~378 t/s (pp512)
  - RADV path: ~507–522 t/s (pp512)

Token-generation (tg) differences are smaller; the largest hit is in prompt processing.
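
To put the pp512 gap in relative terms, here is a small sketch that computes the percent change between two throughput figures. `pct_change` is a hypothetical helper name; the inputs are the approximate pp512 numbers quoted above.

```bash
# Percent change of `new` relative to `base`, printed with one decimal place.
pct_change() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

pct_change 378 507   # RADV low end relative to the AMD open-source path
pct_change 378 522   # RADV high end relative to the AMD open-source path
```

With these inputs it prints 34.1 and 38.1, so RADV is roughly 34–38% faster at pp512 in these runs.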

This looks related to AMDVLK behavior already discussed in #413 (allocation limits), but this report is focused on capability/perf mismatch on gfx1151 in normal inference runs.

### Hardware / software

- Machine: GMKtec EVO-X2
- CPU: AMD Ryzen AI Max+ 395
- iGPU: Radeon 8060S (gfx1151, UMA)
- RAM: 128 GB LPDDR5X unified memory
- OS: Fedora 43
- Kernel: 6.18+ and 6.19 tested in community reports
- llama.cpp builds tested: `05fa625ea` and nearby
- Benchmark command shape:
  - `llama-bench -m <model> --n-gpu-layers 99 -p 512 -n 128`

### Steps to reproduce

1. On a gfx1151 system, run the Vulkan benchmark under the default AMD open-source (AMDVLK) path:

```bash
/tmp/llama-vulkan/build/bin/llama-bench \
  -m /path/to/Qwen3-Coder-Next-Q4_K_M.gguf \
  --n-gpu-layers 99 -p 512 -n 128
```

2. Capture the reported Vulkan device line and the benchmark results.

3. Run the same benchmark, forcing the RADV ICD and disabling loader layers:

```bash
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json \
VK_LOADER_LAYERS_DISABLE=all \
/tmp/llama-vulkan/build/bin/llama-bench \
  -m /path/to/Qwen3-Coder-Next-Q4_K_M.gguf \
  --n-gpu-layers 99 -p 512 -n 128
```

4. Compare the device-reported shared memory and the pp/tg results.
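
The comparison in step 4 can be scripted. This is a minimal sketch, assuming each run's output was saved to a log (for example with `2>&1 | tee amdvlk.log`) and that the device line contains a `shared memory: <bytes>` field as quoted above. `shared_mem` and the log file names are hypothetical.

```bash
# Extract the first "shared memory: <bytes>" value from llama.cpp output on stdin.
shared_mem() {
  grep -o 'shared memory: [0-9][0-9]*' | head -n 1 | awk '{ print $3 }'
}

# Usage (not executed here):
#   shared_mem < amdvlk.log
#   shared_mem < radv.log
```

Diffing the two values alongside the pp512/tg128 rows makes it easy to attach both runs to a report.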

### Observed behavior

- The default (AMDVLK) path reports 32 KB (32768 bytes) of shared memory and significantly lower pp (~378 pp512).
- The forced RADV path reports 64 KB (65536 bytes) of shared memory and much higher pp (~507–522 pp512) on the same hardware.

### Expected behavior

- The AMDVLK path on gfx1151 should expose capabilities and limits consistent with the hardware and avoid large performance cliffs relative to RADV for the same Vulkan workload.

### Proposed resolution / asks

1. Verify the Vulkan limits and capabilities that AMDVLK reports on gfx1151 (especially workgroup shared memory and related compute limits).
2. Confirm whether the current AMDVLK path for gfx1151 is expected to report 32 KB in this context.
3. If this is a driver bug or regression, provide a target version for the fix.
4. If this is expected behavior, please document the rationale and recommended mitigations for LLM compute workloads.

### Related issues

- Related AMDVLK allocation-limit issue: #413
- Related llama.cpp thread: https://github.com/ggml-org/llama.cpp/issues/15054