u/CyberShellSecurity

"Exited prematurely but successfully" on mismatched GPUs (Ada 2000 + 3060)

Hey everyone,

I’m trying to get a hot-swapping setup running using llama-swap and llama-server, but I’m hitting a wall. My hardware is a bit of a mixed bag:

  • GPU 0: NVIDIA RTX 2000 Ada (16GB)
  • GPU 1: NVIDIA RTX 3060 (12GB)

I’m trying to host Llama 3.1 8B and Gemma-4 E4B with large context windows (65k and 128k respectively).

The Problem: When the agent (Hermes) tries to call the model, I get: HTTP 502: unable to start process: upstream command exited prematurely but successfully.

It seems like llama-server is choking on one of my flags, printing the help menu, and exiting with code 0, which llama-swap then reports as a "successful" premature exit. I’ve tried tweaking --tensor-split and --flash-attn, but no luck.
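The quickest way I found to check this theory is to run the exact cmd line outside llama-swap and capture the exit code and output myself. A minimal sketch (the probe helper is mine, and I'm using echo as a stand-in for the real llama-server line):

```python
import shlex
import subprocess

def probe(cmd: str) -> tuple[int, str]:
    """Run the upstream command and capture its exit code plus combined output."""
    proc = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

# Stand-in command; paste the real llama-server line here.
code, out = probe("echo usage: llama-server [options]")
print(code)  # exit code 0, i.e. "exited prematurely but successfully"
print(out.strip())
```

If the output is the usage/help text with exit code 0, one of the flags isn't recognized by the installed build.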

My config:

# llama-swap config.yaml
models:
  llama-31-8b:
    cmd: |
      llama-server --port ${PORT} --model /path/to/llama3.1.gguf -ngl 99 -c 65000 --tensor-split 0,1 -ctk q8_0 -ctv q8_0
  gemma-4/E4B-it-BF16:
    cmd: |
      llama-server --port ${PORT} --model /path/to/gemma4.gguf -ngl 99 -c 128000 -sm graph --tensor-split 16,12 -ctk q8_0 -ctv q8_0

Has anyone run into this "successful exit" crash before? Am I missing a mandatory flag for Llama 3.1 or Gemma-4 in the latest builds?

Here are all the models I have but haven't configured yet:

DeepSeek-V2-Lite.Q8_0.gguf
LFM2-24B-A2B.Q8_0.gguf
Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf
Qwen3.5-9B-Q6_K.gguf
Qwen3.5-9B-Q8_0.gguf
Qwen3.5-9B-UD-Q6_K_XL.gguf
Qwen3.6-27B-Q6_K.gguf
bge-large-en-v1.5.Q8_0.gguf
gemma-4-26B-A4B-it-UD-Q6_K.gguf
gemma-4-E2B-it-BF16.gguf
gemma-4-E4B-it-BF16.gguf