u/Thrumpwart

▲ 4 r/prolog

9950X3D2 SWI-Prolog Benchmarks

So I got myself an AMD 9950X3D2 with 3D V-Cache on both dies.

This thing is pretty fast...

terminal: ~/bench$ swipl run.pl

Program              Time     GC
――――――――――――――――――――――――――――――――
boyer               0.330  0.029
browse              0.302  0.000
chat_parser         0.333  0.000
crypt               0.364  0.000
derive              0.355  0.000
fast_mu             0.322  0.000
flatten             0.307  0.000
log10               0.322  0.000
meta_qsort          0.321  0.000
mu                  0.335  0.000
nand                0.339  0.000
nreverse            0.412  0.000
ops8                0.315  0.000
perfect             0.329  0.000
poly_10             0.368  0.000
prover              0.328  0.000
qsort               0.301  0.000
queens_8            0.332  0.000
query               0.333  0.000
reducer             0.325  0.000
sendmore            0.353  0.000
serialise           0.302  0.000
simple_analyzer     0.331  0.000
tak                 0.342  0.000
times10             0.368  0.000
divide10            0.396  0.000
unify               0.312  0.000
zebra               0.369  0.000
sieve               0.365  0.000
queens_clpfd        0.321  0.000
pingpong            0.383  0.000
fib                 0.552  0.000
moded_path          0.510  0.000
det                 0.498  0.000
eval                0.363  0.000
average             0.355  0.001

NReverse benchmark

--- Naive Reverse Benchmark (10000 items) ---

Time taken: 0.639 seconds

Total Inferences: 50,015,162

LIPS: 78244493.92
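For anyone who wants to sanity-check those figures: with the textbook two-clause nrev/append definition, reversing an n-element list costs (n+1)(n+2)/2 logical inferences, and LIPS is just inferences divided by elapsed seconds. Here is a minimal Python sketch of that arithmetic. It assumes the benchmark uses the classic definition (the source of the benchmark is not shown here), and the function names are made up for illustration.

```python
def nrev_inferences(n):
    """Inference count for textbook naive reverse on an n-element list.

    Assumes the classic definition:
        nrev([], []).
        nrev([H|T], R) :- nrev(T, RT), append(RT, [H], R).
    nrev/2 is called n + 1 times and append/3 is called n(n + 1)/2 times,
    which adds up to (n + 1)(n + 2) / 2.
    """
    return (n + 1) * (n + 2) // 2


def lips(inferences, seconds):
    """Logical inferences per second."""
    return inferences / seconds


if __name__ == "__main__":
    expected = nrev_inferences(10_000)  # 50,015,001 for 10000 items
    reported = 50_015_162               # "Total Inferences" from the run above
    print("theoretical inferences:", expected)
    print("extra inferences in the run:", reported - expected)
    # 0.639 s is the rounded time, so this only approximates the reported LIPS.
    print("LIPS from rounded time:", lips(reported, 0.639))
```

The reported total is 161 inferences above the theoretical count, which is presumably harness overhead such as building the list. The LIPS value recomputed from the rounded 0.639 s lands within about 0.1% of the reported one, so the printed time was most likely rounded from a slightly longer measurement.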

reddit.com
u/Thrumpwart — 6 days ago

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call attention drift: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently-generated tokens. We observe this across both EAGLE3 drafters and MTP heads, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter's hidden state magnitude grows monotonically with chain depth, which exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than as a standalone autoregressive predictor. In order to limit the growth, we propose two architectural changes: post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to 2× under template perturbation, 1.18× on long-context tasks, and 1.10× on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize over longer drafting sequences.
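To make the "un-normalized residual path" idea a bit more concrete, here is a toy Python sketch. It is not the paper's architecture or code: random Gaussian updates stand in for whatever a drafter step adds to the hidden state, the RMSNorm has no learned gain, and every name in it is hypothetical. It only illustrates that a state carried through repeated residual additions keeps growing in magnitude, while normalizing after each step pins it near unit RMS.

```python
import math
import random


def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def rms_norm(v, eps=1e-6):
    """RMSNorm without a learned gain: rescale v to roughly unit RMS."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]


def chain_rms(depth, dim=256, normalize=False, seed=0):
    """Return the hidden-state RMS after each of `depth` toy chain steps.

    Each step applies a residual update h <- h + u, where u is a random
    vector standing in for one drafter step (an assumption made purely
    for illustration). With normalize=True the state is passed through
    rms_norm after every step, in the spirit of a post-norm design.
    """
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    history = []
    for _ in range(depth):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h = [a + b for a, b in zip(h, u)]
        if normalize:
            h = rms_norm(h)
        history.append(rms(h))
    return history


if __name__ == "__main__":
    for step, (raw, normed) in enumerate(
        zip(chain_rms(16), chain_rms(16, normalize=True)), start=1
    ):
        print(f"step {step:2d}  un-normalized RMS {raw:6.3f}  normalized RMS {normed:6.3f}")
```

In this toy setup the un-normalized RMS grows roughly like the square root of the step count, since independent updates accumulate. The real drafter dynamics described in the abstract are learned and need not follow that curve.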

arxiv.org
u/Thrumpwart — 8 days ago

I look forward to the Local LLM community getting llama.cpp to run on these. Could be a good value.

u/Thrumpwart — 21 days ago
▲ 282 r/LocalLLM +1 crossposts

Came across hipfire the other day. It's a brand new inference engine focused on all AMD GPUs (not just the latest).

GitHub.

It uses a special mq4 quantization method. The hipfire creator is pumping out models on Hugging Face.

I don't know enough about quantization to know how good these quants are in terms of quality, but as an RDNA3 aficionado I'm happy AMD is getting some attention.

Localmaxxing is a new LLM benchmarking site that shows some pretty dramatic speedups for hipfire inference.

Edit: I should have just said hipfire - I don't think this is connected to AMD officially.

u/Thrumpwart — 22 days ago