r/prolog
9950X3D2 SWI-Prolog Benchmarks
So I got myself an AMD 9950X3D2 with 3D V-Cache on both dies.
This thing is pretty fast...
terminal: ~/bench$ swipl run.pl
Program Time GC
――――――――――――――――――――――――――――――――
boyer 0.330 0.029
browse 0.302 0.000
chat_parser 0.333 0.000
crypt 0.364 0.000
derive 0.355 0.000
fast_mu 0.322 0.000
flatten 0.307 0.000
log10 0.322 0.000
meta_qsort 0.321 0.000
mu 0.335 0.000
nand 0.339 0.000
nreverse 0.412 0.000
ops8 0.315 0.000
perfect 0.329 0.000
poly_10 0.368 0.000
prover 0.328 0.000
qsort 0.301 0.000
queens_8 0.332 0.000
query 0.333 0.000
reducer 0.325 0.000
sendmore 0.353 0.000
serialise 0.302 0.000
simple_analyzer 0.331 0.000
tak 0.342 0.000
times10 0.368 0.000
divide10 0.396 0.000
unify 0.312 0.000
zebra 0.369 0.000
sieve 0.365 0.000
queens_clpfd 0.321 0.000
pingpong 0.383 0.000
fib 0.552 0.000
moded_path 0.510 0.000
det 0.498 0.000
eval 0.363 0.000
average 0.355 0.001
NReverse benchmark
--- Naive Reverse Benchmark (10000 items) ---
Time taken: 0.639 seconds
Total Inferences: 50,015,162
LIPS: 78244493.92
New optimization idea for N-Prolog
I’ve been experimenting with a new optimization idea for N-Prolog.
The basic idea is:
- detect predicates that are effectively functional,
- infer their input/output modes statically,
- and compile them into simpler C-style function calls.
As a first experiment, I manually wrote optimized C code using N-Prolog’s embedded C feature and compared the performance against the normal predicate execution path.
Benchmark: qsort repeated 10,000 times.
N-Prolog 5.19:
?- measure(run(qsort,10000)).
Elapsed Time=0.326355 (second) 15749722(LIPS)
SWI-Prolog:
?- time(run(qsort,10000)).
% 6,160,005 inferences, 0.358 CPU in 0.358 seconds
% (100% CPU, 17189389 Lips)
The optimized version became roughly 3–4x faster than the ordinary predicate-oriented execution path.
The motivation is that many Prolog programs contain predicates that are logically relational, but operationally behave almost like functions. qsort is a typical example.
If such predicates can be detected automatically through mode inference and static analysis, they may be compiled into much simpler execution paths while still preserving ordinary Prolog semantics for general predicates.
I’m currently working on the design of the mode inference engine itself.
Notes and experimental code are in mode.pl if anyone is interested.
I’d also be interested to hear whether similar ideas were explored historically in Prolog systems beyond the usual Mercury-style mode systems.
The Road to Speeding Up N-Prolog
Recently I had an interesting realization while practicing drums.
I’ve been trying to play Burn like Ian Paice, and I discovered that fast drumming is not about brute force. The key is eliminating unnecessary motion and using rebound efficiently.
Oddly enough, this led me to rethink optimization in my Prolog compiler.
About a year ago I introduced TCO into N-Prolog and achieved decent performance improvements for nqueens. However, qsort still resisted optimization because predicates like partition/4 appear relational:
partition([X|L], Y, [X|L1], L2) :-
X < Y, !, partition(L, Y, L1, L2).
But then I realized something important:
These predicates are only partially relational.
For example:
?- partition(X,2,[1],[2,3]).
X = [1,2,3].
This “reverse execution” works only incompletely. In practice, predicates like this behave much more like one-way functions than true relations.
I started calling them mut (“mutants”): predicates syntactically written in Prolog form, but whose computation direction is essentially fixed.
That opens an interesting possibility.
Instead of compiling everything into general relational machinery with full unification/backtracking overhead, mut predicates might be statically analyzed and translated into simple C functions.
In other words:
- ordinary relational predicates remain Prolog
- function-like predicates become optimized function code
I suspect operators such as <, arithmetic constraints, and mode usage may allow automatic direction inference.
Ironically, I originally tried to overcome this performance wall with parallelism. Now I’m beginning to think that removing unnecessary generality may be the more important optimization.
Drumming and compiler optimization turned out to share the same lesson:
eliminate unnecessary movement.