r/ScientificComputing

▲ 7

Two identical MPI jobs slow down drastically when run simultaneously on Intel Alder Lake, but not on a Threadripper. Is this normal?

Hi everyone,

I regularly run multiple parallel MPI jobs simultaneously on my workstations. I have two systems:

  • Intel i7-12700 (12 cores: 8 P-cores + 4 E-cores), OS: Ubuntu 20.04
  • AMD Threadripper 3960X (24 cores, 48 threads), OS: Ubuntu 18.04

I wrote a simple C++ MPI test program that runs with mpirun -np 2. On both machines, a single instance finishes in about 12 seconds.

The problem appears when I run two instances at the same time (both mpirun -np 2):

  • Threadripper: Both finish in ~12 seconds (no slowdown)
  • Intel: Both take ~30 seconds (significant slowdown)

I tried pinning processes to specific cores using taskset and --cpu-set in mpirun. The processes do land on the correct cores (I verified with ps), but the slowdown persists.

Is this expected behavior for Alder Lake? Could the hybrid P-core/E-core architecture be causing memory bandwidth contention? Or am I missing something else?

I'm trying to figure out if my Intel system is performing normally or if I should be hunting for a configuration issue.

Additional notes:

  • My code shows reasonable speed-up with increasing core counts on both systems
  • The Intel PC has only one memory stick (so memory runs single-channel)
  • The AMD PC has multiple memory sticks
  • My test code is not memory intensive (mostly CPU math)

I can provide more details if needed. I'm not super knowledgeable about CPU architectures, so apologies in advance.

Thanks for any insights!

u/hconel — 5 days ago
▲ 24

A month ago I worked out a kernel-fusion technique that fuses long sequential GPU dispatch chains into a single dispatch. I tested it across six standard compute workloads (Rastrigin, N-body, Monte Carlo Pi, three RL environments, and transformer decoding) and built a public benchmark fleet at gpubench.dev, which now covers 92 unique devices across 7 GPU vendors. Median speedups: 71× on Apple Silicon, 56× on NVIDIA, 20× on phones; peaks: 226× / 402× / 103×. Two preprints carry the headline claims: 720× over PyTorch with CUDA (T4) and 159× over PyTorch with WebGPU (M2), confirmed across CUDA / WebGPU / JAX / Triton. Everything is live at kernelfusion.dev.
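The fusion idea can be sketched in miniature on the CPU (my illustration, not the author's WGSL; function names are made up). On a GPU the win comes from eliminating per-dispatch launch and synchronization overhead plus the intermediate memory round trips; the CPU analogue is collapsing K full passes over an array into one pass that keeps the running value in a register:

```cpp
// A chain of K dependent elementwise steps, done two ways.
#include <vector>

// Unfused: one full read+write pass over the array per step,
// like K separate GPU dispatches with the array as the intermediate.
void chain_unfused(std::vector<float>& x, int steps) {
    for (int s = 0; s < steps; ++s)
        for (float& v : x)
            v = v * 1.0001f + 0.5f;
}

// Fused: one pass; the per-element dependency chain runs in a register,
// like one dispatch whose kernel body contains the whole chain.
void chain_fused(std::vector<float>& x, int steps) {
    for (float& v : x) {
        float r = v;
        for (int s = 0; s < steps; ++s)
            r = r * 1.0001f + 0.5f;
        v = r;
    }
}
```

Both versions apply the same operations in the same per-element order, so the results are bit-identical; only the traffic and launch count differ.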

Once that was built and benchmarked, I mentioned the technique to my brother-in-law — he's a physicist and researcher — and asked him for a real-world target. His answer: radiobiology track-structure simulation. The math underneath cancer radiotherapy planning (proton therapy, FLASH, microdosimetry) and the radiation problem in long-duration spaceflight (cosmic-ray DNA damage budgets for Mars-class missions). He pointed me at Geant4-DNA specifically, because there's decades of published reference data — meaning a port can actually be checked, not just demoed.

I had Claude Code do the migration. After the first runs validated against Geant4-DNA 11.3.0 (CSDA range, energy conservation, ions per primary all within rounding), I asked it to add a 4D viewer. That's the clip above — 50,000 radicals from a single 10 keV electron, scrubbed from t=0 to 1 μs.

The setup: one GPU thread per primary electron, the full interaction chain in one fused compute dispatch, Karamitros 2011 IRT chemistry in a Web Worker, and SSB/DSB scoring on a 21×21 B-DNA fiber grid.

Live: https://webgpudna.com/see

Code (MIT): https://github.com/abgnydn/webgpu-dna

I'm a software engineer, not a radiobiologist. The validation harness is also Claude-generated, so I'm trusting it further than I can independently verify. If anyone wants to look at the WGSL or the comparison harness, I'd value that.

u/Entphorse — 9 days ago
▲ 18

PhysCC: A DSL Compiler for Physics Simulations (SYCL, MPI, AVX2)

I’ve been working on PhysCC, an open-source tool designed to bridge the gap between high-level physics equations and low-level hardware optimization.

The problem: Writing boilerplate for SYCL, MPI, or AVX2 stencils is tedious. The solution: You write a simple equation like u = u + dt * lap(u) and PhysCC generates the optimized backend code.

Key Features:

  • Multi-backend support (Single-core, OpenMP, MPI, SYCL, CUDA).
  • AI-informed pass: It analyzes the PDE type (Hyperbolic, Parabolic, Elliptic) and suggests optimal work-group sizes for Intel Iris Xe.
  • Built-in visualization script for heatmaps.

It’s still a work in progress, but I’d love to hear your thoughts on the codegen or the feature extraction logic!
https://github.com/NikosPappas/PhysCC

u/Pure_Treat6246 — 4 days ago