r/ScientificComputing

▲ 7

Two identical MPI jobs slow down drastically when run simultaneously on Intel Alder Lake, but not on a Threadripper. Is this normal?

Hi everyone,

I regularly run multiple parallel MPI jobs simultaneously on my workstations. I have two systems:

  • Intel i7-12700 (12 cores: 8 P-cores + 4 E-cores), OS: Ubuntu 20.04
  • AMD Threadripper 3960X (24 cores, 48 threads), OS: Ubuntu 18.04

I wrote a simple C++ MPI test program that runs with mpirun -np 2. On both machines, a single instance finishes in about 12 seconds.

The problem appears when I run two instances at the same time (both mpirun -np 2):

  • Threadripper: Both finish in ~12 seconds (no slowdown)
  • Intel: Both take ~30 seconds (significant slowdown)

I tried pinning processes to specific cores using taskset and --cpu-set in mpirun. The processes do land on the correct cores (I verified with ps), but the slowdown persists.

Is this expected behavior for Alder Lake? Could the hybrid P-core/E-core architecture be causing memory bandwidth contention? Or am I missing something else?

I'm trying to figure out if my Intel system is performing normally or if I should be hunting for a configuration issue.

Additional notes:

  • My code shows reasonable speed-up with increasing core counts on both systems
  • The Intel PC has only one memory stick (so memory runs single-channel)
  • The AMD PC has multiple memory sticks
  • My test code is not memory intensive (mostly CPU math)

I can provide more details if needed. I'm not super knowledgeable about CPU architectures, so apologies in advance.

Thanks for any insights!

u/hconel — 5 days ago
▲ 24

A month ago I worked out a kernel-fusion technique that fuses long sequential GPU dispatch chains into a single dispatch. I tested it across six standard compute workloads (Rastrigin, N-body, Monte Carlo Pi, three RL environments, and transformer decoding) and built a public benchmark fleet at gpubench.dev, which now covers 92 unique devices across 7 GPU vendors. Median speedups: 71× on Apple Silicon, 56× on NVIDIA, 20× on phones; peaks: 226× / 402× / 103×. Two preprints carry the headline claims: 720× over PyTorch with CUDA (T4) and 159× over PyTorch with WebGPU (M2), confirmed across CUDA / WebGPU / JAX / Triton. Everything is live at kernelfusion.dev.
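The fusion idea can be sketched in miniature on the CPU (my illustration, not the author's WGSL; function names are made up). On a GPU the win comes from eliminating per-dispatch launch and synchronization overhead plus the intermediate memory round trips; the CPU analogue is collapsing K full passes over an array into one pass that keeps the running value in a register:

```cpp
// A chain of K dependent elementwise steps, done two ways.
#include <vector>

// Unfused: one full read+write pass over the array per step,
// like K separate GPU dispatches with the array as the intermediate.
void chain_unfused(std::vector<float>& x, int steps) {
    for (int s = 0; s < steps; ++s)
        for (float& v : x)
            v = v * 1.0001f + 0.5f;
}

// Fused: one pass; the per-element dependency chain runs in a register,
// like one dispatch whose kernel body contains the whole chain.
void chain_fused(std::vector<float>& x, int steps) {
    for (float& v : x) {
        float r = v;
        for (int s = 0; s < steps; ++s)
            r = r * 1.0001f + 0.5f;
        v = r;
    }
}
```

Both versions apply the same operations in the same per-element order, so the results are bit-identical; only the traffic and launch count differ.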

Once that was built and benchmarked, I mentioned the technique to my brother-in-law — he's a physicist and researcher — and asked him for a real-world target. His answer: radiobiology track-structure simulation. The math underneath cancer radiotherapy planning (proton therapy, FLASH, microdosimetry) and the radiation problem in long-duration spaceflight (cosmic-ray DNA damage budgets for Mars-class missions). He pointed me at Geant4-DNA specifically, because there's decades of published reference data — meaning a port can actually be checked, not just demoed.

I had Claude Code do the migration. After the first runs validated against Geant4-DNA 11.3.0 (CSDA range, energy conservation, ions per primary all within rounding), I asked it to add a 4D viewer. That's the clip above — 50,000 radicals from a single 10 keV electron, scrubbed from t=0 to 1 μs.

The setup: one GPU thread per primary electron, the full interaction chain in one fused compute dispatch, Karamitros 2011 IRT chemistry in a Web Worker, and SSB/DSB scoring on a 21×21 B-DNA fiber grid.

Live: https://webgpudna.com/see

Code (MIT): https://github.com/abgnydn/webgpu-dna

I'm a software engineer, not a radiobiologist. The validation harness is also Claude-generated, so I'm trusting it further than I can independently verify. If anyone wants to look at the WGSL or the comparison harness, I'd value that.

u/Entphorse — 9 days ago
▲ 18

PhysCC: A DSL Compiler for Physics Simulations (SYCL, MPI, AVX2)

I’ve been working on PhysCC, an open-source tool designed to bridge the gap between high-level physics equations and low-level hardware optimization.

The problem: Writing boilerplate for SYCL, MPI, or AVX2 stencils is tedious. The solution: You write a simple equation like u = u + dt * lap(u) and PhysCC generates the optimized backend code.

Key Features:

  • Multi-backend support (Single-core, OpenMP, MPI, SYCL, CUDA).
  • AI-informed pass: It analyzes the PDE type (Hyperbolic, Parabolic, Elliptic) and suggests optimal work-group sizes for Intel Iris Xe.
  • Built-in visualization script for heatmaps.

It’s still a work in progress, but I’d love to hear your thoughts on the codegen or the feature extraction logic!
https://github.com/NikosPappas/PhysCC

u/Pure_Treat6246 — 4 days ago