r/HPC

SoftMig – software GPU slicing for SLURM (no hardware MIG needed, works on any CUDA 12+ GPU)
▲ 73 r/HPC+1 crossposts

We built this at the University of Alberta because we had a pile of L40S, A40, and other GPUs that SLURM couldn't meaningfully slice. Hardware MIG only covers a handful of models, requires draining nodes to reconfigure, and locks you into rigid layouts. Result: full 48GB cards going out for jobs that needed 12GB. Classic HPC waste.

SoftMig is a SLURM-native software slicing layer — a fork of HAMi-core adapted for cluster environments. It enforces per-job memory ceilings and compute throttling via LD_PRELOAD, with prolog/epilog hooks handling the job lifecycle. Works on any CUDA 12+ GPU.
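The post doesn't show SoftMig's actual prolog, so here's a rough sketch of how an LD_PRELOAD-based slicing prolog typically wires this up. Everything here is illustrative: the library path and the SLURM_* slice variables are hypothetical, and the two CUDA_DEVICE_* variables follow upstream HAMi-core naming, which SoftMig may have changed.

```shell
#!/bin/bash
# Hypothetical SLURM prolog fragment (not SoftMig's actual code).

# Slice parameters for this job, e.g. derived from the requested GRES
SLICE_MEM_GB="${SLURM_GPU_SLICE_GB:-12}"    # assumed site variable
SLICE_SM_PCT="${SLURM_GPU_SLICE_SM:-25}"    # assumed site variable

# Inject the interposer into every process the job launches
export LD_PRELOAD=/opt/softmig/lib/libvgpu.so          # hypothetical path
export CUDA_DEVICE_MEMORY_LIMIT="${SLICE_MEM_GB}g"     # memory ceiling
export CUDA_DEVICE_SM_LIMIT="${SLICE_SM_PCT}"          # % of SM time
```

The matching epilog just unsets these, so the limits live and die with the job.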

A 48GB L40S becomes:

  • 1 full GPU
  • 2 × 24GB half-slices
  • 4 × 12GB quarter-slices
  • ...or whatever layout your site defines

Change layouts through SLURM policy. No node drain, no reboot.
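The repo defines the real configuration syntax; purely to illustrate the idea of a site-defined layout, a quarter-slice setup exposed as typed GRES might look like this (all names invented here):

```shell
# gres.conf -- one L40S presented as four 12GB quarter-slices (illustrative)
Name=gpu Type=l40s_12g File=/dev/nvidia0 Count=4

# slurm.conf
NodeName=gpu01 Gres=gpu:l40s_12g:4

# job script: request one quarter-slice
# sbatch --gres=gpu:l40s_12g:1 job.sh
```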

A few things it does that hardware MIG can't:

  • Mix slice sizes on the same GPU (e.g. a half + two quarters on one card)
  • No lost capacity — hardware MIG burns memory to its own infrastructure; SoftMig slices the full pool
  • Compute is sliced too, not just memory — SM access is throttled proportionally per job

Heads up on build/install: The docs are written for Digital Research Alliance of Canada / Compute Canada cluster environments, so if you're deploying elsewhere you may need to adapt things. Claude Code or Cursor work well for navigating the compilation and integration steps if you're not in that ecosystem.

MIT licensed. GitHub: https://github.com/ualberta-rcg/softmig

Happy to answer questions — we've been running v1 in production on Vulcan and v2 is now in testing.

u/VanRahim — 2 days ago
▲ 24 r/HPC

HPC/AI infra: career advice

Hi all

I’m looking for some honest career advice from people working in HPC/AI infrastructure.

Background:

  • ~10 years working with Linux infrastructure, HPC and cloud environments
  • Experience with HPC clusters, schedulers, OpenStack, Kubernetes, Terraform, automation, hybrid cloud, cloudbursting, NVIDIA GPUs (not at scale), etc.
  • Mostly in research/scientific environments
  • Last ~5 years working in consulting, which meant pivoting frequently between projects and technologies depending on customer needs

Because of that, my profile evolved into a mix of:

  • HPC systems
  • cloud/platform engineering
  • Kubernetes/OpenStack infrastructure
  • automation and distributed systems

…rather than being deeply specialized in a single area like GPUs, networking, or schedulers.

Recently I’ve been trying to move toward AI infrastructure/platform engineering roles at product-focused companies, and over the last few months I’ve interviewed with companies like NVIDIA, Mistral AI, NSCALE, etc.

However, I’ve consistently been rejected either at the HR stage or in the technical rounds (mostly the 2nd).

One thing I’m struggling with is understanding whether:

  • my profile is actually relevant for the current AI infrastructure market,
  • or if my background is too “consulting-oriented” (broad, but lacking deep knowledge) compared to what these companies expect.

My recent work has been more Kubernetes/OpenStack/platform-oriented rather than pure bare-metal HPC, although the workloads and environments are still performance-sensitive and research-focused.

I’d appreciate honest feedback from people in similar domains:

  • What gaps do you usually see in profiles like mine?
  • What would you study or build next? (ofc, having access to GPUs at scale is not always easy)
  • Is HPC still a strong niche in the AI era, or should I reposition more aggressively toward cloud/platform engineering?
  • Is breadth from consulting perceived negatively compared to deeper specialization?

I’m especially interested in advice from people working in:

  • AI infrastructure
  • GPU clusters
  • platform engineering
  • large-scale Kubernetes/HPC environments

Thanks!

u/9d0cd7d2 — 3 days ago
▲ 35 r/HPC+1 crossposts

Hi all,

I recently received a small grant of around $6800 to buy a workstation for my lab at the university. I work in computational engineering / numerical methods, mainly CPU-based simulations and algorithms.

I know this is not a huge budget for a high-performance workstation, but I see it as a starting point to slowly build the lab. I’m based in a small island state, so I also need to account for shipping/import costs, meaning the actual budget for the machine itself will probably be a bit less.

At the moment, my work is much more CPU/RAM-heavy than GPU-heavy. So my main requirement is to get as much RAM as possible. I would like to start with at least 128 GB RAM, but if there is a realistic way to get 256 GB within this budget, that would be ideal.

For the CPU, I was thinking along the lines of an AMD Ryzen Threadripper, but I’m open to suggestions. I’m not sure whether it is better to go for a newer/lower-end Threadripper, older higher-core-count workstation parts, or even something else entirely.

For the GPU, I don’t need anything very powerful right now. A basic GPU would probably be enough, as long as the system can be upgraded later. In the future, I may have students working on parallelized versions of the codes, GPU acceleration, or machine learning, but that is not the immediate priority.

A few questions:

  1. What kind of workstation configuration would you recommend for this budget?
  2. Should I prioritize CPU cores, RAM capacity, memory bandwidth, or platform expandability?
  3. Is Threadripper the right direction, or should I consider EPYC / Xeon / used workstation hardware?
  4. What would be the best way to make the system expandable in the future?
  5. If I get additional small grants later, would it make more sense to upgrade this machine with more RAM/GPU, or start adding small compute nodes?

Initially, the workstation will probably be used by two people. Later, after upgrades, it may support more students in the lab.

Any advice on practical configurations, pitfalls, or good upgrade paths would be appreciated.

u/Dependent-Mud-6146 — 11 days ago
▲ 8 r/HPC

I took a postgraduate applied HPC course through my Physics department. It included running code on my university's system; I've done parallelisation (OpenMP, MPI) in C and machine learning (PyTorch etc.). How do I market this properly for the job market? So far I've only gotten interest from 2 job opportunities, so I'm guessing I should do a project involving something like distributed data analysis?

u/EconomistAdmirable26 — 5 days ago
▲ 0 r/HPC

Hi all,

I am very new to the world of HPC, I just want a resource that will let me run some Jupyter notebooks that I'm using for my research faster. I've requested and gotten access to my university's free system but when I try to open a Jupyter Notebook server (with just the basic settings) I'm getting the following error message:

sbatch: error: Batch job submission failed: Unexpected message received

I can't find this error on any forums and I'm not sure why I'm getting it. I think the connection might be timing out (it takes about a minute before giving me the error), but I've tried a couple of different wifi networks and it isn't helping. Has anyone else had this issue?

u/Aware_Inflation7136 — 6 days ago
▲ 0 r/HPC

Command - squeue -u xxxx

  JOBID               PARTITION NAME     USER ST TIME NODES NODELIST(REASON)
  1181523_[22-101%25  ct56      easydock xxxx PD 0:00 1     (Priority)

Command - squeue -p ct56 -t PD --sort=-p,i | wc -l

  192 (and it increases every hour that passes)

Command - sprio -u xxxx

  JOBID    PARTITION USER PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION TRES
  1181523  ct56      xxxx 10007    0    5   0         0       10000     cpu=2,mem=0

It has been stuck for the past few hours. Last night I kept thinking it was a glitch and cancelled it, but that job had already reached an AGE of 15 or 16 by this morning, afaik. The new job is now at AGE 5. Anyway, is there a way to overcome this?

u/ProperInsurance3124 — 12 days ago
▲ 44 r/HPC

If you haven't already heard of Copy.Fail, you're about to. New exploit that gets a local user to root instantly, 100% of the time on affected systems.

https://copy.fail

So far we have found one mitigation. Add this to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub (this is for Rocky 9; modify for your distro):

 initcall_blacklist=algif_aead_init

Update GRUB, then reboot, and the exploit should no longer work.
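On Rocky/RHEL 9, grubby can apply the parameter to every installed kernel without hand-editing grub.cfg; a sketch, to be checked against your own boot setup:

```shell
# Add the parameter to all installed kernels (Rocky/RHEL 9)
sudo grubby --update-kernel=ALL --args="initcall_blacklist=algif_aead_init"
sudo reboot

# After reboot, confirm it took effect
grep -o 'initcall_blacklist=algif_aead_init' /proc/cmdline
```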

If anyone knows better mitigations (or even better, mitigations that don't require a reboot), please post here, as I suspect they'll be popular very quickly...

u/615wonky — 14 days ago
▲ 30 r/HPC

Hi, this was reported to me today

https://github.com/V4bel/dirtyfrag

Currently, vulnerable systems are advised to blacklist the esp4, esp6, and rxrpc modules (obviously, only if that makes sense in your environment).

After unloading the modules, you also have to drop the page cache.
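For reference, the whole mitigation might look something like this on a typical system; verify first that nothing in your stack (IPsec in particular) depends on these modules:

```shell
# Unload the affected modules if they are loaded
sudo modprobe -r esp4 esp6 rxrpc

# Keep them from being autoloaded again
sudo tee /etc/modprobe.d/dirtyfrag-mitigation.conf <<'EOF'
blacklist esp4
blacklist esp6
blacklist rxrpc
install esp4 /bin/false
install esp6 /bin/false
install rxrpc /bin/false
EOF

# Drop the page cache
sudo sync
echo 1 | sudo tee /proc/sys/vm/drop_caches
```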

u/walee1 — 6 days ago
▲ 26 r/HPC

This is a research-focused HPC PhD with strong links to numerical analysis, large-scale simulation, scientific machine learning, and AI-driven computational methods. Projects span areas such as PDE solvers, multiphysics simulation, data-intensive computing, optimization, uncertainty quantification, and scalable algorithms on modern HPC architectures.

The programme is developed jointly with academic departments, research centers, and industrial partners, with an emphasis on real computational challenges and high-impact applications.

Research domains include:

  • scientific computing and numerical methods
  • HPC software and parallel algorithms
  • AI/ML for computational science
  • computational engineering and physics
  • climate, biomedical, and industrial simulation

More information and application details:

https://www.dm.unipi.it/phd-hpsc/call-for-applications-to-the-ph-d-programme-in-hpsc-42nd-cycle/

#HPC #ScientificComputing #ParallelComputing #NumericalAnalysis #ComputationalScience #MachineLearning #PhD

u/__LH — 7 days ago
▲ 15 r/HPC

I am working on the C++ computational core of some CAE software that runs cross-platform and uses Qt for the UI.
I develop primarily on macOS on an M4 Max Studio, with Windows 11 ARM64 and Ubuntu ARM64 VMs hosted by Parallels. I use VS Code on all platforms and clang with LLVM OpenMP (not Apple Clang, which does not support OpenMP).

When doing some benchmarking on macOS I noticed that OpenMP code performed extremely well on, say, a simple benchmark, but on more complex models CPU usage would drop to 25% and solutions took quite a long time. It turned out the OpenMP threads were running only on the 4 slower E-cores instead of the 12 P-cores; I could see that behaviour in Instruments.

I found the solution was the code pattern below: each thread is elevated to a P-core QoS class before doing any expensive work.
I realize you can use OMP_PLACES to force OpenMP onto specific cores, but that's somewhat machine/processor-specific.

#ifdef Q_OS_MACOS   // needs <pthread/qos.h>
#pragma omp parallel if (!omp_in_parallel())
{
    // Promote this thread's QoS so macOS schedules it on a P-core
    pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);
    #pragma omp for schedule(dynamic)
    for (int i = 0; i < n; ++i) { /* ... expensive work ... */ }
}
#endif

Another issue: when my test app was in the background, macOS "App Nap" could force the OpenMP threads onto only the E-cores. This can be avoided with Objective-C code that disables App Nap in the run() of a Worker thread.

void Worker::run()
{
#ifdef Q_OS_MACOS
    // Tell macOS this is user-initiated work so App Nap won't throttle it
    id<NSObject> activity = [[NSProcessInfo processInfo]
        beginActivityWithOptions:NSActivityUserInitiatedAllowingIdleSystemSleep
        reason:@"long CAE computation"];
#endif
    try {
        // ... runFunction_ ...
    } catch (...) { ... }
#ifdef Q_OS_MACOS
    [[NSProcessInfo processInfo] endActivity:activity];
#endif
}
u/Due-Math8225 — 13 days ago