u/Due-Math8225 — reddlx

I am working on the C++ computational core of some cross-platform CAE software that uses Qt for the UI.
I develop primarily on macOS on an M4 Max Mac Studio, with Windows 11 ARM64 and Ubuntu ARM64 VMs hosted by Parallels. I use VS Code on all platforms and clang with LLVM OpenMP (not Apple Clang, which does not support OpenMP).

When benchmarking on macOS I noticed that OpenMP code performed extremely well on, say, a simple benchmark, but when running more complex models the CPU usage would drop to 25% and a solution would take quite a long time. It turned out the OpenMP threads were running only on the 4 slower E-cores instead of the 12 P-cores. I could see that behavior in Instruments.

The solution I found is the code pattern below: each thread raises its QoS class, so that macOS schedules it on a P-core, before doing any expensive work.
I realize that you can use OMP_PLACES to force OpenMP to use only specific cores, but that is somewhat machine/processor specific.

#ifdef Q_OS_MACOS
#pragma omp parallel if (!omp_in_parallel())
{
    pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);
    #pragma omp for schedule(dynamic)
    for(int i=0;i<n;++i){...
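
For anyone who wants to try the pattern in isolation, here is a minimal self-contained sketch. The names `raise_thread_qos` and `axpy` are made up for illustration and are not from my code base, and the guard uses `__APPLE__` instead of `Q_OS_MACOS` so it builds without Qt. It assumes the QoS class is only a hint to the scheduler, so check in Instruments that the threads really end up on P-cores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(_OPENMP)
#include <omp.h>
#endif

#if defined(__APPLE__)
#include <pthread.h>
#include <pthread/qos.h>
#endif

// Hypothetical helper: ask macOS to treat the calling thread as
// user-initiated work. Returns true if the request was accepted,
// false on other platforms or on failure.
inline bool raise_thread_qos()
{
#if defined(__APPLE__)
    return pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0) == 0;
#else
    return false;
#endif
}

// Hypothetical stand-in for an expensive kernel: y += a * x.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();

    // Without -fopenmp the pragmas are ignored and the block runs serially.
#pragma omp parallel if (!omp_in_parallel())
    {
        // Every thread in the team raises its own QoS before the loop.
        raise_thread_qos();

#pragma omp for schedule(dynamic, 4096)
        for (std::size_t i = 0; i < n; ++i) {
            y[i] += a * x[i];
        }
    }
}
```

Calling `axpy(3.0, x, y)` with `x` filled with 2.0 and `y` filled with 1.0 leaves every element of `y` at 7.0. The QoS call sits inside the parallel region on purpose: it applies only to the calling thread, so it has to be made by each OpenMP worker and not once from the main thread.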

Another issue was that when my test app was in the background, macOS "App Nap" could force the OpenMP threads to run only on E-cores. This can be avoided by using Objective-C code to disable App Nap in the run() method of a Worker thread.

void Worker::run()
{
#ifdef Q_OS_MACOS

    id<NSObject> activity = [[NSProcessInfo processInfo]
        beginActivityWithOptions:NSActivityUserInitiatedAllowingIdleSystemSleep
        reason:@"long CAE computation"];
#endif
    try {
        // ... runFunction_ ...
    } catch (...) { ... }
#ifdef Q_OS_MACOS
    [[NSProcessInfo processInfo] endActivity:activity];
#endif
}
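
One possible refinement, sketched below in plain C++ so it compiles anywhere: wrap the end-of-activity call in a small scope guard so it runs on every exit path, including an early return. `ScopeExit` and `run_guarded` are hypothetical names, and the two callbacks only stand in for the `beginActivityWithOptions:` / `endActivity:` calls above. This is an illustration of the idea, not what my Worker class actually does.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper: runs `on_exit` when the guard goes out of scope,
// whether the scope is left normally, by return, or by an exception.
class ScopeExit {
public:
    explicit ScopeExit(std::function<void()> on_exit)
        : on_exit_(std::move(on_exit)) {}
    ~ScopeExit() { if (on_exit_) on_exit_(); }

    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    std::function<void()> on_exit_;
};

// Hypothetical stand-in for Worker::run(). `begin_activity` and
// `end_activity` represent the NSProcessInfo calls; `work` is the
// long computation. Returns false if `work` threw.
bool run_guarded(const std::function<void()>& begin_activity,
                 const std::function<void()>& end_activity,
                 const std::function<void()>& work)
{
    assert(work && "work must be callable");

    if (begin_activity) begin_activity();
    ScopeExit guard(end_activity);  // end_activity runs when guard is destroyed

    try {
        work();
        return true;
    } catch (...) {
        return false;
    }
}
```

In an Objective-C++ (.mm) file the two callbacks would be lambdas that capture the `activity` token and make the real NSProcessInfo calls. The end callback runs from a destructor, so it should not throw.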