I am working on the C++ computational core of a cross-platform CAE application that uses Qt for the UI.
I develop primarily on macOS on an M4 Max Mac Studio, with Windows 11 ARM64 and Ubuntu ARM64 VMs hosted by Parallels. I use VS Code on all platforms, and clang with LLVM OpenMP (not Apple Clang, which does not ship with OpenMP support).
When benchmarking on macOS, I noticed that OpenMP code performed extremely well when solving, say, a simple benchmark, but when running more complex models the CPU usage dropped to 25% and a solution took a long time. It turned out that the OpenMP threads were running only on the 4 slower E-cores instead of the 12 P-cores. I could see that behavior in Instruments.
I found the solution was the code pattern below: each thread raises its own QoS class at the start of the parallel region, which makes it eligible for a P-core before it does any expensive work.
I realize that you can use OMP_PLACES to restrict OpenMP to specific cores, but that is somewhat machine- and processor-specific.
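// Requires <omp.h> and, on macOS, <pthread.h>.
#pragma omp parallel if (!omp_in_parallel())
{
#ifdef Q_OS_MACOS
    // QoS is a per-thread setting, so every thread in the team sets it itself.
    pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);
#endif
    #pragma omp for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        ...
    }
}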
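For anyone who wants to try the pattern in isolation, here is a minimal sketch that wraps the QoS call in a helper so the same source also builds on Windows and Linux. The names requestHighQos and sumOfSquares are hypothetical and only for illustration, and I test __APPLE__ instead of Q_OS_MACOS so the sketch does not depend on Qt. It is not a drop-in replacement for real solver code.

```cpp
#include <cassert>
#if defined(__APPLE__)
#include <pthread.h>   // pthread_set_qos_class_self_np, QOS_CLASS_USER_INITIATED
#endif
#ifdef _OPENMP
#include <omp.h>       // omp_in_parallel
#endif

// Hypothetical helper: ask the OS to treat the calling thread as
// user-initiated work. Returns true only if the request was made and
// succeeded; on platforms without QoS classes it does nothing.
inline bool requestHighQos()
{
#if defined(__APPLE__)
    return pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0) == 0;
#else
    return false;
#endif
}

// Hypothetical workload: sum of i*i for i in [0, n), computed in parallel
// when the code is built with OpenMP enabled.
double sumOfSquares(int n)
{
    assert(n >= 0);
    double total = 0.0;
#pragma omp parallel if (!omp_in_parallel())
    {
        // QoS is per thread, so the call has to happen inside the region.
        requestHighQos();
#pragma omp for schedule(dynamic, 1024) reduction(+ : total)
        for (int i = 0; i < n; ++i)
            total += static_cast<double>(i) * i;
    }
    return total;
}
```

Calling sumOfSquares(1000) while watching the process in Instruments is a quick way to check which cores the team lands on. Built without OpenMP, the pragmas are ignored and the loop simply runs serially.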
Another issue was that when my test app was in the background, macOS "App Nap" could force the OpenMP threads to run only on E-cores. This can be avoided by using Objective-C code to disable App Nap in the run() method of a Worker thread.
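void Worker::run()
{
#ifdef Q_OS_MACOS
    // Declare user-initiated activity so App Nap does not throttle this work.
    // Needs an Objective-C++ (.mm) source file and <Foundation/Foundation.h>.
    id<NSObject> activity = [[NSProcessInfo processInfo]
        beginActivityWithOptions:NSActivityUserInitiatedAllowingIdleSystemSleep
                          reason:@"long CAE computation"];
#endif
    try {
        // ... runFunction_ ...
    } catch (...) { ... }
#ifdef Q_OS_MACOS
    // End the activity so App Nap can apply again once the work is done.
    [[NSProcessInfo processInfo] endActivity:activity];
#endif
}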
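One thing to watch with the begin/end pairing: if run() ever returns early or lets an exception escape, endActivity is skipped. A small RAII guard is one way to make the pairing automatic. The sketch below is plain C++; ScopeExit is a hypothetical name, not a Qt or Apple API, and the Objective-C++ usage in the trailing comment assumes ARC (or an otherwise retained activity object).

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII helper: runs the stored callback when the guard goes
// out of scope, whether the scope ends normally or by stack unwinding.
class ScopeExit {
public:
    explicit ScopeExit(std::function<void()> onExit)
        : onExit_(std::move(onExit)) {}

    ~ScopeExit()
    {
        if (onExit_)
            onExit_();
    }

    // Non-copyable so the callback runs exactly once.
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    std::function<void()> onExit_;
};

// Possible use inside Worker::run() in a .mm file, right after the
// beginActivityWithOptions call (assumption, not tested against Qt):
//
//   ScopeExit endNap([activity] {
//       [[NSProcessInfo processInfo] endActivity:activity];
//   });
```

With the guard in place, the explicit endActivity block at the bottom of run() would no longer be needed.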