u/NeutronHiFi

SuperTinyKernel RTOS ULP example running on STM32F407G-DISC1

There is a recurring pattern I see in embedded power optimization work: developers build an otherwise excellent application, then spend weeks hacking around their RTOS to reduce idle current. They disable SysTick in an ISR, add hand-rolled busy-wait loops, pepper the code with __WFI() in unsafe places, and end up with something that technically saves power but is brittle, hard to maintain, and breaks on every firmware update.

The root problem is that most RTOSes treat power management as an afterthought - a bolt-on feature layered over a scheduler that was designed to always keep the CPU busy. SuperTinyKernel RTOS (STK) was built differently. Ultra-low power scheduling is a first-class design goal, and the APIs reflect that.

This article walks through the exact mechanisms STK exposes for ULP scheduling, how to wire them together correctly, and the thinking behind each design decision.

The Core Insight: The CPU is idle most of the time

On a typical sensor node or IoT device, the workload is overwhelmingly bursty. A task wakes up, does a few microseconds of computation, posts a result somewhere, and goes back to sleep for hundreds of milliseconds. The CPU is actively executing code for perhaps 0.1-1% of wall-clock time.

On an MCU like the STM32F407, a naive round-robin RTOS without power management wastes nearly all of that idle time spinning in a scheduler loop at 168 MHz and full PLL power, consuming 50-100 mA when the application's true steady-state current budget might be under 1 mA.
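To make the gap concrete, here is a small back-of-the-envelope sketch of average current for a duty-cycled system. The function name and the example currents (50 mA in run mode, 0.5 mA asleep) are illustrative assumptions picked from the ranges mentioned above, not measurements.

```cpp
// Average current of a duty-cycled system, in milliamps.
// active_fraction: share of wall-clock time spent in run mode (0.0 .. 1.0).
// run_ma / sleep_ma: current drawn in each state (illustrative inputs only).
inline double AverageCurrentMa(double active_fraction, double run_ma, double sleep_ma)
{
    return active_fraction * run_ma + (1.0 - active_fraction) * sleep_ma;
}
```

With a 1% duty cycle, AverageCurrentMa(0.01, 50.0, 0.5) comes out at roughly 1 mA, and about half of that is already the sleep current. This is why the fraction of time spent asleep matters more than shaving run-mode current.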

STK's answer is a scheduling architecture where every idle period - no matter how short - is an opportunity to cut power, and the RTOS gives you precise, deterministic control over exactly what happens during that idle period.

Pillar 1: The IEventOverrider::OnSleep() hook

The foundation of STK's ULP strategy is a single virtual function:

class LowPowerOverrider final : public stk::IPlatform::IEventOverrider
{
public:
    bool OnSleep(stk::Timeout sleep_ticks) override
    {
        // sleep_ticks = kernel ticks until the next scheduled task wake-up
        // Your power management logic goes here.
        // Return true  → you handled the idle; STK loops back.
        // Return false → hand control back to the platform driver.
        return false; // placeholder: defer to the platform driver until real logic is added
    }
};

OnSleep() is called by the kernel every time all tasks are sleeping simultaneously, including inside the very first idle period, before the first task even runs. The argument sleep_ticks is not an estimate or a hint; it is the exact number of kernel ticks remaining until the earliest scheduled wake-up across all tasks and all timers.

This matters because it allows you to make a precise, data-driven decision about which power state to enter, rather than guessing or using a fixed policy. You register the overrider before starting the scheduler:

static LowPowerOverrider lp_overrider;
kernel.GetPlatform()->SetEventOverrider(&lp_overrider);
kernel.Start(); // never returns in KERNEL_STATIC mode

From that point on, the kernel will never spin-wait in its own idle loop without first offering you the chance to do something smarter.

Pillar 2: Tickless Idle (KERNEL_TICKLESS)

The OnSleep() hook alone still has a problem: if SysTick is firing every millisecond, the CPU wakes up every millisecond regardless of what you do inside OnSleep(). You save power during the brief window between SysTick interrupts, but the wakeup cadence is fixed at 1 kHz.

STK solves this with the KERNEL_TICKLESS flag:

const uint8_t KernelMode = KERNEL_STATIC
                         | KERNEL_SYNC
                         | KERNEL_TICKLESS;

static Kernel<KernelMode,
              4 + stk::time::TimerHost::TASK_COUNT,
              SwitchStrategyRR,
              PlatformDefault> kernel;

When KERNEL_TICKLESS is active, the scheduler suppresses SysTick firings entirely during idle periods. Instead of waking up every millisecond, the CPU stays in whatever sleep state you chose for the full sleep_ticks duration. A single configuration line in stk_config.h enables this:

#define STK_TICKLESS_IDLE (1)

The implication for power is significant. On STM32F407, an ADC-and-LED application where tasks sleep for 1–2 seconds at a time can spend the vast majority of its time at STOP-mode current (sub-milliamp range) rather than at full-run current with periodic wakeups.
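A rough way to see the difference is to count CPU wake-ups per idle window. The helper below is a hypothetical sketch, not part of STK; it assumes the CPU wakes on every tick in ticked mode and exactly once, at the end of the window, in tickless mode.

```cpp
#include <cstdint>

// Number of CPU wake-ups during one idle window of idle_ms milliseconds.
// Ticked mode: one wake-up per tick interrupt. Tickless mode: a single wake-up.
inline uint32_t WakeupsDuringIdle(uint32_t idle_ms, uint32_t tick_hz, bool tickless)
{
    if (idle_ms == 0u) return 0u;
    if (tickless) return 1u;
    return static_cast<uint32_t>((static_cast<uint64_t>(idle_ms) * tick_hz) / 1000u);
}
```

For a 1-second sleep with a 1 kHz tick this gives 1000 wake-ups in ticked mode versus 1 in tickless mode.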

Pillar 3: A three-tier sleep strategy

With sleep_ticks as the decision variable, you can implement a graduated power policy inside OnSleep(). Different sleep durations warrant different hardware sleep modes because each mode has a different entry overhead, wake-up latency, and power saving.

A proven three-tier approach for Cortex-M4:

sleep_ticks ≤  8  →  Tier 1: Scheduler idle — no sleep instruction
sleep_ticks 9–39  →  Tier 2: Cortex-M4 SLEEP (__WFI)
sleep_ticks ≥ 40  →  Tier 3: Cortex-M4 STOP (RTC wakeup, PLL restore)
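The thresholds above can be sketched as a small selector function. The enum, the constant names and the use of int32_t for the tick count are assumptions made for illustration; in a real overrider the result would drive the branches inside OnSleep().

```cpp
#include <cstdint>

enum class SleepTier { SchedulerIdle, Sleep, Stop };

constexpr int32_t kIdleMaxTicks = 8;   // at or below: Tier 1, no sleep instruction
constexpr int32_t kStopMinTicks = 40;  // at or above: Tier 3, STOP mode

// Maps the remaining idle time (in kernel ticks) to one of the three tiers.
inline SleepTier SelectSleepTier(int32_t sleep_ticks)
{
    if (sleep_ticks <= kIdleMaxTicks) return SleepTier::SchedulerIdle;
    if (sleep_ticks < kStopMinTicks) return SleepTier::Sleep;
    return SleepTier::Stop;
}
```

Keeping the thresholds as named constants makes it easy to retune them after measuring real entry and exit overheads on the target board.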

Tier 1 handles very short idle windows where the fixed overhead of entering and exiting a sleep mode (instruction pipeline flush, cache warm-up, interrupt latency) would cost more cycles than it saves. No sleep instruction is issued in this tier; the kernel stays in its scheduler idle path.

Tier 2 covers idle windows long enough to benefit from gating the CPU clock, but short enough that stopping the PLL is not worth it. Issuing __WFI gates the CPU clock until the next interrupt, which is a net win even for a few milliseconds. The PLL remains locked, wake-up latency is a handful of CPU cycles, and the SysTick interrupt fires at the next tick boundary and wakes the CPU normally.

Tier 3 is the real power saver. For idle windows of 40 ms or more, the CPU enters STM32 STOP mode - the deepest sleep state that still preserves SRAM and register contents. The main PLL is powered down. Because SysTick is also suppressed in tickless mode, nothing will wake the CPU prematurely. Instead, a hardware timer (the RTC wakeup timer running on LSI through the RTC/16 prescaler, which survives STOP) is armed to fire just before the scheduled wakeup deadline:

// Inside OnSleep(), before entering STOP mode:
Board::RtcWakeupArm(sleep_ticks);   // arm RTC to fire one tick early
STK_UNUSED(kernel->Suspend());       // stop SysTick accounting
Board::AdcSuspend();                 // cut ADC quiescent current
Board::CpuEnterDeepSleepMode();      // SLEEPDEEP + __WFI; restores PLL on return
Board::RtcWakeupDisarm();
Board::AdcResume();
kernel->Resume(sleep_ticks);        // advance tick counter; prevent time skew
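As a rough illustration of the RTC arming step, the sketch below converts a tick count into a 16-bit wake-up reload value. It is not the Board::RtcWakeupArm() implementation. It assumes the wake-up counter is clocked at the RTC clock divided by 16, that it expires after (reload + 1) counter clocks, and that the caller passes the nominal LSI frequency (about 32 kHz on this family, with a wide tolerance).

```cpp
#include <cstdint>

// Reload value for a 16-bit wake-up counter clocked at rtc_hz / 16, chosen so
// that the timer fires one kernel tick before the scheduled deadline.
// Assumption: the counter expires after (reload + 1) counter clocks.
inline uint16_t WakeupReload(uint32_t sleep_ticks, uint32_t tick_hz, uint32_t rtc_hz)
{
    const uint64_t counter_hz = rtc_hz / 16u;
    const uint64_t ticks = (sleep_ticks > 1u) ? (sleep_ticks - 1u) : 1u; // one tick early
    uint64_t counts = (ticks * counter_hz) / tick_hz;
    if (counts < 1u) counts = 1u;          // never program a zero-length period
    if (counts > 65536u) counts = 65536u;  // clamp to the 16-bit counter range
    return static_cast<uint16_t>(counts - 1u);
}
```

With a 1 kHz kernel tick and a 32 kHz RTC clock, each counter step is 0.5 ms, so a 1000-tick sleep maps to a reload of 1997. Because LSI is not a precision clock, waking one tick early leaves some margin before the real deadline.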

The kernel->Suspend() / kernel->Resume(sleep_ticks) pair is critical and easy to get wrong in a hand-rolled solution. Suspend() stops the SysTick counter cleanly (returning the ticks until the next wakeup). Resume(sleep_ticks) advances the kernel's tick counter by exactly the number of ticks that were spent in STOP mode, so no task sees time drift and all deadlines remain correct. Without this, stk::Sleep() and stk::SleepUntil() would accumulate timing errors across every STOP entry.
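The effect of crediting the slept ticks can be shown with a toy model. MockTickCounter is a hypothetical stand-in written for this explanation; it does not reflect how STK implements Suspend() and Resume() internally.

```cpp
#include <cstdint>

// Hypothetical stand-in for a kernel tick counter.
class MockTickCounter
{
public:
    void Tick() { if (!suspended_) ++ticks_; }  // periodic tick interrupt
    void Suspend() { suspended_ = true; }       // ticks are no longer counted
    void Resume(uint32_t elapsed)               // credit the time spent asleep
    {
        ticks_ += elapsed;
        suspended_ = false;
    }
    uint64_t Now() const { return ticks_; }

private:
    uint64_t ticks_ = 0;
    bool suspended_ = false;
};

// Runs 10 normal ticks, then one deep-sleep period of `slept` ticks,
// and returns the kernel time afterwards.
inline uint64_t TimeAfterDeepSleep(uint32_t slept, bool credit_elapsed)
{
    MockTickCounter k;
    for (int i = 0; i < 10; ++i) k.Tick();
    k.Suspend();
    // ... hardware is in deep sleep here; no tick interrupts arrive ...
    k.Resume(credit_elapsed ? slept : 0u);
    return k.Now();
}
```

In this model, a 1000-tick deep sleep leaves the counter at 1010 when the elapsed ticks are credited, but at 10 when they are not, which is the kind of skew the Suspend()/Resume() pair is meant to prevent.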

Pillar 4: Precise, drift-free task sleeping

Even the best idle policy is wasted if tasks introduce timing drift that forces them to wake more often than necessary. STK offers two sleep primitives:

// Simple: sleep for a fixed duration from "now"
stk::Sleep(stk::GetTicksFromMs(1000));

// Precise: sleep until an absolute tick timestamp
stk::SleepUntil(g_Timeline += stk::GetTicksFromMs(1000));

SleepUntil() is the correct choice for periodic tasks. Consider an LED task that is supposed to switch every 1 second. If the task does a small amount of work before sleeping, a plain Sleep(1000) will drift by the duration of that work on every cycle. Over hundreds of cycles the LED rhythm becomes perceptibly irregular, and more importantly, the task may wake slightly earlier than the STOP mode threshold, causing a Tier 3 idle to degrade to a Tier 2 idle, which negates the deep-sleep savings.

SleepUntil() eliminates this. The task records an absolute timeline tick at startup and increments it on every cycle:

void Run() override
{
    g_Timeline = stk::GetTicks();
    while (true)
    {
        // ... do work ...
        stk::SleepUntil(g_Timeline += stk::GetTicksFromMs(1000));
        // ... hand off to next task ...
    }
}

The wake-up period seen by the scheduler is always exactly 1000 ticks, regardless of how long the work took (as long as the work finishes within the period). The idle window between tasks is fully predictable, which lets the Tier 3 threshold be applied reliably.
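The drift is easy to quantify with two closed-form helpers. Both functions are illustrative and assume the work always takes the same number of ticks and finishes within one period.

```cpp
#include <cstdint>

// Wake-up time of cycle n (1-based) when the task does `work` ticks of work
// and then sleeps for a fixed `period` relative to "now".
inline uint64_t WakeTimeRelative(uint64_t start, uint32_t n, uint32_t period, uint32_t work)
{
    return start + static_cast<uint64_t>(n) * (static_cast<uint64_t>(period) + work);
}

// Wake-up time of cycle n when the task sleeps until an absolute timeline
// that advances by `period` on every cycle.
inline uint64_t WakeTimeAbsolute(uint64_t start, uint32_t n, uint32_t period)
{
    return start + static_cast<uint64_t>(n) * period;
}
```

With 2 ticks of work per cycle, the relative-sleep task is 200 ticks late after 100 cycles, while the absolute-timeline task is still exactly on the 1000-tick grid.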

Pillar 5: Event-driven task architecture with zero busy-polling

The other side of ULP scheduling is what tasks do when they are not sleeping: they must not busy-poll. Every busy-polling loop keeps the CPU in run mode and prevents the scheduler from seeing a true idle window.

STK's synchronization primitives are all designed to block efficiently and without spinning when the condition is not yet met:

  • EventFlags lets multiple tasks coordinate on a bitmask of up to 31 independent flags. A task that needs to wait for another task's output simply calls Wait() with the relevant flag bits. The calling task is descheduled immediately; no CPU time is consumed until the flags are set:

// Task A waits until Task B signals FLAG_SENSOR_READY
uint32_t result = g_Flags.Wait(FLAG_SENSOR_READY, EventFlags::OPT_WAIT_ANY);
if (!EventFlags::IsError(result)) {
    // process sensor data
}

  • PipeT<T, N> (a typed blocking pipe) decouples producers from consumers with zero polling. A log task that processes sensor readings can block indefinitely on the pipe until data arrives, consuming zero CPU and zero SysTick wakeups in the meantime:

// LogTask: blocks until a sample is available, or 10 s elapse
if (g_AdcPipe.Read(sample, stk::GetTicksFromMs(10000))) {
    // process sample
}

  • Semaphore, Mutex, ConditionVariable, and RWMutex all follow the same pattern: callers block without spinning. The scheduler only wakes a task when the resource or condition it is waiting for is actually available.

The combined effect is that tasks are either sleeping (contributing to the idle window) or running for a bounded, predictable burst (doing actual work). Neither state involves the CPU spinning uselessly.
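For the event-flag case, the wake-up condition a scheduler evaluates on behalf of a blocked task boils down to a bitmask test. The sketch below is a generic illustration and not STK's implementation; the WaitMode enum is hypothetical, and the "All" mode is an assumption added for contrast with the wait-any option used in the example above.

```cpp
#include <cstdint>

enum class WaitMode { Any, All };

// Returns true when the currently set flags satisfy the waiter's request.
// Any: at least one wanted bit is set. All: every wanted bit is set.
inline bool FlagsSatisfied(uint32_t current, uint32_t wanted, WaitMode mode)
{
    const uint32_t hit = current & wanted;
    return (mode == WaitMode::All) ? (hit == wanted) : (hit != 0u);
}
```

The waiting task never evaluates this in a loop itself. The check runs only when some other task or ISR sets flags, which is what keeps the waiter out of run mode.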

Pillar 6: Timer-based callbacks without a dedicated task

A common anti-pattern in RTOS applications is using a dedicated task for each periodic activity - an "ADC task" that wakes every 2 seconds, reads a sensor, and immediately goes back to sleep. This consumes a task slot and forces the scheduler to track one more stack and one more wake-up deadline.

STK's TimerHost solves this without a dedicated user task. Periodic callbacks are registered as timer objects; the TimerHost manages them inside its own internal task:

class AdcTimer final : public stk::time::TimerHost::Timer
{
    void OnExpired(stk::time::TimerHost *) override
    {
        AdcSample sample;
        sample.raw       = Board::AdcRead();
        sample.timestamp = stk::GetTicks();
        g_AdcPipe.TryWrite(sample); // non-blocking
    }
};

static AdcTimer adc_timer;
timer_host.Start(adc_timer, stk::GetTicksFromMs(2000), stk::GetTicksFromMs(2000));

The callback executes in the TimerHost's handler task context. It must complete quickly and must not block, but for most sensor-sampling workloads (a few microseconds of ADC conversion time plus a pipe write), this is entirely acceptable and saves a task slot compared to a dedicated task.
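To show why one handler task can serve many periodic timers, here is a minimal stand-alone model. MiniTimer and MiniTimerHost are hypothetical types written for this explanation; they only count expiries instead of invoking callbacks and say nothing about how stk::time::TimerHost is actually implemented.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical periodic timer record.
struct MiniTimer
{
    uint64_t next_deadline = 0;  // absolute tick of the next expiry
    uint32_t period = 0;         // reload interval in ticks
    uint32_t fired = 0;          // how many times the callback would have run
    bool active = false;
};

template <size_t N>
class MiniTimerHost
{
public:
    // Registers a timer that first fires `delay` ticks after `now`,
    // then every `period` ticks. Fails on a zero period or when no slot is free.
    bool Start(uint64_t now, uint32_t delay, uint32_t period)
    {
        if (period == 0u) return false;
        for (MiniTimer &t : timers_)
        {
            if (!t.active)
            {
                t.next_deadline = now + delay;
                t.period = period;
                t.fired = 0;
                t.active = true;
                return true;
            }
        }
        return false;
    }

    // Services every timer whose deadline has passed (single handler context).
    void Process(uint64_t now)
    {
        for (MiniTimer &t : timers_)
        {
            while (t.active && t.next_deadline <= now)
            {
                ++t.fired;
                t.next_deadline += t.period;
            }
        }
    }

    // Earliest pending deadline, or UINT64_MAX when no timer is active.
    uint64_t NextDeadline() const
    {
        uint64_t best = UINT64_MAX;
        for (const MiniTimer &t : timers_)
        {
            if (t.active && t.next_deadline < best) best = t.next_deadline;
        }
        return best;
    }

    uint32_t Fired(size_t index) const { return timers_[index].fired; }

private:
    MiniTimer timers_[N];
};
```

The point of the model is NextDeadline(): however many timers are registered, the handler exposes a single wake-up deadline to the scheduler, so the idle window is bounded by one value rather than by one sleeping task per periodic job.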

Pillar 7: ISR-safe kernel suspend and resume

Some applications need to freeze all task activity in response to an external event, entering a low-power hibernation state until a hardware trigger occurs. STK supports this with ISR-safe Suspend() and Resume():

// In an EXTI ISR (USER button press):
stk::IKernelService *kernel = stk::IKernelService::GetInstance();

if (!g_KernelSuspended) {
    kernel->Suspend();       // stops SysTick; all tasks frozen
    g_KernelSuspended = true;
} else {
    kernel->Resume(0);       // 0 ticks: "woke from indefinite sleep"
    g_KernelSuspended = false;
}

When g_KernelSuspended is set, OnSleep() detects it and enters STOP mode unconditionally without arming the RTC timer, because there is no scheduled task to wake for. The CPU will only exit STOP when an external interrupt fires (in this case, the button EXTI). This gives maximum power saving during a user-initiated pause, and the pattern is fully safe to use from an ISR.
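The decision described above can be sketched as a pure function. IdlePlan and PlanIdle are hypothetical names, and the 40-tick default mirrors the Tier 3 threshold used earlier; the real OnSleep() would act on the result by calling the board-specific routines.

```cpp
#include <cstdint>

struct IdlePlan
{
    bool enter_stop;  // enter STOP mode for this idle period
    bool arm_rtc;     // arm the RTC wake-up timer before entering STOP
};

// Decides how to spend an idle period.
// kernel_suspended: set by the ISR when the user paused the system.
inline IdlePlan PlanIdle(bool kernel_suspended, int32_t sleep_ticks,
                         int32_t stop_threshold_ticks = 40)
{
    if (kernel_suspended) return IdlePlan{true, false};  // wait for EXTI only
    if (sleep_ticks >= stop_threshold_ticks) return IdlePlan{true, true};
    return IdlePlan{false, false};
}
```

A flag shared between an ISR and OnSleep(), such as g_KernelSuspended, would typically be declared volatile or as a std::atomic<bool> so the compiler does not cache its value; how the original example declares it is not shown here.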

Full ULP scheduling checklist:

Achieving true ultra-low power with STK is the result of all these pieces working together:

  1. Enable KERNEL_TICKLESS - stop paying for SysTick interrupts during idle.
  2. Implement IEventOverrider::OnSleep() - own the idle period; choose the correct hardware sleep mode based on sleep_ticks.
  3. Use kernel->Suspend() / kernel->Resume(sleep_ticks) in STOP mode - prevent time skew; maintain tick integrity.
  4. Use stk::SleepUntil() for periodic tasks - drift-free sleeping that keeps idle windows predictable and maximally long.
  5. Use blocking synchronization primitives (EventFlags, PipeT, Semaphore, etc.) - eliminate busy-polling; every waiting task contributes to the idle window.
  6. Use TimerHost for periodic callbacks - avoid dedicated tasks for simple periodic work; keep the task count low.
  7. Suspend peripherals inside OnSleep() - cut quiescent currents (ADC, radio, sensors) for the duration of each sleep period; restore them before returning.

None of these points requires special hardware tricks or microcontroller-specific hacks. The scheduling architecture does the heavy lifting; the hardware-specific parts (which sleep mode to enter, which timer to arm for wake-up) are isolated in OnSleep() and can be ported to any Cortex-M device by replacing a handful of register writes.

Results

On the STM32F407G-DISC1 running the configuration described here (three LED tasks sleeping 1 second each, a temperature sensor sampled every 2 seconds), the CPU spends only a few microseconds per second actively executing code. The rest of the time it is in STOP mode at sub-milliamp system current, waking precisely when a task needs to run and immediately returning to deep sleep when the burst of work is done.

That is what "truly low power" looks like in an RTOS application: not a reduction in run-mode current, but a dramatic reduction in the fraction of time the CPU spends in run mode at all.

STK makes this achievable without fighting the scheduler.

There are dedicated ULP examples which you can check for implementation details:

Repository: SuperTinyKernel RTOS on GitHub

reddit.com
u/NeutronHiFi — 10 days ago

SuperTinyKernel RTOS (STK) recently got a major update that makes it very easy to migrate projects based on the FreeRTOS or CMSIS RTOS2 interface to STK.

Achieved FreeRTOS API coverage is ~99% (only some rarely-used debugging stuff is missing), CMSIS RTOS2 API coverage is full. Full-featured MPU-related API is yet to arrive but by default STK supports privilege separation.

It is now possible to try STK's scheduling strategies in your FreeRTOS/CMSIS RTOS2 firmware without really changing anything in the existing code base. If some of the RTOS2 or FreeRTOS's APIs were limiting your development and you wanted more fine-grained programming tools then probably STK's API is a direct match for that.

STK presents its API as building blocks. This approach makes it possible to achieve an efficient, high-performance firmware design, and it was proved in practice by implementing the FreeRTOS and CMSIS RTOS2 interfaces. From a programming point of view, FreeRTOS looks like a CISC instruction set while STK is RISC: building blocks from which you can build anything, even another RTOS ;)

There are FreeRTOS and CMSIS RTOS2 examples doing similar work and demonstrating API usage via STK's backend.

From a performance perspective, STK's scheduling kernel is more efficient and shows higher throughput of useful payload for tasks (recent benchmark results), so you can get additional performance out of your existing firmware/system design by using STK's wrapper as a drop-in replacement.

So, with the new changes you can try STK through its FreeRTOS or CMSIS RTOS2 wrapper and then transition gradually to its native C++ or C API, which consumes less RAM than FreeRTOS.

u/NeutronHiFi — 18 days ago

If you’ve ever tried using a high-end USB DAC with a game console, you know the struggle: Sony (PS4/5) and Nintendo (NSW1/2) usually stick to the older UAC1 standard, leaving many modern DACs silent or unrecognized. Additionally, the PS4/5 sets a connected USB DAC to a very low, unusable volume level.

Neutron HiFi DAC V1 + Neutron HiFi Isolator V1

The latest Firmware 69 update for Neutron HiFi DAC V1 officially bridges that gap, adding dedicated support for PlayStation 4/5 (PS4, PS5) and Nintendo Switch 1 & 2 (NSW, NSW2) game consoles.

UAC1 mode is selectable with NConfigurator 1.9.2. This allows the DAC to be perfectly recognized by game consoles that don't support the newer UAC2 protocol.

Now you get all Neutron’s hardware-level DSP features active while gaming! Here are some useful insights on how you can benefit from Neutron's DSP during gameplay:

  • Adaptive Loudness Compensation (ALC): This is huge for late-night gaming. As you lower the volume, the DAC automatically adjusts the frequency balance to compensate for human hearing sensitivity. You get full-bodied bass and clear treble even at whisper-quiet levels.
  • Crossfeed (Fatigue Reduction): Crossfeed reduces the extreme stereo separation typical of headphones by subtly blending the left and right channels (simulating how we hear speakers in a room). This makes the audio feel much more natural and significantly reduces ear fatigue during marathon gaming sessions.
  • Surround Sound (Ambiophonics R.A.C.E.): It expands the stereo image, providing a much wider soundstage. In competitive shooters or open-world RPGs, this significantly improves spatial awareness and immersion. Note though, RACE is for external speakers, not headphones.
  • Parametric EQ (PEQ): You can load your favorite AutoEq profiles directly onto the DAC or just adjust tonality to your liking depending on the game you are playing.

DAC V1 is based on a low-power NXP MCU, and its DSP processing is extremely efficient, so it won't kill the NSW's battery in handheld mode.

New UAC1 option:

https://preview.redd.it/12fqegk95ukg1.png?width=1125&format=png&auto=webp&s=3f5e61ab854200d990fa0f10c6994ebfa772c5d6

The PS4/5 volume problem can be solved via the new Fixed Volume option or the existing Apple volume scale option (with this option you get just a -8 dB volume drop):

https://preview.redd.it/1cagt34b5ukg1.png?width=1125&format=png&auto=webp&s=c56ab1f9f23ccbdef3c943eae7f0d0bd61f109c4

Release notes for Firmware 69 and NConfigurator can be found on Neutron Forum.

If you have any questions about using DAC V1 with the mentioned game consoles, or wish to share your experience, you are welcome to join the discussion in this dedicated thread!

u/NeutronHiFi — 3 months ago

Do you know SuperTinyKernel RTOS?

It is a high-performance, bare-metal RTOS designed for resource-constrained environments. STK does not abstract peripherals (so you have to use the vendor's HAL), provides lightweight multitasking for embedded applications (no bloated API, not project-intrusive), and is very easy to use (no complex configuration or board-specific ports are needed). In benchmark tests STK outperforms FreeRTOS by allowing more CPU time for tasks (less overhead from the scheduling logic).

Technical resources:

You can explore the capabilities in detail in the project's GitHub repo, but in brief STK offers:

  • Soft and Hard Real-Time (HRT) support: STK supports preemptive scheduling for “soft real-time” tasks, you can also enable hard real-time mode (KERNEL_HRT) for periodic tasks with guaranteed deadlines.
  • Static and dynamic tasks: Define all tasks at startup (KERNEL_STATIC) or allow tasks to be created and destroyed at runtime (KERNEL_DYNAMIC).
  • Scheduling strategies: Round-Robin (RR), Fixed-Priority (FP, similar to FreeRTOS), Smooth Weighted Round-Robin (SWRR), Rate-Monotonic (RM), Deadline-Monotonic (DM), including Worst Case Reaction Time (WCRT) analysis, Earliest Deadline First (EDF), custom (via ITaskSwitchStrategy).
  • Mixed-criticality: Supports MCAS (2-level) and MCAS4 (4-level) adaptive strategies featuring SWRR-based group scheduling, automatic cascade escalation/recovery, and elastic CPU share adaptation driven by per-group EWMA execution-pressure estimation.
  • Tick or Tickless modes: Fixed-interval periodic interrupts (Tick) for simplicity, or dynamic timer-based wakeups (Tickless, KERNEL_TICKLESS) to maximize CPU sleep duration and power efficiency.
  • Synchronization API: CriticalSection, SpinLock, Mutex, Event, ConditionVariable, Semaphore, Pipe primitives for inter-task, inter-core synchronization. Synchronization is optional via KERNEL_SYNC kernel mode.
  • Memory API: Deterministic, fragmentation-free allocator in stk::memory namespace.
  • Low-power friendliness: STK puts the MCU into a low-power mode when there are no runnable tasks (for example, when every task has called Sleep).
  • Tiny footprint: Minimal C++ abstractions (no STL, no heavy namespaces) keep the kernel small and simple.
  • Safety-critical systems ready: No dynamic heap memory allocation (satisfies MISRA C++:2008 Rule 18-4-1).
  • Portability: Supports ARM Cortex-M and RISC-V RV32 MCUs.
  • Extensibility: C++ interfaces allow easy extensibility of STK, e.g. custom scheduling strategy.
  • Multi-core support: Fully implemented for Cortex-M and RISC-V.
  • C++ and C API: Can be used easily in C++ and C projects.
  • CMSIS-RTOS2 wrapper: Full CMSIS-RTOS2 wrapper maps the standard ARM CMSIS-RTOS2 C API onto STK.
  • FreeRTOS wrapper: Full FreeRTOS wrapper (freertos_stk.cpp) maps the standard FreeRTOS C API onto STK, enabling drop-in migration of existing FreeRTOS codebases with minimal or no application changes.
  • Traceability: Supports tracing of task scheduling with SEGGER SystemView.
  • x86 development mode: Compile & debug your code on a PC before flashing to the MCU, which helps with early testing and unit tests.
  • 100% test coverage: Every source-code line of scheduler logic is covered by unit tests.
  • QEMU test coverage: All repository commits are automatically covered by unit tests executed on QEMU for Cortex-M0 and M4.
  • Open-source License - MIT: Open and completely free for commercial, educational, closed-source, open-source projects.

r/stm32 developers:

There are ready to use STM32 examples for STM32F051 MCU (STM32F0DISCOVERY dev board), STM32F103 (NUCLEO-F103RB dev board), STM32F407 (STM32F4DISCOVERY dev board): https://github.com/SuperTinyKernel-RTOS/stk/tree/main/build/example/project/eclipse/stm

r/RISCV embedded developers:

STK recently gained fully verified support for embedded RISC-V MCUs. Earlier, the implementation was validated against QEMU only. A Raspberry Pi Pico 2 W board in RISC-V mode was used to validate the implementation. There are RPI examples (see below) for the RISC-V architecture.

r/esp32 developers:

If you are willing to go for bare-metal firmware, then STK can provide convenient multi-threading for the latest ESP32 MCUs with RISC-V architecture. There is an example config for ESP32-H2/C6 MCUs: https://github.com/SuperTinyKernel-RTOS/stk/tree/main/stk/src/arch/risc-v

r/raspberrypipico developers:

STK supports both architectures, ARM Cortex-M and RISC-V, so your firmware, if it implements multi-threading with STK, will be ready for a future RISC-V-only MCU. There are Eclipse CDT examples for the Raspberry Pi Pico 2 W board: https://github.com/SuperTinyKernel-RTOS/stk/tree/main/build/example/project/eclipse/rpi

In general, STK offers probably one of the easiest ways to add multithreading to firmware. It is only a thread scheduler and does not offer platform abstraction, so no board-specific porting is needed and you can keep using the BSP of your bare-metal project; STK does not interfere with it.

STK is developed in C++ but it also has C API for easy use in C projects, you do not need to develop your own C++ to C wrapper, see /interop/c in the repo and blinky_c example.

This is a dedicated thread for STK. Please suggest new features, ask questions, and share details of your projects that use STK.

Welcome for a kind and respectful discussion! 🙂

u/NeutronHiFi — 5 months ago