
FlashAttention (FA1–FA4) in PyTorch - educational implementations focused on algorithmic differences
I recently updated my FlashAttention-PyTorch repo so it now includes educational implementations of FA1, FA2, FA3, and FA4 in plain PyTorch.
The main goal is to make the progression across versions easier to understand from code.
This is not meant to be an optimized kernel repo, and it is not a hardware-faithful recreation of the official implementations. The point is to expose the algorithmic ideas and design changes without immediately going deep into CUDA/Hopper/Blackwell-specific details.
Roughly, the repo now shows:
- FA1: tiled online softmax baseline
- FA2: split-Q / query-tile ownership, deferred normalization
- FA3: explicit staged pipeline with ping-pong tile buffers, plus a simplified educational FP8 forward path
- FA4: explicit scheduler with main / softmax / correction phases, and conditional/selective rescaling
So the exact same attention math is preserved; only the orchestration changes from version to version.
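To make the baseline concrete: this is a minimal PyTorch sketch (not code from the repo, names are my own) of the FA1-style tiled online softmax in the first bullet. K/V are processed in tiles while a running row-max and running denominator are maintained, so softmax(QKᵀ)V is computed without ever materializing the full attention matrix. FA1 keeps the output fully normalized after every tile:

```python
import torch

def fa1_tiled_attention(q, k, v, block_size=16):
    # q: (n_q, d), k/v: (n_k, d) -- single head, no masking, for clarity
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))  # running row max
    l = torch.zeros(q.shape[0], 1)                  # running softmax denominator
    o = torch.zeros_like(q)                         # running (normalized) output
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        s = (q @ kb.T) * scale                      # scores for this K/V tile
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                    # tile numerators
        corr = torch.exp(m - m_new)                 # rescale old accumulators
        l_new = l * corr + p.sum(dim=-1, keepdim=True)
        # FA1-style: renormalize the output on every tile
        o = (o * l * corr + p @ vb) / l_new
        l, m = l_new, m_new
    return o
```

The result matches a plain `softmax(q @ k.T * scale) @ v` up to floating-point error, which is the invariant all four versions preserve.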
I wrote it for people who want to understand:
"What actually changed from FA1 → FA2 → FA3 → FA4?""
without having to start from highly optimized CUDA kernels.
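As a concrete example of one such version-to-version change, here is a hedged sketch (again not code from the repo, names are my own) of the FA2-style deferred normalization: instead of dividing by the running denominator on every tile, the loop accumulates an unnormalized output and performs a single division at the end, removing per-tile divisions from the inner loop:

```python
import torch

def fa2_tiled_attention(q, k, v, block_size=16):
    # q: (n_q, d), k/v: (n_k, d) -- single head, no masking, for clarity
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))  # running row max
    l = torch.zeros(q.shape[0], 1)                  # running softmax denominator
    acc = torch.zeros_like(q)                       # unnormalized running output
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        s = (q @ kb.T) * scale                      # scores for this K/V tile
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                    # tile numerators
        corr = torch.exp(m - m_new)                 # rescale old accumulators
        l = l * corr + p.sum(dim=-1, keepdim=True)
        acc = acc * corr + p @ vb                   # no division inside the loop
        m = m_new
    return acc / l                                  # single deferred normalization
```

The final answer is identical to the FA1-style loop; only where the division happens changes, which is exactly the kind of orchestration difference the repo tries to surface.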
Repo: https://github.com/shreyansh26/FlashAttention-PyTorch
I'd be interested in feedback on whether the code makes the version-to-version differences intuitive.