u/Which_Pitch1288 — reddlx

Last week, while doing some research, I read about vibe coders burning through 100 million tokens for just a few dollars, so I wrote an article about it.

I did a deep technical dive into the tools and methods people use for this (pretty much anyone can replicate it), how the process works, and how it's also being used to train smaller models, with some people making millions of dollars in the process.

here's the deep dive on it if anyone is interested

https://x.com/HarshalsinghCN/status/2056626175959826692?s=20

let me know what you think. fair warning: it's a long article, not one for doomscrollers

u/Which_Pitch1288 — 23 hours ago

▲ 26 r/learnmachinelearning

continual learning experiment on tts

running a small experiment.

problem: tiny TTS models like Kokoro 82M forget the old voices the moment you fine-tune them on a new one. classic catastrophic forgetting.

fix: don't fine-tune the whole model. swap one of its layers for a memory bank with ~1M slots. when you add a new voice, only update the ~32 slots that voice actually uses. everything else stays frozen.

old voices: untouched.
new voices: land in empty slots.
you can keep adding forever.
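
a rough sketch of the idea in NumPy, to make the "only update the slots a voice uses" part concrete. this is a toy, not the paper's method and not the actual experiment code: slot selection here is plain top-k by dot product (an assumption), the bank is shrunk to 4096 slots so it runs instantly, and every name (`keys`, `values`, `lookup`, `sparse_update`) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# The post talks about ~1M slots; a smaller bank keeps this sketch fast.
NUM_SLOTS, DIM, TOP_K = 4096, 16, 32

keys = rng.standard_normal((NUM_SLOTS, DIM)).astype(np.float32)  # frozen lookup keys
values = np.zeros((NUM_SLOTS, DIM), dtype=np.float32)            # trainable memory values

def lookup(query):
    """Indices and softmax weights of the TOP_K best-matching slots."""
    scores = keys @ query
    top = np.argpartition(scores, -TOP_K)[-TOP_K:]   # unique indices, unordered
    w = np.exp(scores[top] - scores[top].max())
    return top, w / w.sum()

def read(query):
    """Memory output: weighted sum of the selected value rows."""
    top, w = lookup(query)
    return w @ values[top]

def sparse_update(query, target, lr=0.5):
    """One gradient step on 0.5 * ||read(query) - target||^2.

    Only the TOP_K selected rows of `values` are written; all others stay frozen.
    """
    top, w = lookup(query)
    err = w @ values[top] - target           # dL/d(output)
    values[top] -= lr * np.outer(w, err)     # dL/d(values[top]) = w[:, None] * err
    return top

# "Add a new voice": fit one query/target pair and see which rows moved.
before = values.copy()
voice_query = rng.standard_normal(DIM).astype(np.float32)
voice_target = rng.standard_normal(DIM).astype(np.float32)
for _ in range(50):
    touched = sparse_update(voice_query, voice_target)

changed = np.flatnonzero(np.any(values != before, axis=1))
print(len(changed), "of", NUM_SLOTS, "rows changed")
```

every row outside `touched` is bit-for-bit identical to `before`, which is the property the whole approach relies on.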

(porting Lin et al.'s sparse memory finetuning from Meta, originally for LLMs. trying this on tiny TTS.)

wish me luck

u/Which_Pitch1288 — 2 days ago

▲ 293 r/learnmachinelearning

I derived every gradient in GPT-2 by hand and trained it on a NumPy autograd engine I built from scratch

spent a few weeks rebuilding nanoGPT without using PyTorch's .backward() or jax.grad. wrote my own tiny autograd in pure NumPy, derived every backward pass on paper first, verified against PyTorch at every step.

calling it numpygrad

it's basically Karpathy's micrograd, but on tensors and with all the ops a transformer actually needs (matmul, broadcasting, LayerNorm, fused softmax-cross-entropy, causal attention, weight tying).
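
for anyone who hasn't seen the pattern, here's roughly what "micrograd on tensors" looks like. this is a minimal sketch and not numpygrad's actual API: the `Tensor` class, `unbroadcast`, and `backward_fn` are hypothetical names, and only add, 2-D matmul, and sum are covered. the one tensor-specific wrinkle is that a gradient has to be summed back down over any axes NumPy broadcast in the forward pass.

```python
import numpy as np

def unbroadcast(grad, shape):
    """Sum grad over broadcast axes so it matches the operand's original shape."""
    while grad.ndim > len(shape):            # leading axes added by broadcasting
        grad = grad.sum(axis=0)
    for axis, size in enumerate(shape):      # axes stretched from size 1
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

class Tensor:
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self.parents = parents
        self.backward_fn = lambda: None      # leaves have nothing to propagate

    def __add__(self, other):
        out = Tensor(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += unbroadcast(out.grad, self.data.shape)
            other.grad += unbroadcast(out.grad, other.data.shape)
        out.backward_fn = backward_fn
        return out

    def __matmul__(self, other):             # 2-D operands only in this sketch
        out = Tensor(self.data @ other.data, (self, other))
        def backward_fn():
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out.backward_fn = backward_fn
        return out

    def sum(self):
        out = Tensor(self.data.sum(), (self,))
        def backward_fn():
            self.grad += out.grad * np.ones_like(self.data)
        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topological order, then run each node's backward_fn from the output back.
        order, seen = [], set()
        def visit(t):
            if id(t) not in seen:
                seen.add(id(t))
                for p in t.parents:
                    visit(p)
                order.append(t)
        visit(self)
        self.grad = np.ones_like(self.data)
        for t in reversed(order):
            t.backward_fn()

x = Tensor(np.arange(6.0).reshape(2, 3))
W = Tensor(np.ones((3, 4)))
b = Tensor(np.zeros(4))                      # broadcast across the batch axis
loss = (x @ W + b).sum()
loss.backward()
print(b.grad)                                # summed over the 2 batch rows
```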

a few things that genuinely surprised me:

LayerNorm backward has three terms, not two. the variance depends on every input, so there's a cross-term most people miss. lost a full day to a sign error here.
np.add.at is not the same as dW[ids] += dY. the second one silently drops gradients when the same token id appears twice in a batch. which is always.
the softmax + cross-entropy fused gradient is genuinely beautiful — all the fractions cancel and you get (softmax(logits) - one_hot(targets)) / N. derive it on paper at least once in your life.
weight tying matters for backward too. the lm_head and token embedding share a matrix, so gradients from both uses must accumulate into the same buffer. forget this and your embedding gets half the signal.
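
two of these are easy to see in a few lines of NumPy. a small sketch with made-up shapes and names (`dW_buggy`, `softmax_xent` are just for illustration, not taken from the repo): first the duplicate-id problem in the embedding backward, then the fused softmax + cross-entropy gradient with mean reduction over N examples.

```python
import numpy as np

# Embedding backward: token id 2 appears twice in the batch.
ids = np.array([2, 0, 2])
dY = np.ones((3, 4))                 # upstream gradient, one row per token

dW_buggy = np.zeros((5, 4))
dW_buggy[ids] += dY                  # buffered: row 2 is only incremented once

dW = np.zeros((5, 4))
np.add.at(dW, ids, dY)               # unbuffered: duplicate ids accumulate

print(dW_buggy[2])                   # [1. 1. 1. 1.]
print(dW[2])                         # [2. 2. 2. 2.]

def softmax_xent(logits, targets):
    """Mean cross-entropy over N rows, plus its gradient w.r.t. the logits."""
    n = logits.shape[0]
    shifted = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(n), targets].mean()
    dlogits = np.exp(log_probs)                             # softmax(logits)
    dlogits[np.arange(n), targets] -= 1.0                   # minus one_hot(targets)
    return loss, dlogits / n

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 6))
targets = np.array([1, 0, 5, 1])
loss, dlogits = softmax_xent(logits, targets)
```

a handy sanity check: every row of `dlogits` sums to zero, because each softmax row sums to 1 and each one-hot row sums to 1.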

the final check: loaded real GPT-2 124M weights into my NumPy model, ran WikiText-103 and LAMBADA, got the same numbers as PyTorch to every digit (26.57 / 21.67 / 38.00%).

derivations, gradchecks, layer parity tests, training curves all in the repo. if you've ever wanted to actually understand what .backward() is doing, this is the long way around but you come out the other side knowing.
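
if you want a feel for what a gradcheck looks like, here's a minimal one for the LayerNorm backward mentioned above. it's a sketch under assumptions, not the repo's code: the learnable gain and bias are left out, and the function names are hypothetical. the analytic gradient has the three terms (direct path, mean path, variance path) and gets compared against central finite differences.

```python
import numpy as np

def layernorm_forward(x, eps=1e-5):
    """Normalize over the last axis (no gain/bias in this sketch)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * rstd
    return xhat, rstd

def layernorm_backward(dy, xhat, rstd):
    """Three terms: direct path, path through the mean, path through the variance."""
    return rstd * (dy
                   - dy.mean(axis=-1, keepdims=True)
                   - xhat * (dy * xhat).mean(axis=-1, keepdims=True))

def numerical_grad(f, x, h=1e-5):
    """Central finite differences of a scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old                          # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
dy = rng.standard_normal((3, 8))              # stand-in upstream gradient

xhat, rstd = layernorm_forward(x)
analytic = layernorm_backward(dy, xhat, rstd)
# Scalar objective whose gradient w.r.t. x equals the backward pass for this dy.
numeric = numerical_grad(lambda z: (layernorm_forward(z)[0] * dy).sum(), x)

max_err = np.abs(analytic - numeric).max()
print("max abs error:", max_err)
```

dropping the third term from `layernorm_backward` makes the error jump by many orders of magnitude, which is a quick way to convince yourself the cross-term is real.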

https://github.com/harrrshall/numpygrad

u/Which_Pitch1288 — 5 days ago