u/Fit_Sir_5296


I am building an LLM from scratch, no libraries, no shortcuts, and I started from the lowest level possible: character-based tokenization.
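
To make "lowest level possible" concrete, here's a minimal character-level tokenizer sketch in plain Python (my own illustration, not code from the article): every unique character in the corpus gets an integer ID, nothing more.

```python
# Build a character-level vocabulary: one token ID per unique character.
text = "hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> int
itos = {i: ch for ch, i in stoi.items()}       # int -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

print(encode("hello"))          # [3, 2, 4, 4, 5]
print(decode(encode("hello")))  # "hello"
```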

I built a bigram model, applied softmax wrong, caught it myself, and somewhere in that process realized why word-level tokenization was all but abandoned after 2018 in favor of subword methods.
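
Since people always ask what "applied softmax wrong" looks like in practice: the two classic failure modes are skipping the max subtraction (exp overflows) and normalizing over the wrong axis of the bigram matrix. A pure-Python sketch of the correct version (my illustration of the general bug class, not necessarily the exact mistake from the article):

```python
import math

def softmax(row):
    """Turn one row of bigram logits into next-token probabilities."""
    m = max(row)                            # subtract the max for stability;
    exps = [math.exp(x - m) for x in row]   # without it, exp() overflows on large logits
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-token bigram logits: logits[i][j] = score of token j following token i.
logits = [[2.0, 0.5, -1.0],
          [0.0, 1.0,  3.0],
          [1.5, 1.5,  1.5]]

# Normalize each ROW, i.e. the next-token distribution for one context token.
# Normalizing down the columns instead is the classic wrong-axis softmax bug.
probs = [softmax(row) for row in logits]
print(sum(probs[0]))  # ~1.0: each row is a valid probability distribution
```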

GPT does not read words. It reads subwords. Chunks. And the algorithm behind it, BPE (byte pair encoding), was invented in 1994 for compressing files. Not for language. Not for AI.
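
For the curious: the core of BPE really is just a greedy loop that repeatedly merges the most frequent adjacent pair, exactly the idea Philip Gage published for file compression in 1994. A stripped-down sketch (my own simplification; real tokenizers like GPT-2's operate on bytes, merge only within words, and save the learned merge table):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from single characters and greedily build subword chunks.
tokens = list("low lower lowest")
for _ in range(3):
    pair = most_frequent_pair(tokens)
    tokens = merge(tokens, pair, pair[0] + pair[1])
    print(pair, "->", pair[0] + pair[1])
# ('l','o') -> 'lo', then ('lo','w') -> 'low': "low" becomes one chunk.
```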

Wrote the whole thing down honestly, mistakes included.

Link: https://medium.com/towards-artificial-intelligence/071d7f1ab870
GitHub: https://github.com/07Codex07
LinkedIn: https://www.linkedin.com/in/vinayak-sahu-8999a9259

Happy to answer any questions in the comments.
