
Byte-level LM with 284k params reaches 1.15 bpb on full TinyStories after 1 epoch
I’ve been experimenting with a lightweight byte-level language model architecture built around cumulative memory + delta update blocks instead of standard attention-heavy designs.
I trained it on the full TinyStories dataset (~2.2B bytes) for 1 epoch.
Results for the smaller version (~284k trainable params):
- Validation accuracy: 0.7443
- Validation loss: 0.7980
- Validation bits-per-byte: 1.1512
Larger version (~1.09M params):
- Validation accuracy: 0.7636
- Validation loss: 0.7416
- Validation bits-per-byte: 1.0699
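(For anyone checking the numbers: bits-per-byte here is just the cross-entropy loss converted from nats to bits, i.e. loss / ln 2.)

```python
import math

# bits-per-byte = cross-entropy loss (nats/byte) / ln(2)
for name, val_loss in [("284k", 0.7980), ("1.09M", 0.7416)]:
    print(f"{name}: {val_loss / math.log(2):.4f} bpb")
# 284k: 1.1513 bpb   (reported 1.1512; difference is rounding of the printed loss)
# 1.09M: 1.0699 bpb
```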
Architecture characteristics:
- Byte-level (256 vocab)
- Sequence length: 256
- 8 repeated cumulative/delta processing blocks
- Lightweight TensorFlow implementation
- No retrieval system
- Focus on temporal state evolution and cumulative memory dynamics
The core idea is to treat language as an evolving causal state/trajectory rather than as explicit token-to-token retrieval.
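If it helps make that concrete, here's a minimal Keras sketch of the kind of block I mean: a causal cumulative memory (running mean of the sequence so far) gating a learned delta update. This is an illustration of the idea, not my exact block, so the gating scheme and widths are placeholders:

```python
import tensorflow as tf
from tensorflow import keras

class CumulativeDeltaBlock(keras.layers.Layer):
    """Sketch of a cumulative-memory + delta-update block (illustrative only)."""

    def __init__(self, dim=64, **kwargs):
        super().__init__(**kwargs)
        self.norm = keras.layers.LayerNormalization()
        self.delta = keras.layers.Dense(dim)  # proposed state update from the current position
        self.gate = keras.layers.Dense(dim, activation="sigmoid")  # memory decides how much to apply

    def call(self, x):  # x: (batch, time, dim)
        # Causal cumulative memory: running mean of everything seen so far.
        steps = tf.cast(tf.range(1, tf.shape(x)[1] + 1), x.dtype)
        memory = tf.cumsum(x, axis=1) / steps[None, :, None]
        # Delta update: memory gates a proposed change, applied residually,
        # so the representation evolves as a trajectory instead of being recomputed.
        return x + self.delta(self.norm(x)) * self.gate(memory)
```

Stacking 8 of these on a 64-dim byte embedding is roughly the shape of the 284k model in the summary below.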
Still very experimental and only tested on TinyStories so far, but I thought the parameter efficiency was interesting enough to share.
Would love suggestions for harder datasets or useful ablations to test next.
I can post some code if requested. ezpz
Train bytes: 2,227,753,162 | records: 8,668,300 | steps/epoch: 33,860
Valid bytes: 22,502,601 | records: 87,558 | val_steps: 342
33860/33860 ━━━━━━━━━━━━━━━━━━━━ 1887s 55ms/step - accuracy: 0.7341 - bits_per_byte: 1.2041 - loss: 0.8346 - val_accuracy: 0.7443 - val_bits_per_byte: 1.1512 - val_loss: 0.7980
Saved model weights to checkpoints/mora_full_tinystories.weights.h5
Model: "delta_lm_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃        Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ embedding_6 (Embedding) │ (256, 256, 64) │ 16,384 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_48 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_49 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_50 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_51 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_52 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_53 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_54 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_55 (Sequential) │ (256, 256, 64) │ 33,475 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 852,554 (3.25 MB)
Trainable params: 284,184 (1.08 MB)
Non-trainable params: 0 (0.00 B)
Optimizer params: 568,370 (2.17 MB)
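One detail worth noting: 16,384 (embedding) + 8 × 33,475 (blocks) = 284,184, so the output readout adds no trainable parameters of its own. In the sketch below I model it as a tied head that reuses the embedding matrix; that's an illustrative assumption, since the table shows no separate output layer:

```python
import tensorflow as tf
from tensorflow import keras

# Illustrative assembly matching the summary's param accounting:
# 16,384 (embedding) + 8 * 33,475 (blocks) = 284,184 trainable params.
emb = keras.layers.Embedding(256, 64)
blocks = [CumulativeDeltaBlock(64) for _ in range(8)]  # sketch class from above

def byte_logits(byte_ids):  # byte_ids: (batch, time) int32
    h = emb(byte_ids)  # (batch, time, 64)
    for block in blocks:
        h = block(h)
    # Parameter-free readout: score each position against the 256 byte embeddings.
    return tf.einsum("btd,vd->btv", h, emb.embeddings)
```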
Here's an example of what these 284k params can generate:
Loaded weights: checkpoints/mora_full_tinystories.weights.h5
Once upon a time, there was a family who loved to play with the car and said, "Thank you, Mom. I will not see it. She was so happy and thanked the bird fly away. The bird said, "I am sorry, mom. I didn't mean to make the sun was bright and had lots of fun. The bird was not scared anymore.
<|endoftext|>
Once upon a time, there was a little boy named Tim. Tim loved to play with a ball. The bird said, "Yes, I want to
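Until I post the code, here's a minimal sketch of the sampling loop behind these outputs. It assumes the model returns per-position logits; the temperature value and function name are placeholders, not the exact script:

```python
import numpy as np
import tensorflow as tf

def sample_bytes(model, prompt: bytes, max_new: int = 256,
                 seq_len: int = 256, temperature: float = 0.8) -> bytes:
    """Autoregressive byte-level sampling (illustrative, not the exact script)."""
    out = list(prompt)
    for _ in range(max_new):
        context = np.array([out[-seq_len:]], dtype=np.int32)  # (1, T), last seq_len bytes
        logits = model(context, training=False)[0, -1]        # next-byte logits, shape (256,)
        probs = tf.nn.softmax(logits / temperature).numpy().astype(np.float64)
        probs /= probs.sum()  # renormalize for np.random.choice's tolerance
        out.append(int(np.random.choice(256, p=probs)))
    return bytes(out)

# e.g. print(sample_bytes(model, b"Once upon a time").decode("utf-8", errors="replace"))
```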