
Byte-level LM with 284k params reaches 1.15 bpb on full TinyStories after 1 epoch
I’ve been experimenting with a lightweight byte-level language model architecture built around cumulative memory + delta update blocks instead of standard attention-heavy designs.
I trained it on the full TinyStories dataset (~2.2B bytes) for 1 epoch.
Results for the smaller version (~284k trainable params):
- Validation accuracy: 0.7443
- Validation loss: 0.7980
- Validation bits-per-byte: 1.1512
Larger version (~1.09M params):
- Validation accuracy: 0.7636
- Validation loss: 0.7416
- Validation bits-per-byte: 1.0699
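(For anyone checking the numbers: bits-per-byte here is just the cross-entropy loss converted from nats to bits, i.e. loss / ln 2.)

```python
import math

# bits-per-byte = cross-entropy loss (nats/byte) / ln(2)
for name, val_loss in [("284k", 0.7980), ("1.09M", 0.7416)]:
    print(f"{name}: {val_loss / math.log(2):.4f} bpb")
# 284k: 1.1513 bpb   (reported 1.1512; difference is rounding of the printed loss)
# 1.09M: 1.0699 bpb
```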
Architecture characteristics:
- Byte-level (256 vocab)
- Sequence length: 256
- 8 repeated cumulative/delta processing blocks
- Lightweight TensorFlow implementation
- No retrieval system
- Focus on temporal state evolution and cumulative memory dynamics
The core idea is to treat language as an evolving causal state/trajectory rather than as explicit token-to-token retrieval.
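If it helps make that concrete, here's a minimal Keras sketch of the kind of block I mean: a causal cumulative memory (running mean of the sequence so far) gating a learned delta update. This is an illustration of the idea, not my exact block, so the gating scheme and widths are placeholders:

```python
import tensorflow as tf
from tensorflow import keras

class CumulativeDeltaBlock(keras.layers.Layer):
    """Sketch of a cumulative-memory + delta-update block (illustrative only)."""

    def __init__(self, dim=64, **kwargs):
        super().__init__(**kwargs)
        self.norm = keras.layers.LayerNormalization()
        self.delta = keras.layers.Dense(dim)  # proposed state update from the current position
        self.gate = keras.layers.Dense(dim, activation="sigmoid")  # memory decides how much to apply

    def call(self, x):  # x: (batch, time, dim)
        # Causal cumulative memory: running mean of everything seen so far.
        steps = tf.cast(tf.range(1, tf.shape(x)[1] + 1), x.dtype)
        memory = tf.cumsum(x, axis=1) / steps[None, :, None]
        # Delta update: memory gates a proposed change, applied residually,
        # so the representation evolves as a trajectory instead of being recomputed.
        return x + self.delta(self.norm(x)) * self.gate(memory)
```

Stacking 8 of these on a 64-dim byte embedding is roughly the shape of the 284k model in the summary below.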
Still very experimental and only tested on TinyStories so far, but I thought the parameter efficiency was interesting enough to share.
Would love suggestions for harder datasets or useful ablations to test next.
I can post some code if requested. ezpz
Train bytes: 2,227,753,162 | records: 8,668,300 | steps/epoch: 33,860
Valid bytes: 22,502,601 | records: 87,558 | val_steps: 342
33860/33860 ━━━━━━━━━━━━━━━━━━━━ 1887s 55ms/step - accuracy: 0.7341 - bits_per_byte: 1.2041 - loss: 0.8346 - val_accuracy: 0.7443 - val_bits_per_byte: 1.1512 - val_loss: 0.7980
Saved model weights to checkpoints/mora_full_tinystories.weights.h5
Model: "delta_lm_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃        Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ embedding_6 (Embedding) │ (256, 256, 64) │ 16,384 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_48 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_49 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_50 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_51 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_52 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_53 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_54 (Sequential) │ (256, 256, 64) │ 33,475 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sequential_55 (Sequential) │ (256, 256, 64) │ 33,475 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 852,554 (3.25 MB)
Trainable params: 284,184 (1.08 MB)
Non-trainable params: 0 (0.00 B)
Optimizer params: 568,370 (2.17 MB)
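One detail worth noting: 16,384 (embedding) + 8 × 33,475 (blocks) = 284,184, so the output readout adds no trainable parameters of its own. In the sketch below I model it as a tied head that reuses the embedding matrix; that's an illustrative assumption, since the table shows no separate output layer:

```python
import tensorflow as tf
from tensorflow import keras

# Illustrative assembly matching the summary's param accounting:
# 16,384 (embedding) + 8 * 33,475 (blocks) = 284,184 trainable params.
emb = keras.layers.Embedding(256, 64)
blocks = [CumulativeDeltaBlock(64) for _ in range(8)]  # sketch class from above

def byte_logits(byte_ids):  # byte_ids: (batch, time) int32
    h = emb(byte_ids)  # (batch, time, 64)
    for block in blocks:
        h = block(h)
    # Parameter-free readout: score each position against the 256 byte embeddings.
    return tf.einsum("btd,vd->btv", h, emb.embeddings)
```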
Here's an example of what these 284k params can generate:
Loaded weights: checkpoints/mora_full_tinystories.weights.h5
Once upon a time, there was a family who loved to play with the car and said, "Thank you, Mom. I will not see it. She was so happy and thanked the bird fly away. The bird said, "I am sorry, mom. I didn't mean to make the sun was bright and had lots of fun. The bird was not scared anymore.
<|endoftext|>
Once upon a time, there was a little boy named Tim. Tim loved to play with a ball. The bird said, "Yes, I want to
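Until I post the code, here's a minimal sketch of the sampling loop behind these outputs. It assumes the model returns per-position logits; the temperature value and function name are placeholders, not the exact script:

```python
import numpy as np
import tensorflow as tf

def sample_bytes(model, prompt: bytes, max_new: int = 256,
                 seq_len: int = 256, temperature: float = 0.8) -> bytes:
    """Autoregressive byte-level sampling (illustrative, not the exact script)."""
    out = list(prompt)
    for _ in range(max_new):
        context = np.array([out[-seq_len:]], dtype=np.int32)  # (1, T), last seq_len bytes
        logits = model(context, training=False)[0, -1]        # next-byte logits, shape (256,)
        probs = tf.nn.softmax(logits / temperature).numpy().astype(np.float64)
        probs /= probs.sum()  # renormalize for np.random.choice's tolerance
        out.append(int(np.random.choice(256, p=probs)))
    return bytes(out)

# e.g. print(sample_bytes(model, b"Once upon a time").decode("utf-8", errors="replace"))
```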