u/EffectiveMedium2683

I've been working on a process - I call it Deep Identity Distillation - to create extremely high-fidelity world language models based on real individuals. I started with Herbert Casson for a proof of concept, using GLM-OCR to get good quality OCR of the original PDFs I found on internet archive, then a VLM was given the original images (3 pages, 1 previous and one next as context for current page) as well as the transcribed text and access to a text editor I built for LLMs and the VLM (in this case Qwen3.5:27b) made sure it was as high fidelity as possible to the original texts. After that, I had Gemma3:26b-a4b go through and identify any mentions of memories, dates, etc. in the writing. Those extracted memories and dates were put aside.

Then I did continued pretraining on Mistral 7b. I chose that model specifically because it's malleable in my experience. After CPT, I ran it through a small instruction-tuning fine-tune and then had it generate A TON of samples by responding to all of the prompts in the wildchat dataset. Then I had Gemini 3.1 review each prompt/answer pair and judge whether or not it was in keeping with the original personality (as can be inferred from historical knowledge). Then that was used to generate a fine-tuning dataset. I reverted back to the base that I did continued pretraining on, did SFT on that refined dataset, and then had that model rewrite a deepseek r1 dataset (only the responses) in its own style. Then I fine-tuned on the original prompts/rewritten responses pairs.

Finally, I set up a harness to run that CoT fine-tune against RAG while it reasoned. Fine-tuned a filter model that looked at current context + stream of thought + retrieved memories (both facts and experiences etc as extracted originally) and selectively injected them into the CoT. For example:

CassonLLM: <thought>

The user is inquiring about the relationship between velocity of communication and the quality of human understanding. | Retrieved Memory (Injected): | "The summer of 1924, the correspondence with Dr. Arbuthnot regarding the telegraphic delays in the Levant. I remember the profound sense of anticipation that a letter carried—the physical weight of the ink, the days of silence that allowed a thought to settle in the mind before it was met by another...

I ran that to generate 100k prompt/response pairs, then had Gemini 3.1 flash once again judge each response on how close it was to what one might expect based on the Herbert Casson corpus. Then I took the verified highly identity aligned responses and fine-tuned the model again.

---

Long story short:

High-Fi OCR: Uses Vision LLMs to transcribe original manuscripts with 100% accuracy.
Memory Extraction: A custom filter pulls every date, event, and personal anecdote from the source text into a dedicated database.
Style Training: I use Continued Pretraining (CPT) and Supervised Fine-Tuning (SFT) so the model learns the person’s unique linguistic "DNA" rather than just mimicking it.
Cognitive RAG: The model is trained to "think" (Chain of Thought) while selectively pulling in real memories during the reasoning process.

I want to do this with Benjamin Franklin, but the scale is FAR, FAR grander. Anyone want to help? I was thinking of actually using Talkie 13b as the base this time...

Qwen3.6 35b-a3b 🤯