u/TheRiddler79

For anyone training: how large a data set do you use to accomplish whatever you're training for, where do you get it, and what size model do you train with it?

It's probably going to sound a little insane, and I'll certainly get shade and downvotes for no particular reason, but I took a massive data set of basically all of my chats from the last 3 years across every platform, as well as all of my legal briefs, all of my research, and everything in my Google Docs (self-created, as opposed to files downloaded into Google Drive) that was post-2023.

I ended up with over 250 million words, which I then reduced in several different ways until I had distilled it down to roughly 14 million words of completely unique, non-repeated training data in question-and-answer form, which comes to about 19 million tokens.
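
For a rough sanity check on those numbers, here is a minimal Python sketch of the arithmetic. It only uses the approximate figures quoted above (and "over 250 million" is treated as exactly 250 million), and the variable names are just illustrative.

```python
# Approximate, rounded figures from the post.
raw_words = 250_000_000        # "over 250 million words" before reduction
distilled_words = 14_000_000   # unique Q&A training data after reduction
distilled_tokens = 19_000_000  # reported token count for the distilled set

# Share of the raw text that survived deduplication and distillation.
retained_fraction = distilled_words / raw_words

# Average tokens per word implied by the two reported totals.
tokens_per_word = distilled_tokens / distilled_words

print(f"retained after reduction: {retained_fraction:.1%}")
print(f"tokens per word: {tokens_per_word:.2f}")
```

That works out to roughly 5.6% of the raw text retained and about 1.36 tokens per word.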

I'm not quite sure where the sweet spot is for a dataset of this size. I don't make any claims about the quality; I just know that it's rather large for some random person. So I was curious whether anybody has specific experience with QLoRA or LoRA. I assume full training is completely out of the question for anything practical.
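
To put the LoRA question in concrete terms, here is a rough sketch of how many trainable parameters an adapter setup adds, compared against the 19 million tokens. The formula is the standard one for LoRA (two low-rank matrices per adapted weight). Everything else is an assumption for illustration and not from my actual setup: the hidden size, layer count, rank, and which projections get adapted are all hypothetical. It does not say where the sweet spot is, it just gives a number to reason about.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a (d_out x d_in) weight.

    LoRA learns A with shape (rank x d_in) and B with shape (d_out x rank),
    so the adapter adds rank * (d_in + d_out) trainable parameters.
    """
    return rank * (d_in + d_out)


# Hypothetical model shape and LoRA settings (assumptions, not from the post).
hidden_size = 4096      # assumed width of the attention projections
num_layers = 32         # assumed number of transformer layers
adapted_per_layer = 2   # assumed: adapt two square projections per layer
rank = 16               # assumed LoRA rank

per_matrix = lora_trainable_params(hidden_size, hidden_size, rank)
total_trainable = per_matrix * adapted_per_layer * num_layers

# Token count reported in the post.
dataset_tokens = 19_000_000
tokens_per_param = dataset_tokens / total_trainable

print(f"trainable LoRA params: {total_trainable:,}")
print(f"tokens per trainable param: {tokens_per_param:.2f}")
```

Changing `rank` or the number of adapted projections scales the trainable parameter count linearly, so it is easy to see how different configurations compare against the same data.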

Before anybody tells me that the data must be trash or can't be that large or whatever, keep in mind that's irrelevant, which is why I prefaced this by saying I make no claims about the quality of the data. I'm simply curious about the sweet spot for model size with that much data, before training on it starts breaking the baseline logic.

I think it would be hard to explain to a normal person why I spend my day staring at screens like this🤣☠️😅