u/Fyren-1131

▲ 1 r/LLM

What's the status on ethically sourced datasets for language models?

Hi there,

I'm a software developer / musician from Scandinavia. Since my profession is software development and my hobby is composing and playing music, I'm kind of torn between the two poles of opinion on the current generation of LLMs.

My colleagues are excited about implementing AI best practices where it makes sense - cautiously optimistic, if you will. My musician friends are vehemently against AI, to the point where it's a banned conversation topic to keep the peace.

I get where both groups are coming from, especially the musicians. The obscene energy draw is one thing, but - and this is what led me to make this post - the more personally aggravating matter is the intellectual property theft underlying every single popular LLM right now (Gemini, Claude, Grok, GPT). They're all based on it, as far as I understand.

The way I understand it, all of these are trained on some massive scraped corpus of text that includes both public and copyrighted material. Are there any ethically sourced datasets in use? Datasets that only include public and/or licensed material?

I know of datacenters and AI solutions being developed that try to turn the energy usage into a net positive through heating solutions (Infomaniak in Geneva), so I was hoping to be similarly positively surprised to learn that datasets are getting the same kind of attention. It's a bit sad to see people blindly positive and enthusiastic about generating texts, images, videos -- ignorant of the fact that the companies behind those tools literally stole the artistry needed to concoct the very images being generated.

u/Fyren-1131 — 9 hours ago