u/kallekro

Automated Plagiarism Machines

We all know that LLMs are "stochastic parrots": sophisticated mirrors that reflect the data they were fed in training. They do so cleverly, blending millions of disparate sources into a single output. By raising the "temperature" parameter, we can introduce more randomness into the next-token prediction, making LLMs seem "creative" and effectively masking the fact that they are copying someone else's homework.
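For anyone unfamiliar with the temperature knob, here is a minimal sketch of how it is commonly described: the model's raw scores (logits) are divided by the temperature before being turned into probabilities, so a higher value flattens the distribution and makes unlikely tokens more probable. The function names and the three example logits are made up for illustration, and real inference stacks differ in the details.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick one token index at random according to the scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate next tokens (not from a real model).
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, 0.5)   # sharper: top token dominates
high = softmax_with_temperature(logits, 2.0)  # flatter: more randomness

rng = random.Random(0)
picked = sample_token(logits, 2.0, rng)
```

At low temperature the most likely token is picked almost every time; at high temperature the less likely tokens get a real chance, which is where the apparent "creativity" comes from.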

Most programmers are not too concerned about this, likely because it is, in a way, a continuation of the existing culture: writing detailed answers on StackOverflow, copying snippets, publishing and using open source libraries.
Of course the original author will no longer be credited; credit goes to the LLM. But this seems to be acceptable in the programming community, even though most people seem to prefer toggling off the option that allows GitHub to train on their personal data.

To many, this inherent tendency toward plagiarism seems far more egregious in other contexts:
- LLM "art" that is obviously copying someone's art style.
- Design that looks exactly like all the other vibe-coded apps.
- Music without soul.
- Writing that sounds like AI.

But beyond aesthetics, the problem becomes hard to overlook in niche fields, where everything written on a topic comes from only a few people. A good example is cutting-edge scientific research. Here it is not unlikely that a topic is explored in only a single published paper, written by a handful of researchers. There are many other such examples, where the number of people who contributed the relevant training data is very low. When an LLM generates content like that, the theft is no longer as abstract.

We seem to tolerate these automated plagiarism machines largely because it is hard to discern exactly what was plagiarized and who the original authors were. But when the pool of sources is small, it becomes a lot more obvious. Does that mean generating this type of content is morally distinct? Or does it just expose our subconscious hypocrisy?
