u/BigNaturalTilts

I’m obviously wrong for this opinion, but I believe booru tags describe visual media far better than natural language does. Simply listing the contents of an image is far clearer than “the light dramatically plays against blah blah,” which I think is just subjective abstruseness.

Most new models now use massive text encoders, which are excellent for understanding, but there are too many ways to describe the same image in natural language.

Same for video: we could have timestamped tags describing each scene in a comma-separated, booru-style format. That removes ambiguity.
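
To make the video idea concrete, here is a rough sketch of what I have in mind. The `start-end: tag, tag, ...` line format and the names `TaggedSegment`, `parse_caption`, and `tags_at` are all made up for illustration; as far as I know no existing dataset or model uses this exact layout.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaggedSegment:
    start: float      # segment start, in seconds
    end: float        # segment end, in seconds
    tags: list[str]   # booru-style tags for this span of the video


def parse_caption(text: str) -> list[TaggedSegment]:
    """Parse lines like '0.0-2.5: beach, sunset' (hypothetical format)."""
    segments = []
    for line in text.strip().splitlines():
        # Split the time span from the tag list at the first colon.
        span, _, tag_part = line.partition(":")
        start, _, end = span.partition("-")
        tags = [t.strip() for t in tag_part.split(",") if t.strip()]
        segments.append(TaggedSegment(float(start), float(end), tags))
    return segments


def tags_at(segments: list[TaggedSegment], t: float) -> list[str]:
    """Return the tags of the first segment covering time t, or [] if none."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.tags
    return []


caption = """
0.0-2.5: beach, sunset, 1girl, walking
2.5-6.0: beach, 1girl, close-up, smiling
"""
segments = parse_caption(caption)
print(tags_at(segments, 3.0))  # ['beach', '1girl', 'close-up', 'smiling']
```

Every moment in the clip maps to one flat tag list, so there is only one way to write down what is on screen at a given time.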

Can anyone tell me why the open source community chose natural language over booru-style tags?

Why did we move away from booru tags?