
A question for AI critics who object only to AI video and image generation: what do you think about omni models that combine text, video, and images, and become better overall as a result?
For example: https://deepmind.google/models/gemini-omni/
Of course, how much adding images and video improves the AI itself is debatable, but one can see that video and text improve quite significantly, so the opposite should also be true.