u/Gloomy_Recognition_4

I like spending my free time testing new AI tools and seeing where they might fit into real computer vision workflows. This time I experimented with synthetic training data generation for Driver Monitoring Systems using Seedance 2.0.

The inspiration came from Vision Banana: https://vision-banana.github.io/

The idea that really caught my attention is simple but powerful: many vision tasks can be represented as RGB outputs. A segmentation mask, an instance mask, a depth map, or another dense prediction target can all be treated as an image-like output.
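To make the "mask as RGB image" idea concrete, here is a minimal sketch of turning a colour-coded semantic frame back into class IDs. The palette and class names are my own assumptions for illustration, not the encoding used in the experiment. Because generated and compressed video rarely reproduces exact colours, each pixel is snapped to the nearest palette colour instead of being matched exactly. It uses plain Python lists to stay dependency-free; in practice you would probably vectorise this with NumPy.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical palette: these colours and classes are assumptions for the
# example, not the encoding used in the original experiment.
PALETTE: Dict[RGB, int] = {
    (0, 0, 0): 0,      # background
    (255, 0, 0): 1,    # face
    (0, 255, 0): 2,    # hands
    (0, 0, 255): 3,    # steering wheel
}


def nearest_class(pixel: RGB) -> int:
    """Return the class whose palette colour is closest in squared RGB distance."""
    def distance(colour: RGB) -> int:
        return sum((a - b) ** 2 for a, b in zip(pixel, colour))

    return PALETTE[min(PALETTE, key=distance)]


def decode_mask(mask: List[List[RGB]]) -> List[List[int]]:
    """Convert a frame of RGB pixels (rows of tuples) into a class-ID map."""
    return [[nearest_class(pixel) for pixel in row] for row in mask]
```

Nearest-colour snapping hides small colour drift, but it will also silently assign a class to colours that are far from every palette entry, so a distance threshold is worth adding for real data.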

So I tried to apply this thinking to video.

The workflow:

1. Generate a realistic synthetic driver monitoring video
2. Use the same video to generate a semantic segmentation mask
3. Use the same video to generate an instance segmentation mask
4. Combine the outputs into a dataset-like structure

The mosaic video shows the result:

RGB video + semantic mask + instance mask, aligned frame by frame.
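For the last workflow step (combining the outputs), a rough sketch of a frame-aligned index could look like this. It assumes the three videos have already been split into PNG frames whose file names end in a frame number, such as `rgb/frame_000001.png`. That naming scheme and the `build_index` helper are hypothetical, not part of the original pipeline.

```python
import re
from typing import Dict, Iterable, List

# Assumed naming scheme: the file name ends in "<frame number>.png".
FRAME_RE = re.compile(r"(\d+)\.png$")


def frame_id(name: str) -> int:
    """Extract the trailing frame number from a file name."""
    match = FRAME_RE.search(name)
    if match is None:
        raise ValueError(f"no frame number in {name!r}")
    return int(match.group(1))


def build_index(
    rgb: Iterable[str], semantic: Iterable[str], instance: Iterable[str]
) -> List[Dict[str, object]]:
    """Pair RGB, semantic and instance frames by frame number."""
    streams = {"rgb": rgb, "semantic": semantic, "instance": instance}
    by_frame = {
        key: {frame_id(name): name for name in names}
        for key, names in streams.items()
    }
    reference = set(by_frame["rgb"])
    for key, frames in by_frame.items():
        # Refuse to build a dataset if any stream is missing or has extra frames.
        if set(frames) != reference:
            raise ValueError(f"{key} stream is not aligned with rgb")
    return [
        {"frame": i, **{key: by_frame[key][i] for key in streams}}
        for i in sorted(reference)
    ]
```

Failing loudly on a frame-count mismatch is deliberate: a one-frame offset between the video and its masks would quietly corrupt every label after it.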

The scene is a fictional driver gradually becoming drowsy behind the wheel. This kind of scenario is useful for DMS development, but difficult to collect and annotate at scale with real-world data.

Of course, generated annotations still need QA. They are not perfect ground truth.

But for prototyping, rare-case simulation, and early dataset generation, this feels like a very promising direction.
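On the QA point, one cheap automated check (my own suggestion, not something from the experiment) is to compare the two generated masks against each other: wherever the instance mask marks an object, the semantic mask should mark a non-background class, and vice versa. The sketch below assumes both masks have already been decoded into integer label maps with `0` as background.

```python
from typing import List


def foreground_agreement(
    semantic: List[List[int]], instance: List[List[int]], background: int = 0
) -> float:
    """Fraction of pixels where both masks agree on foreground vs background."""
    total = 0
    agree = 0
    for sem_row, inst_row in zip(semantic, instance):
        for sem, inst in zip(sem_row, inst_row):
            total += 1
            # True counts as 1 when both masks make the same fg/bg decision.
            agree += (sem != background) == (inst != background)
    return agree / total if total else 1.0
```

Frames whose score falls below some threshold (which would need tuning on real outputs) can be queued for manual review. This only catches disagreement between the masks, not cases where both are wrong in the same way.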

The interesting part is that the final output is not just a nice synthetic video. It can become structured training data:

- RGB frames from the generated video
- Semantic classes from the semantic mask
- Object regions and bounding boxes from the instance mask
- YOLO / COCO-style annotations after post-processing
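As an illustration of that post-processing, here is a minimal sketch that derives YOLO-style rows (`class x_center y_center width height`, all normalised to the image size) from an instance map. It assumes the masks are already decoded into integer label maps with `0` as background, and it picks each instance's class by majority vote over the semantic map. The `yolo_rows` helper and the voting rule are my assumptions, not the method from the blog post.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # x_min, y_min, x_max, y_max (inclusive pixels)
YoloRow = Tuple[int, float, float, float, float]


def yolo_rows(
    instance: List[List[int]], semantic: List[List[int]], background: int = 0
) -> List[YoloRow]:
    """Build one YOLO-style row per instance ID found in the instance map."""
    height, width = len(instance), len(instance[0])
    boxes: Dict[int, Box] = {}
    votes: Dict[int, Counter] = defaultdict(Counter)
    for y, row in enumerate(instance):
        for x, inst in enumerate(row):
            if inst == background:
                continue
            x0, y0, x1, y1 = boxes.get(inst, (x, y, x, y))
            boxes[inst] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
            votes[inst][semantic[y][x]] += 1  # semantic class under this pixel

    rows: List[YoloRow] = []
    for inst in sorted(boxes):
        x0, y0, x1, y1 = boxes[inst]
        w, h = x1 - x0 + 1, y1 - y0 + 1
        cls = votes[inst].most_common(1)[0][0]
        rows.append(
            (cls, (x0 + w / 2) / width, (y0 + h / 2) / height, w / width, h / height)
        )
    return rows
```

Note that the class IDs here come straight from the semantic map, so they would still need remapping to whatever zero-based class list the training config expects. Tiny speckle instances from mask noise would also need filtering, for example by a minimum area.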

I wrote a more detailed blog post about the experiment, "Synthetic DMS Training Data Generation with Video Models":

https://www.antal.ai/blog/synthetic_dms_training_data.html