Synthetic DMS Training Data Generation with Video Models
I like spending my free time testing new AI tools and seeing where they might fit into real computer vision workflows. This time I experimented with synthetic training data generation for Driver Monitoring Systems using Seedance 2.0.
The inspiration came from Vision Banana: https://vision-banana.github.io/
The idea that really caught my attention is simple but powerful: many vision tasks can be represented as RGB outputs. A segmentation mask, an instance mask, a depth map, or any other dense prediction target can each be treated as an image-like output.
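To make the "mask as RGB image" idea concrete, here is a minimal sketch of the reverse step: turning a color-coded mask frame back into per-pixel class IDs. The class names and colors in `PALETTE` are assumptions for illustration, not the palette used in the experiment. Nearest-color matching is used because generated and compressed video rarely produces exact colors.

```python
import numpy as np

# Hypothetical palette: class names and colors are assumptions for illustration.
PALETTE = {
    "background": (0, 0, 0),
    "driver": (255, 0, 0),
    "steering_wheel": (0, 255, 0),
    "seat": (0, 0, 255),
}


def rgb_mask_to_class_ids(mask_rgb, palette=PALETTE):
    """Map each pixel of an H x W x 3 mask to the index of its nearest palette color."""
    colors = np.array(list(palette.values()), dtype=np.float32)       # (K, 3)
    pixels = mask_rgb.astype(np.float32)[:, :, None, :]               # (H, W, 1, 3)
    # Squared Euclidean distance from every pixel to every palette color.
    dist = ((pixels - colors[None, None, :, :]) ** 2).sum(axis=-1)    # (H, W, K)
    return dist.argmin(axis=-1).astype(np.uint8)


# A 1 x 2 mask with slightly noisy colors.
noisy = np.array([[[250, 5, 3], [2, 1, 0]]], dtype=np.uint8)
class_ids = rgb_mask_to_class_ids(noisy)
```

The class index follows the insertion order of the palette, so "driver" is index 1 in this sketch.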
So I tried to apply this thinking to video.
The workflow:
- Generate a realistic synthetic driver monitoring video
- Use the same video to generate a semantic segmentation mask
- Use the same video to generate an instance segmentation mask
- Combine the outputs into a dataset-like structure
The mosaic video shows the result:
RGB video + semantic mask + instance mask, aligned frame by frame.
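A mosaic like this can be assembled in a few lines. The sketch below assumes the three videos have already been decoded into lists of equally sized H x W x 3 NumPy arrays; the function names are hypothetical and video decoding is left out.

```python
import numpy as np


def make_mosaic_frame(rgb, semantic, instance):
    """Place the RGB frame and both mask frames side by side."""
    if not (rgb.shape == semantic.shape == instance.shape):
        raise ValueError("all three frames must have the same shape")
    return np.hstack([rgb, semantic, instance])


def make_mosaic_video(rgb_frames, semantic_frames, instance_frames):
    """Build the mosaic frame by frame, refusing videos of different lengths."""
    if not (len(rgb_frames) == len(semantic_frames) == len(instance_frames)):
        raise ValueError("videos must have the same number of frames")
    return [
        make_mosaic_frame(r, s, i)
        for r, s, i in zip(rgb_frames, semantic_frames, instance_frames)
    ]


# Two dummy 4 x 6 frames per stream stand in for decoded video.
dummy = [np.zeros((4, 6, 3), dtype=np.uint8) for _ in range(2)]
mosaic = make_mosaic_video(dummy, dummy, dummy)
```

The explicit length and shape checks matter here: a dropped frame in one stream would silently misalign every annotation after it.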
The scene is a fictional driver gradually becoming drowsy behind the wheel. This kind of scenario is useful for DMS development, but difficult to collect and annotate at scale with real-world data.
Of course, generated annotations still need QA. They are not perfect ground truth.
But for prototyping, rare-case simulation, and early dataset generation, this feels like a very promising direction.
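Part of that QA can be automated. As one possible check (an assumption on my side, not a complete QA process), the sketch below flags frames where a large share of pixels changes class compared to the previous frame, which often indicates mask flicker rather than real motion. The threshold value is arbitrary and would need tuning per dataset.

```python
import numpy as np


def flag_flicker(class_id_frames, max_changed_fraction=0.2):
    """Return indices of frames whose class map differs too much from the previous one."""
    flagged = []
    for i in range(1, len(class_id_frames)):
        # Fraction of pixels whose class ID changed between consecutive frames.
        changed = np.mean(class_id_frames[i] != class_id_frames[i - 1])
        if changed > max_changed_fraction:
            flagged.append(i)
    return flagged


# Three tiny class maps: the last one flips every pixel.
frames = [
    np.zeros((4, 4), dtype=np.uint8),
    np.zeros((4, 4), dtype=np.uint8),
    np.ones((4, 4), dtype=np.uint8),
]
suspicious = flag_flicker(frames)
```

Flagged frames still need a human look; the check only narrows down where to start.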
The interesting part is that the final output is not just a nice synthetic video. It can become structured training data:
- RGB frames from the generated video
- semantic classes from the semantic mask
- object regions and bounding boxes from the instance mask
- YOLO / COCO-style annotations after post-processing
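The post-processing step for bounding boxes can be sketched as follows. It assumes the instance mask has already been converted to an H x W integer map where 0 is background and every other value is one instance; the `class_of_instance` mapping is hypothetical. Each output line follows the YOLO text format: class index, then box center and size normalized by image width and height.

```python
import numpy as np


def instance_map_to_yolo(instance_ids, class_of_instance):
    """Convert an H x W instance-ID map into YOLO-format annotation lines."""
    h, w = instance_ids.shape
    lines = []
    for inst_id in np.unique(instance_ids):
        if inst_id == 0:  # 0 is treated as background in this sketch
            continue
        ys, xs = np.nonzero(instance_ids == inst_id)
        # Tight pixel box; +1 makes the max edge exclusive.
        x0, x1 = xs.min(), xs.max() + 1
        y0, y1 = ys.min(), ys.max() + 1
        cx, cy = (x0 + x1) / 2 / w, (y0 + y1) / 2 / h
        bw, bh = (x1 - x0) / w, (y1 - y0) / h
        cls = class_of_instance[int(inst_id)]
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    return lines


# One instance occupying rows 2-5 and columns 4-11 of a 10 x 20 frame.
inst = np.zeros((10, 20), dtype=np.int32)
inst[2:6, 4:12] = 1
yolo_lines = instance_map_to_yolo(inst, {1: 0})
```

COCO-style output needs the same boxes in absolute pixels as `[x, y, width, height]`, so the pixel values `x0`, `y0`, `x1 - x0`, `y1 - y0` can be reused directly.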
I wrote a more detailed blog post about the experiment here: