u/jonathancheckwise

Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid; the accuracy on real submitted images was not. The same image was classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

Below are the roles I ended up assigning to the six base models. I am listing roles rather than specific model names because the field moves too fast for names to stay useful for long. Roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.
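A minimal sketch of the agree-or-escalate routing described above. Everything in it is illustrative rather than my production code: the `route` function, the `Verdict` type, the role keys, the 0.5 vote threshold and the 0.15 spread limit are all assumed values, and each model is assumed to output a probability that the image is AI-generated.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Verdict:
    label: str     # "ai", "authentic", or "review"
    score: float   # mean probability that the image is AI-generated
    spread: float  # population standard deviation of the model scores


def route(scores: dict, threshold: float = 0.5, max_spread: float = 0.15) -> Verdict:
    """Return a label only when every model votes the same way and the
    scores are tightly clustered; otherwise escalate to review.

    `scores` maps a (hypothetical) role name to that model's probability.
    """
    values = list(scores.values())
    avg = mean(values)
    spread = pstdev(values)
    votes = [v >= threshold for v in values]
    unanimous = all(votes) or not any(votes)
    if unanimous and spread <= max_spread:
        return Verdict("ai" if votes[0] else "authentic", avg, spread)
    # Split votes or a wide spread: send to human review or the fine-tuned model.
    return Verdict("review", avg, spread)


agree = {"diffusion": 0.93, "gan": 0.90, "frequency": 0.88,
         "alt_data": 0.91, "faces": 0.95, "general": 0.89}
split = {"diffusion": 0.92, "gan": 0.15, "frequency": 0.40,
         "alt_data": 0.85, "faces": 0.30, "general": 0.60}
print(route(agree).label)
print(route(split).label)
```

The spread check is there so that a unanimous but lukewarm vote (some models at 0.55, others at 0.99) still gets escalated instead of being reported as high confidence.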

The non-ML layer matters more than I expected. It covers C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, and watermark signatures from the major providers when those are detectable. None of these is reliable on its own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.
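One way to sketch the "corroborate, but never decide alone" idea is to nudge the ensemble score in log-odds space for each non-ML signal that is actually present. This is an assumption about how such a layer could be wired, not a description of my pipeline: the `combine` function, the signal names and the per-signal `weight` are hypothetical, and a real system would likely weight each signal differently.

```python
import math
from typing import Optional


def combine(ensemble_score: float, signals: dict, weight: float = 0.5) -> float:
    """Shift the ensemble probability by a fixed log-odds step per signal.

    `signals` maps a (hypothetical) signal name to:
      True  -> signal points to AI generation
      False -> signal points to an authentic capture
      None  -> signal absent or stripped, treated as no evidence
    """
    p = min(max(ensemble_score, 1e-6), 1 - 1e-6)  # avoid log(0)
    z = math.log(p / (1 - p))
    for says_ai in signals.values():
        if says_ai is None:
            continue  # missing metadata is not evidence either way
        z += weight if says_ai else -weight
    return 1 / (1 + math.exp(-z))


stripped: dict = {"c2pa": None, "exif_trace": None, "watermark": None}
corroborated: dict = {"c2pa": None, "exif_trace": True, "watermark": True}
print(round(combine(0.70, stripped), 3))
print(combine(0.70, corroborated) > 0.70)
```

Treating a missing signal as `None` rather than as "authentic" matters here, since stripped metadata is the normal case for adversarial submissions.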

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the line between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Frames extracted from some of the latest video generators are catching us flat-footed because their per-frame artifacts differ from those of still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.
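For concreteness, the retirement heuristic can be sketched as a per-model minority rate over disagreement cases. The `outlier_rates` function, the model keys and the input shape (one dict of boolean votes per image, True meaning "AI-generated") are assumptions made for illustration.

```python
from collections import Counter


def outlier_rates(cases: list) -> dict:
    """For each model, the fraction of disagreement cases in which it
    voted against the majority. Unanimous cases and exact ties are skipped
    because they have no minority to attribute."""
    minority = Counter()
    names = set()
    disagreements = 0
    for votes in cases:
        names.update(votes)
        n = len(votes)
        n_ai = sum(votes.values())
        if n_ai in (0, n) or n_ai * 2 == n:
            continue
        disagreements += 1
        majority_says_ai = n_ai * 2 > n
        for name, vote in votes.items():
            if vote != majority_says_ai:
                minority[name] += 1
    if disagreements == 0:
        return {name: 0.0 for name in names}
    return {name: minority[name] / disagreements for name in names}


cases = [
    {"a": True, "b": True, "c": False},
    {"a": False, "b": False, "c": True},
    {"a": True, "b": True, "c": True},   # unanimous, ignored
    {"a": True, "b": False, "c": True},
]
rates = outlier_rates(cases)
print(sorted(rates.items()))
```

Note that this only measures disagreement with the majority, not correctness. A specialist that is the lone correct vote on a new generator would score as an outlier here, so the rate probably needs to be checked against human-review outcomes before retiring anything.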

