u/jonathancheckwise

Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid; the accuracy on real submitted images was not. The same image was classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models (I am leaving out specific names because the field moves too fast for them to stay useful for long) are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where detectors hit the most edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.
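A minimal sketch of that routing rule, assuming each base model returns a probability in [0, 1] that the image is AI-generated. The function name `route`, the role names, and the 0.5 vote threshold are illustrative assumptions, not the production values.

```python
from statistics import mean


def route(scores, vote_threshold=0.5):
    """Route one image based on whether the base models agree.

    scores: dict mapping a (hypothetical) role name to that model's
            probability that the image is AI-generated.
    """
    values = list(scores.values())
    # Turn each probability into a binary vote.
    votes = [v >= vote_threshold for v in values]
    unanimous = all(votes) or not any(votes)

    if unanimous:
        # Every model agrees: accept the shared label automatically.
        label = "ai_generated" if votes[0] else "authentic"
        return {"label": label, "mean_score": mean(values), "route": "auto"}

    # Any split is escalated (human review or the fine-tuned model).
    return {"label": "uncertain", "mean_score": mean(values), "route": "escalate"}


if __name__ == "__main__":
    agree = {"diffusion": 0.93, "gan": 0.81, "frequency": 0.77,
             "alt_distribution": 0.88, "faces": 0.90, "general": 0.85}
    split = dict(agree, frequency=0.20)
    print(route(agree)["route"])  # unanimous case
    print(route(split)["route"])  # split case
```

Treating any split as an escalation is the strictest reading of the rule; a softer variant could escalate only when the minority is larger than one model.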

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.
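One way the corroboration step could look, as a sketch under stated assumptions: each non-ML signal is reduced to True (points to AI-generated), False (points to authentic), or None (absent or stripped). The function `apply_provenance`, the signal names, and the `boost` factor are all hypothetical. Absent signals are ignored rather than counted as evidence, since stripped metadata says nothing either way.

```python
def apply_provenance(score, signals, boost=0.25):
    """Tighten an ensemble score using non-ML signals.

    score:   ensemble probability that the image is AI-generated.
    signals: dict of signal name -> True (suggests AI), False (suggests
             authentic), or None (signal absent or stripped).
    Returns (adjusted_score, conflict_flag).
    """
    leaning_ai = score >= 0.5
    conflict = False
    for value in signals.values():
        if value is None:
            continue  # missing metadata is not evidence either way
        if value == leaning_ai:
            # Corroborating signal: move part of the way toward the extreme.
            target = 1.0 if leaning_ai else 0.0
            score += (target - score) * boost
        else:
            # Contradicting signal: leave the score alone, flag for review.
            conflict = True
    return score, conflict


if __name__ == "__main__":
    signals = {"c2pa": None, "exif_trace": True, "watermark": None}
    print(apply_provenance(0.80, signals))
```

The signals only ever sharpen a verdict the ensemble already leans toward; a contradicting signal raises a flag instead of flipping the label.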

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the line between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Frames extracted from some of the latest video generators are catching us flat-footed because their per-frame artifacts differ from those of still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.
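For anyone who wants to compare numbers, here is a rough sketch of how the "consistent outlier" heuristic could be measured. It assumes logged per-model scores for each image; `outlier_rates`, the model names, and the 0.5 threshold are illustrative assumptions.

```python
from collections import Counter


def outlier_rates(cases, vote_threshold=0.5):
    """Fraction of disagreement cases in which each model was in the minority.

    cases: list of dicts, each mapping model name -> probability that
           the image is AI-generated.
    Unanimous cases and exact ties are skipped, since neither has a
    well-defined outlier.
    """
    outlier_counts = Counter()
    disagreements = 0
    for scores in cases:
        votes = {m: s >= vote_threshold for m, s in scores.items()}
        yes = sum(votes.values())
        no = len(votes) - yes
        if yes == 0 or no == 0 or yes == no:
            continue
        disagreements += 1
        majority = yes > no
        for model, vote in votes.items():
            if vote != majority:
                outlier_counts[model] += 1
    if disagreements == 0:
        return {}
    models = {m for scores in cases for m in scores}
    return {m: outlier_counts[m] / disagreements for m in models}


if __name__ == "__main__":
    logged = [
        {"diffusion": 0.9, "gan": 0.8, "faces": 0.1},
        {"diffusion": 0.9, "gan": 0.2, "faces": 0.1},
        {"diffusion": 0.9, "gan": 0.9, "faces": 0.9},
    ]
    print(outlier_rates(logged))
```

A high outlier rate only says a model disagrees with the majority, not that it is wrong; a specialist can be the lone correct vote on exactly the images it was added for, so the rate is more useful next to human-review outcomes than on its own.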

reddit.com
u/jonathancheckwise — 2 days ago

Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the lines between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Some of the latest video-frame extraction generators are catching us flat-footed because their per-frame artifacts are different from still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.

reddit.com
u/jonathancheckwise — 2 days ago

Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the lines between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Some of the latest video-frame extraction generators are catching us flat-footed because their per-frame artifacts are different from still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.

reddit.com
u/jonathancheckwise — 2 days ago

Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the lines between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Some of the latest video-frame extraction generators are catching us flat-footed because their per-frame artifacts are different from still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.

reddit.com
u/jonathancheckwise — 2 days ago

Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the lines between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Some of the latest video-frame extraction generators are catching us flat-footed because their per-frame artifacts are different from still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.

reddit.com
u/jonathancheckwise — 2 days ago

MSingle-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the lines between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Some of the latest video-frame extraction generators are catching us flat-footed because their per-frame artifacts are different from still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.

reddit.com
u/jonathancheckwise — 2 days ago

[D] Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the lines between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Some of the latest video-frame extraction generators are catching us flat-footed because their per-frame artifacts are different from still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.

reddit.com
u/jonathancheckwise — 2 days ago

[D] Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the lines between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Some of the latest video-frame extraction generators are catching us flat-footed because their per-frame artifacts are different from still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.

reddit.com
u/jonathancheckwise — 2 days ago

[D] Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the lines between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Some of the latest video-frame extraction generators are catching us flat-footed because their per-frame artifacts are different from still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.

reddit.com
u/jonathancheckwise — 2 days ago

Single-model AI image detection failed in production. Here’s what 6 models in ensemble actually look like

About a year ago I was running a single open-source AI image detector in production for a fact-checking pipeline. The accuracy on paper was solid, the accuracy on real submitted images was not. The same image classified differently across reruns when I varied preprocessing. Images from generators released after the model’s training cutoff were systematically misclassified. False positives on heavily compressed authentic photos were uncomfortably high.

I moved to an ensemble of six open-source models plus one fine-tuned model, with a layer of non-ML signals on top. The combined system is meaningfully more stable in production than any single model in the set. Writing this up because the ensemble approach is widely discussed in CV literature but the practical “which roles does each model fill” question is rarely covered in a deployment context.

The roles I ended up assigning to the six base models, not the specific names because the field moves too fast for that to be useful for long, are roughly: one model strong on diffusion-generated images (Stable Diffusion family, DALL-E family), one strong on GAN artifacts (StyleGAN derivatives), one focused on frequency-domain features that are robust to JPEG compression, one trained on a different data distribution to catch the obvious failure mode of single-model bias, one specialized on faces (where most generators concentrate effort and where most detection has edge cases), and one general-purpose model with broad coverage acting as a fallback.

These do not always agree. Disagreement between models is actually the most useful signal the ensemble produces. When all six agree, confidence is high. When they split, the image goes to human review or to the fine-tuned model that I update on each new generator. The fine-tuning pipeline runs continuously, with a new snapshot whenever a major new generator is released or quality degrades on a known one. In practice that has been every few weeks.

The non-ML layer matters more than I expected. C2PA metadata when present, generator-specific EXIF traces, compression history if reconstructable, watermark signatures from the major providers when those are detectable. None of these are reliable on their own because adversarial actors strip metadata, but they meaningfully tighten the ensemble’s confidence when they corroborate.

Where it still fails. Images that have been through multiple compression cycles after generation are hard. Images edited post-generation in standard tools blur the line between AI-generated and AI-assisted in ways the binary classification framing does not really handle. Frames extracted from some of the latest video generators are catching us flat-footed because their per-frame artifacts differ from those of still-image generators.

Question for the sub: anyone running ensembles of this shape, what is your retraining cadence and how do you decide when to retire a model from the ensemble versus just adding a new one? My current heuristic is to retire only when a model is consistently the outlier on disagreement cases, but I have no idea if that is principled or convenient.
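
For anyone who wants to compare numbers, the retirement heuristic can be sketched as a per-model outlier rate over disagreement cases. The `outlier_rates` function and the 0.5 cut-off are assumptions for illustration. Note that it measures dissent from the majority, not correctness against ground truth.

```python
from collections import Counter


def outlier_rates(cases, threshold=0.5):
    """cases: list of dicts mapping model role -> P(AI-generated)."""
    outlier = Counter()
    disagreements = 0
    for scores in cases:
        votes = {role: p >= threshold for role, p in scores.items()}
        yes = sum(votes.values())
        no = len(votes) - yes
        if yes == 0 or no == 0 or yes == no:
            continue  # unanimous or tied: no outlier to attribute
        disagreements += 1
        majority = yes > no
        for role, vote in votes.items():
            if vote != majority:
                outlier[role] += 1
    if disagreements == 0:
        return {}
    # Share of disagreement cases in which each model voted against the majority.
    return {role: outlier[role] / disagreements for role in cases[0]}


history = [
    {"diffusion": 0.9, "gan": 0.8, "faces": 0.2},
    {"diffusion": 0.1, "gan": 0.2, "faces": 0.7},
    {"diffusion": 0.9, "gan": 0.9, "faces": 0.9},  # unanimous, ignored
]
print(outlier_rates(history))
```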

reddit.com
u/jonathancheckwise — 2 days ago

I bet the AI Act would be enforced

Some founder context for a Sunday morning. I am building a small AI company solo from Geneva, in the information-quality space. Most EU founders I have talked to over the past year took it as given that the AI Act would get watered down, postponed, or quietly ignored. I took the opposite bet. I built the whole stack as though every transparency, audit, and watermarking obligation would actually be enforced from August 2026 onwards.

A few months in, I want to be honest about what that bet has cost so far.

The infrastructure choices are the most expensive part. EU-hosted everything: VPS in Geneva, GPU inference on Scaleway in Paris, LLM provider is Mistral, vector store is self-hosted Qdrant, auth and storage on self-hosted Appwrite. Each of those is more expensive or more brittle than the US-cloud-first defaults most accelerators recommend. I have spent weekends on infra problems that would have been one-click somewhere else.

The architectural cost was the second. Every claim my product evaluates needs to produce an audit trail an end user, or in theory a regulator, can read. That ruled out the family of “let the LLM produce the verdict” architectures that most fact-checking and decision-support tools default to. The scoring lives in deterministic Python instead. Significantly more engineering work than prompting a model to rate something, and you cannot prompt your way out of a bad rule.

Then the May 7 Omnibus deal happened, and it proved both camps partly right at once. High-risk Annex III moved 16 months, to December 2027. The founders who bet on delay got their delay on that piece. But the Article 50 deployer obligations stayed at August 2, 2026: deepfake labeling, chatbot disclosure, AI-generated content transparency on public interest matters. The provider watermarking obligation got pushed four months, to December 2. The political case for any further delay has been spent.

Where I land. The cost of compliance-first is real and I undersold it to myself when I started. The infra premium, the architecture rigidity, the policy reading, all paid up front. I do think it pays back, partly because retrofitting a non-compliant system after enforcement starts will be more expensive than building compliantly was, and partly because EU enterprise buyers and institutional partners are starting to ask compliance questions early rather than late. But the payback is theoretical for now. I am six months into a thirty-six month bet.

Curious if anyone else in this sub took the same direction, or if you bet the other way and are now figuring out what to retrofit. And especially: anyone with concrete contact with the supervisory authorities of their member state, would love to know how seriously they are staffing enforcement.

reddit.com
u/jonathancheckwise — 3 days ago
▲ 1 r/OSINT

Notes on automating source reliability scoring (three axes, three failure modes)

Sharing notes from a year of trying to automate parts of source reliability scoring in a fact-checking pipeline. None of this replaces a human analyst with context, but pieces of it can do useful triage work at scale where humans can’t keep up with volume. Writing this up because it’s the kind of thing the OSINT community discusses better than anyone else, and I’d be curious to compare notes with people who do this in the field daily.

I ended up with three axes that I evaluate independently then combine with weights that vary by claim category. The axes are domain reputation, content recency, and cross-source confirmation. Each one fails in characteristic ways and each one taught me something the hard way.

Domain reputation is the most tempting and the most dangerous axis. The temptation is to maintain a curated list of trusted domains scored on a 0 to 10 scale: AFP at 9, nytimes.com at 8, randomblogger.substack.com at 2, and so on. This works for most claims and produces respectable triage. Where it breaks is what I call article-vs-domain variance. A normally credible outlet can run a poorly-sourced opinion piece. A normally unreliable outlet can run a properly-sourced investigation. Domain-level scoring will flag the first as trustworthy and the second as junk, and both calls will be wrong. My fix was to keep domain reputation as one input but never the deciding one, and to surface the gap between domain score and article-level signals as a flag for human review rather than absorbing it into a single number.
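
A minimal sketch of surfacing that gap instead of folding it into one number. It assumes both scores sit on the same 0 to 10 scale; `article_score`, the 3-point tolerance, and the function name are hypothetical.

```python
def reputation_gap_flag(domain_score, article_score, max_gap=3.0):
    """Compare domain-level reputation with article-level signals (0 to 10)."""
    gap = article_score - domain_score
    return {
        "domain_score": domain_score,
        "article_score": article_score,
        "gap": gap,
        # A large gap in either direction goes to a human, not into the composite.
        "needs_review": abs(gap) >= max_gap,
    }


# Credible outlet, poorly sourced opinion piece.
print(reputation_gap_flag(8.0, 3.0))
# Unreliable outlet, properly sourced investigation.
print(reputation_gap_flag(2.0, 7.5))
```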

Content recency is the axis that looks easy and isn’t. The naive version is publication date: newer is better. This breaks immediately because the relevant freshness depends on the claim type. For a scientific claim, the most authoritative source is often a meta-analysis that’s three years old, not a press release from yesterday. For a political quote, the original transcription matters more than the seventh outlet’s summary. For an active event, anything older than 24 hours is borderline useless. I ended up with category-specific freshness functions: a decay curve for news claims, a step function for scientific claims (peer-reviewed vs not), a flat weight for definitional claims. Still imperfect, but vastly more honest than a single recency parameter.
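
The three shapes can be sketched as follows. The one-day half-life, the 0.3 weight for non-peer-reviewed sources, and the category names are placeholder assumptions, not the tuned values.

```python
def freshness(category, age_days, peer_reviewed=False, half_life_days=1.0):
    """Return a freshness weight in [0, 1] for a source, by claim category."""
    if category == "news":
        # Decay curve: the weight halves every half_life_days.
        return 0.5 ** (age_days / half_life_days)
    if category == "scientific":
        # Step function: age is ignored, peer review is what counts.
        return 1.0 if peer_reviewed else 0.3
    if category == "definitional":
        # Flat weight: recency is irrelevant.
        return 1.0
    raise ValueError(f"unknown claim category: {category}")


print(freshness("news", age_days=1.0))
print(freshness("scientific", age_days=1095, peer_reviewed=True))
print(freshness("definitional", age_days=4000))
```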

Cross-source confirmation is the most powerful axis when it works and the most misleading when it doesn’t. The principle: a claim confirmed by N independent sources is stronger than the same claim from any one of them. The problem is independence is hard to verify automatically. Eight outlets running the same wire story are not eight independent confirmations, they are one source amplified. Two outlets owned by the same parent group with the same editorial line are not two independent confirmations either. My current approach is to cluster sources by likely independence (publisher, ownership, geographic origin, language family) and count distinct clusters rather than distinct URLs. It is still gameable, and a sufficiently coordinated influence operation can defeat it, but it kills the simplest forms of citation laundering.
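
A simplified sketch of counting clusters rather than URLs. It assumes each source record carries an `owner` and an optional `syndicated_from` field, both hypothetical, and it ignores the geographic and language dimensions for brevity.

```python
def independent_confirmations(sources):
    """Count distinct independence clusters instead of distinct URLs."""
    clusters = set()
    for source in sources:
        # A syndicated story belongs to the originating wire, whoever reprints it.
        clusters.add(source.get("syndicated_from") or source["owner"])
    return len(clusters)


wire_copies = [
    {"url": f"https://outlet{i}.example/story", "owner": f"Outlet {i}",
     "syndicated_from": "AFP"}
    for i in range(8)
]
same_parent = [
    {"url": "https://paper-a.example/report", "owner": "Group X"},
    {"url": "https://paper-b.example/report", "owner": "Group X"},
]

print(len(wire_copies + same_parent))                    # distinct URLs
print(independent_confirmations(wire_copies + same_parent))  # distinct clusters
```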

A couple of general lessons took longer to learn than they should have. The first: surfacing the scoring per axis to the end consumer matters more than producing a single composite score. Investigators trust a system that shows them where confidence is coming from, and stop trusting one that hands them an opaque verdict. The composite is for triage. The breakdown is for decision-making.

The second: calibration on real cases beats theoretical purity every time. I had axis weights I was proud of on paper that produced terrible results on actual disputed claims. The fix was to assemble a labeled set of cases where I knew the right answer and tune until the system tracked human judgment, not until the math felt elegant.
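
As an illustration of that tuning loop, here is a coarse grid search over the three axis weights against a labeled set. The data layout, the 0.25 grid step, and squared error as the objective are all assumptions. A real labeled set would be much larger and split by claim category.

```python
from itertools import product


def calibrate(cases, step=0.25):
    """cases: list of ((reputation, recency, confirmation), human_score),
    with every value in [0, 1]. Returns the weights that sum to 1 and
    minimise squared error against the human labels."""
    grid = [i * step for i in range(int(1 / step) + 1)]
    best, best_err = None, float("inf")
    for weights in product(grid, repeat=3):
        if abs(sum(weights) - 1.0) > 1e-9:
            continue  # only consider weights that sum to 1
        err = sum(
            (sum(w * x for w, x in zip(weights, axes)) - target) ** 2
            for axes, target in cases
        )
        if err < best_err:
            best, best_err = weights, err
    return best


# Toy labels in which human judgment happens to track confirmation only.
labeled = [((0.9, 0.1, 0.2), 0.2), ((0.1, 0.8, 0.7), 0.7), ((0.5, 0.5, 0.9), 0.9)]
print(calibrate(labeled))
```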

What axes are you using formally or informally that I haven’t named here, and where have you seen automated scoring systems fail in ways that matter?

reddit.com
u/jonathancheckwise — 4 days ago

EU AI Act: not all of it was delayed

Six days ago the Council and Parliament reached political agreement on the Digital Omnibus on AI, the package that amends the AI Act. The reporting around it has been all over the place, so writing this up for other founders who are not lawyers and want the plain version.

The short version: the August 2026 deadline is partly delayed, partly not. Knowing which part affects you matters.

What got pushed back.

High-risk AI systems under Annex III (employment, education, biometrics, critical infrastructure, law enforcement, justice, migration, essential services) move from August 2, 2026 to December 2, 2027, a 16-month postponement.

High-risk AI embedded in products covered by Annex I sectoral safety law (medical devices, machinery, toys, connected products) moves to August 2, 2028.

The watermarking obligation for providers of generative AI under Article 50(2) gets a four-month grace period. Compliance is now due December 2, 2026 instead of August 2.

What did not move.

All other Article 50 transparency obligations still apply from August 2, 2026. That includes deployer obligations: labeling deepfakes, disclosing AI-generated text on matters of public interest, informing users they are interacting with an AI system. If your product surfaces AI-generated content to end users, those obligations hit you in less than three months, omnibus or no omnibus.

Article 5 prohibitions remain. The omnibus actually adds new ones: nudifier tools, non-consensual intimate imagery, and CSAM generation will be banned from December 2026.

What’s still uncertain.

The deal still needs formal adoption by Council and Parliament, then legal-linguistic revision, then Official Journal publication. Expected in the coming weeks but not done.

National enforcement authorities are still being set up in most member states. How aggressively they’ll come out of the gate is a separate question from what the law says on paper.

The Code of Practice on transparency of AI-generated content is still being finalized, with a final version expected by June.

What I’m doing as a solo founder.

Treating December 2, 2026 as the binding date for watermarking and machine-readable marking of generated content. Treating August 2, 2026 as the binding date for any user-facing disclosure obligations. Not betting on a second delay since the political argument for delay has been spent.

Curious if anyone else in the sub got a different read on what changed, or what they think the deployer obligations will actually look like in practice once the Code of Practice is final.

reddit.com
u/jonathancheckwise — 8 days ago

I run an AI-based fact-checking platform and I refuse to let the LLM produce the verdict. Here's why.

After a year building a production fact-checking system, the single most counter-intuitive design decision I keep defending is this: the LLM in our pipeline never produces a numeric score, never produces a true/false verdict, never produces anything that gets surfaced to the user as a judgment. The LLM extracts structured factual flags from source material. A deterministic Python scoring layer turns those flags into a verdict tier. That’s it.

This is uncomfortable to explain because everyone, including potential customers, assumes that “AI-powered fact-checking” means the AI gives the verdict. The pitch would be cleaner if I let the LLM say “this claim is 73% likely false” and called it a day. But here’s why I won’t.

LLM scoring instability is real and underdocumented. Run the same prompt with the same model on the same claim five times and you get verdicts ranging from “mostly false” to “partially true” depending on sampling temperature and the order in which sources appear in the context window. This is fine for creative writing. It is catastrophic when a journalist needs to defend their decision to publish or kill a story. “Our scoring varies by 30% based on stochastic sampling” is not a sentence you can put in front of an editorial board.

LLM verdicts are also unauditable. When the LLM says “false,” there is no way to point at which sources mattered, which signals pushed the score, which weights applied. The reasoning chain is opaque even with chain-of-thought prompting, because the chain itself is generated probabilistically and may rationalize after the fact rather than reflect the actual computation. Journalists I’ve spoken with don’t want a confident AI verdict. They want a verifiable verdict.

Those are different things.

The split I landed on is this. The LLM is good at extraction. Given a source document and a claim, it can flag “this source confirms X,” “this source contradicts Y,” “this source is silent on Z” with reasonable consistency. These flags are structured (booleans or short categorical labels), not numeric scores. The Python scoring layer takes those flags, applies pre-defined weights based on source credibility (independently computed from MBFC, NewsGuard, RSF, Wikidata cross-referencing), and produces a verdict tier. The weights are documented. The scoring rules are deterministic. The same input always produces the same output. Anyone can audit which sources contributed how much to a given verdict.
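
To make the split concrete, a minimal sketch of a deterministic scoring layer follows. The flag vocabulary mirrors the paragraph above, but the tier names, the cut-offs, and the example weights are hypothetical. In the real system the weights come from the separately computed source credibility.

```python
FLAG_VALUE = {"confirms": 1.0, "contradicts": -1.0, "silent": 0.0}

# Hypothetical tier boundaries on a score in [-1, 1], checked in order.
TIERS = [(-0.6, "false"), (-0.2, "mostly false"), (0.2, "mixed"), (0.6, "mostly true")]


def verdict(flags):
    """flags: list of (source, flag, credibility_weight) produced by extraction.

    Returns the verdict tier plus each source's contribution, so the
    result can be audited line by line.
    """
    contributions = [(src, FLAG_VALUE[flag] * w) for src, flag, w in flags]
    voiced = sum(w for _, flag, w in flags if flag != "silent")
    if voiced == 0:
        return "insufficient evidence", contributions
    score = sum(c for _, c in contributions) / voiced
    for upper, tier in TIERS:
        if score <= upper:
            return tier, contributions
    return "true", contributions


flags = [("agency-a", "confirms", 0.9), ("daily-b", "contradicts", 0.4),
         ("blog-c", "silent", 0.2)]
tier, trail = verdict(flags)
print(tier)
for source, contribution in trail:
    print(source, contribution)
```

There is no sampling anywhere in this path, so the same flags always yield the same tier and the same audit trail.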

The trade-off is real. The system is less flexible than letting the LLM “reason” freely. Edge cases where the claim doesn’t fit the categorical extraction schema sometimes produce awkward outputs. The scoring weights themselves are a design choice that embeds assumptions, and changing them requires deliberate engineering rather than retraining. But these are honest constraints, visible to the user, rather than hidden non-determinism dressed up as objectivity.

I think this matters beyond fact-checking. Any high-stakes domain where AI is being used to produce decisions (credit scoring, hiring filters, medical triage, legal triage) faces the same fundamental choice: let the LLM produce the score and hope nobody notices the stochasticity, or constrain the LLM to extraction and put the decision logic somewhere auditable. The industry mostly does the first thing because it ships faster. I think the second approach is the only one defensible long-term, especially under the EU AI Act which is going to start requiring decision explainability in production systems within the next 18 months.

Curious if anyone here is building similar deterministic-on-top-of-LLM architectures in other domains, or if there are counter-arguments I’m missing. The “let the LLM decide” school has obvious advantages I’m probably under-weighting.

reddit.com
u/jonathancheckwise — 9 days ago

Checkwise: EU-made fact-checking tool.

Hi all, dropping this here because the sub seems to value EU-made projects and I’d genuinely like feedback from fellow EU founders.

Built Checkwise over the past year as a solo founder out of Geneva. It’s a fact-checking platform with three things in one place: claim verification with sources and audit trail, website credibility rating, AI-generated image detection. Free tier live, and API access available.

What makes it EU-made structurally

Infrastructure is fully European (Infomaniak Geneva for the platform, Scaleway for GPU inference). Self-hosted Appwrite. EU-based LLM (Mistral) for the analysis layer. The data sovereignty isn’t a marketing line, it’s how the system was designed from day one.

On the methodology side: deterministic Python scoring instead of LLM-generated verdicts. The model returns structured factual flags only, the verdict tier is computed by code, which makes runs reproducible. Audit trail by default, every report shows what was checked and how. Six verdict tiers instead of true/false because real claims rarely live at the extremes.

Tracking AI Act Article 50 (August 2026, watermark labelling requirement) closely as it affects the image detection roadmap.

Would love feedback from fellow EU founders, especially anyone who’s wrestled with the EU vs US infra tradeoffs themselves, or who works in journalism / fact-checking / OSINT day to day. Bug reports, criticism, and “this would be more useful if X” all welcome.

The browser extension is part of the platform but not yet on the Chrome or Firefox stores (in review). DM me if you want the build directly.

reddit.com
u/jonathancheckwise — 12 days ago

What it does

Checkwise checks claims, rates websites for credibility, and detects AI-generated images. Three things in one place, designed to give citizens, journalists, fact-checkers, and editorial teams a more reliable and objective way to verify what’s in front of them. The fact-check returns sources, a verdict tier, and an audit trail you can share or export. The site rating gives you a credibility score with the underlying signals (ownership, age, transparency, registration data, third-party reputation). The image detection runs an ensemble of models plus non-ML signals like ELA and FFT spectral analysis.

Who it’s for

Working journalists who don’t have time to manually triangulate ten sources before publishing. Editorial teams that need an audit trail for what they verified and how. Fact-checkers building dossiers. OSINT researchers checking domains and images at scale. Educators teaching source evaluation. Anyone who wants more than vibes when they’re deciding whether to trust something they read.

What’s different

Fully EU-hosted (Geneva). Deterministic Python scoring, the LLM never produces a numeric verdict, only structured factual flags that get scored by code, which makes runs reproducible. Six verdict tiers instead of true/false, because real claims rarely live at the extremes. Audit trail by default, every report shows what was checked and how. Free tier that actually works for individuals, paid tiers for teams and API access.

Call to action

It’s at checkwise.ai. Free to use, free to break, free to tell me it’s broken. I’d genuinely value feedback from anyone who works with sources, claims, or verification day to day. Bug reports, missing features, things that don’t work the way you’d expect, all welcome.

Quick note: a browser extension is also part of the platform, but it’s not on the Chrome or Firefox stores yet (still in review). If you’d find it useful, DM me and I’ll send you the build directly. 🙂

reddit.com
u/jonathancheckwise — 13 days ago
▲ 5 r/OSINT+1 crossposts

Sharing what I’ve learned over the past few months building a detection system for AI-generated images, in case it’s useful to anyone working in similar territory.

Why ensemble

The instinct is to pick the SOTA model on whatever benchmark you trust and ship that. The problem is that single models fail in correlated ways. They’re trained on overlapping datasets, they share architectural assumptions, and when they miss, they all miss the same kind of image. Adversarial examples that fool one CLIP-style detector tend to fool others.

I went with a weighted ensemble of multiple architectures plus two non-ML signals (Error Level Analysis and FFT-based spectral analysis). The classical signal processing layer catches a different class of artifacts entirely, things that don’t show up in embedding-based detectors at all. JPEG re-compression patterns, frequency anomalies in synthetic images, that kind of thing. Cheap to compute, surprisingly useful as a tiebreaker.
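
A rough sketch of the weighted vote with the classical signals as tiebreaker. It assumes every detector and both classical signals have already been reduced to a score in [0, 1]. The weights, the 0.1 margin, and the function name are illustrative, and computing ELA or the spectral score itself is out of scope here.

```python
def weighted_verdict(model_scores, weights, classical_scores,
                     threshold=0.5, margin=0.1):
    """Return True if the image is judged AI-generated."""
    total = sum(weights[name] for name in model_scores)
    score = sum(model_scores[name] * weights[name] for name in model_scores) / total
    if abs(score - threshold) >= margin:
        return score >= threshold  # the learned models are decisive
    # Too close to call: let the classical signals break the tie.
    classical = sum(classical_scores.values()) / len(classical_scores)
    return classical >= threshold


weights = {"clip_style": 2.0, "cnn": 1.0}
classical = {"ela": 0.2, "fft": 0.3}

print(weighted_verdict({"clip_style": 0.9, "cnn": 0.8}, weights, classical))
print(weighted_verdict({"clip_style": 0.55, "cnn": 0.5}, weights, classical))
```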

Fine-tuning matters more than picking the right base

I fine-tuned my own classifier head on a curated set covering the main current generators. That’s what closed the gap on edge cases that off-the-shelf detectors consistently miss. The fine-tuning dataset was relatively small but tight: each generator represented with images that span the failure modes I’d seen in the wild. Quality of labeling beat quantity by a significant margin.

The thing nobody tells you

Don’t optimize for accuracy first, optimize for false positive rate. In this domain, false positives are catastrophic. Wrongly flagging a journalist’s authentic photo as AI-generated does more reputational damage than missing a generated one. I tune the ensemble thresholds explicitly to keep FPR near zero, even when it costs a few points of recall.
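
One simple way to tune for false positive rate first is to set the threshold from scores on a held-out set of known-authentic images only. The sketch below assumes an image is flagged when its score is strictly above the threshold. The function name and the toy scores are made up for illustration.

```python
def threshold_for_fpr(authentic_scores, max_fpr=0.01):
    """Lowest threshold that keeps FPR on the authentic set at or below max_fpr.

    authentic_scores: ensemble scores for images known to be authentic.
    An image is flagged when its score is strictly greater than the threshold.
    """
    if not authentic_scores:
        raise ValueError("need at least one authentic validation score")
    ranked = sorted(authentic_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # false positives we tolerate
    if allowed >= len(ranked):
        return float("-inf")
    # At most `allowed` authentic scores are strictly above ranked[allowed].
    return ranked[allowed]


authentic = [0.1, 0.2, 0.3, 0.9]
print(threshold_for_fpr(authentic, max_fpr=0.25))
print(threshold_for_fpr(authentic, max_fpr=0.0))
```

Recall on generated images is then whatever that threshold leaves, which is the trade described above.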

Also, EXIF and metadata are auxiliary signals at best. They’re trivially stripped or forged. Don’t gate decisions on them.

The moving target

The hardest part of this work is that the goalpost moves every few weeks. New generators ship, old detection signatures degrade, and what worked last quarter quietly stops working. Continuous fine-tuning isn’t a nice-to-have, it’s the only honest answer if you want a system that holds up over time. Anyone claiming a one-shot detector that handles every current and future generator is selling something.

This is part of a fact-checking platform I’m building (Checkwise, checkwise.ai). Image detection is one component alongside text claim verification and source rating. Happy to answer specific questions if anyone’s working on similar problems.

checkwise.ai
u/jonathancheckwise — 19 days ago