u/Chocolate_Milk_Son

Getting good predictions without data cleaning (Why "Garbage In, Garbage Out" is sometimes a trap)

Full arXiv Preprint: https://arxiv.org/abs/2603.12288

Paper Simulation Github: https://github.com/tjleestjohn/from-garbage-to-gold

Hi r/artificial,

It's a dirty little secret for many of us... sometimes, downstream AI/ML models perform surprisingly well when you just hand them raw, error-prone tabular data instead of heavily curated feature sets. Despite this, the vast majority of our field remains fiercely loyal to "Garbage In, Garbage Out" (GIGO). While automated ETL pipelines are absolutely essential for structuring data, our workflows are still bottlenecked by endless manual cleaning and aggressive imputation just to curate pristine, error-free tables.

My co-authors and I recently released a preprint on arXiv (From Garbage to Gold) arguing that treating GIGO as a universal law can sometimes be a trap... especially in the context of big data (many columns), where the manual data cleaning bottleneck can actively lower the predictive ceiling of our models when latent causes drive the system's behavior.

To be clear upfront: we are not arguing against ETL. Parsing JSON, handling schema evolution, and standardizing types are non-negotiable.

What we are arguing against is the universal assumption that "clean" data (via manual data scrubbing and aggressive imputation) is non-negotiable for big data predictive AI/ML modeling.

Here is why the traditional mindset can be limiting:

1. We conflate two different types of "noise" (Predictor Error and Structural Uncertainty).

Usually, we just lump all noise into one big bucket. But if you split that noise into two specific categories, the math changes completely:

  • Predictor Error: Random typos, dropped logs, or transient glitches.
  • Structural Uncertainty: The inherent, unresolvable gap between recorded metrics and the complex, hidden reality they represent.

We spend months manually scrubbing data because the threat of data errors is obvious, while Structural Uncertainty is often an afterthought at best. However, when latent causes drive a system, manual scrubbing fixes Predictor Error, but it fundamentally cannot fix Structural Uncertainty.

On the other hand, the paper shows that in this context, if you use a comprehensive, high-dimensional data architecture, a flexible model can actually triangulate the hidden drivers reliably despite the presence of data errors. When you keep a massive number of messy, highly correlated variables (even if error-prone), the sheer volume of redundant signals allows the model to drown out individual errors (bypassing the cleaning bottleneck) and simultaneously overcome Structural Uncertainty.
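To make the triangulation idea concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not code or results from the paper): one hidden driver generates both the outcome and 200 error-riddled proxy columns, and even the crudest pooling of the messy battery recovers the driver better than a single clean proxy can.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hidden driver of the system -- never observed directly.
z = rng.normal(size=n)
y = 2.0 * z + rng.normal(scale=0.5, size=n)  # outcome driven by z

# Scenario A: one "clean" proxy. Zero Predictor Error, but it still
# carries Structural Uncertainty (an irreducible gap from z).
clean_proxy = z + rng.normal(scale=1.0, size=n)

# Scenario B: 200 messy, highly correlated proxies -- the same
# structural gap PLUS heavy Predictor Error in every column.
p = 200
messy = (z[:, None]
         + rng.normal(scale=1.0, size=(n, p))   # structural gap
         + rng.normal(scale=2.0, size=(n, p)))  # "garbage" errors

# Pooling the redundant columns triangulates z; the crudest possible
# pooling (a plain column average) is already enough in this toy case.
triangulated = messy.mean(axis=1)

def r2_vs_y(x):
    # Squared correlation = R^2 of the best single-predictor linear fit.
    return np.corrcoef(x, y)[0, 1] ** 2

print(f"one clean proxy    R^2: {r2_vs_y(clean_proxy):.2f}")
print(f"200 messy proxies  R^2: {r2_vs_y(triangulated):.2f}")
```

In this setup the independent errors average out across columns, so the messy battery approaches the ceiling set by z itself, while the clean single proxy stalls well below it.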

This redefines "data quality." It's not only about how accurately the variables are measured. It's also about how the portfolio of variables comprehensively and redundantly covers the latent drivers of the system.

2. Manual cleaning is a bottleneck on dimensionality (The Practical Problem).

To overcome Structural Uncertainty, modern AI/ML models want to find the underlying latent drivers of a system (think Representation Learning but with tabular data). To do this, however, they need a high-dimensional set of variables that contains Informative Collinearity in order to mathematically triangulate the hidden drivers.

The moment you introduce manual cleaning, you create a human bottleneck. Because we cannot manually clean 10,000 variables, we are forced to drop 9,900 of them. By artificially restricting the predictor space to make it "clean enough to model," we can harm the data architecture's inherent potential to triangulate those latent drivers. We sacrifice the model's actual predictive ceiling just to satisfy the GIGO heuristic.
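Here is a hedged sketch of that trade-off (again my own toy numbers, not the paper's simulation): suppose the hand-cleaning budget covers only 10 columns, while the raw layer holds 5,000 error-riddled ones. Pooling the raw battery beats the small, perfectly scrubbed set.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4_000
z = rng.normal(size=n)                 # latent driver
y = z + rng.normal(scale=0.4, size=n)  # outcome

def proxy_columns(p, error_sd):
    # Each column = z + structural gap + predictor error.
    return (z[:, None]
            + rng.normal(scale=2.0, size=(n, p))        # structural gap
            + rng.normal(scale=error_sd, size=(n, p)))  # predictor error

hand_cleaned = proxy_columns(10, error_sd=0.0)     # all we can afford to scrub
raw_battery  = proxy_columns(5_000, error_sd=2.0)  # straight from the raw layer

def pooled_r2(X):
    z_hat = X.mean(axis=1)  # crude triangulation of the latent driver
    return np.corrcoef(z_hat, y)[0, 1] ** 2

print(f"10 cleaned columns  R^2: {pooled_r2(hand_cleaned):.2f}")
print(f"5,000 raw columns   R^2: {pooled_r2(raw_battery):.2f}")
```

The cleaned set has zero Predictor Error, but with only 10 columns its pooled structural gap barely shrinks; the raw battery's errors and gaps both wash out across 5,000 redundant columns.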

Ultimately, this suggests we should focus mostly on extracting, loading, and increasing observational fidelity with automated tools, but that, in contexts characterized by latent drivers, we should stop letting manual cleaning bottlenecks restrict the scale of our AI/ML models.

Thoughts? Have you run into situations where your data science teams actually got better predictive results by bypassing the manually cleaned tables and pulling massive dimensionality straight from the raw ELT layers?

I'd love to hear your experiences or thoughts. Happy to discuss all serious comments or questions.

Full disclosure: the preprint is a 120-page beast. It’s long because it doesn't just pitch the core theory with a qualitative argument; it gives everything a full mathematical treatment, which takes space. We also dig into edge cases (what happens when assumptions like Local Independence are violated, e.g., when systematic errors exist), broader implications (like a link to Benign Overfitting, and efficient feature selection strategies that make this high-dimensional strategy practical with finite compute), a deep-dive simulation, failure modes, and a substantial agenda for future research (because we do not claim the paper is the final word on the matter).

It's a major commitment upfront but may save you time and money in the long term, while also enhancing the predictive ceiling of your tabular AI/ML models.

u/Chocolate_Milk_Son — 2 days ago

Full Paper: https://arxiv.org/abs/2603.12288

Paper Simulation Github: https://github.com/tjleestjohn/from-garbage-to-gold

Hi r/dataengineering,

It's an open secret among many of us... sometimes, downstream ML models perform surprisingly well when you just hand them raw, error-prone data instead of heavily curated feature sets. Despite this, our field is fiercely loyal to "Garbage In, Garbage Out" (GIGO). While automated ETL pipelines are absolutely essential for structuring data and increasing observational fidelity, we still bottleneck our workflows with endless manual cleaning and aggressive imputation just to curate pristine, error-free tables.

My co-authors and I recently released a preprint (From Garbage to Gold) arguing that treating GIGO as a universal law can sometimes be a trap... especially in the context of big data (many columns), where manual cleaning can actively lower the predictive ceiling of our models when latent causes drive the system's behavior.

To be clear upfront: we are not arguing against ETL. Parsing JSON, handling schema evolution, and standardizing types are non-negotiable.

What we are arguing against is the universal assumption that "clean" data (via manual data scrubbing and aggressive imputation) is non-negotiable for big data predictive ML modeling.

Here is why the traditional mindset can be limiting:

1. We conflate two different types of "noise" (Predictor Error and Structural Uncertainty).

Usually, we just lump all noise into one big bucket. But if you split that noise into two specific categories, the math changes completely:

  • Predictor Error: Random typos, dropped logs, or transient glitches.
  • Structural Uncertainty: The inherent, unresolvable gap between recorded metrics and the complex, hidden reality they represent.

We spend months manually scrubbing data because we treat all "bad data" as a single enemy. However, when latent causes drive a system, manual scrubbing fixes Predictor Error, but it fundamentally cannot fix the Structural Uncertainty inherent to the fixed predictor set.

On the other hand, the paper shows that in this context, if you use a comprehensive, high-dimensional data architecture, a flexible model can actually triangulate the hidden drivers reliably. When you keep a massive number of messy, highly correlated variables (even if error-prone), the sheer volume of redundant signals allows the model to drown out individual errors (bypassing cleaning) and simultaneously overcome Structural Uncertainty.

This redefines "data quality." It's not only about how accurately the variables are measured. It's also about how the portfolio of variables comprehensively and redundantly covers the latent drivers of the system.

2. Manual cleaning is a bottleneck on dimensionality (The Practical Problem).

To overcome Structural Uncertainty, modern ML models want to find the underlying latent drivers of a system (think Representation Learning but with tabular data). To do this, they need a high-dimensional set of variables that contains Informative Collinearity in order to mathematically triangulate the hidden drivers.
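To sketch what "mathematically triangulate" means here (a toy illustration under my own assumptions, not the paper's construction): if hundreds of error-prone columns all load on one hidden driver, that informative collinearity lets even a plain SVD pull the driver back out.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 3_000, 500

z = rng.normal(size=n)                    # hidden driver
loadings = rng.uniform(0.5, 1.5, size=p)  # each column reflects z differently
X = np.outer(z, loadings) + rng.normal(scale=2.0, size=(n, p))  # error-prone columns

# Informative Collinearity: all 500 columns share the z signal, so the
# top singular direction of the centered matrix triangulates z.
Xc = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(Xc, full_matrices=False)
z_hat = u[:, 0] * s[0]                    # first principal-component scores

# Sign of a singular vector is arbitrary, so compare magnitudes.
corr = abs(np.corrcoef(z_hat, z)[0, 1])
print(f"|corr(PC1, hidden driver)| = {corr:.2f}")
```

No cleaning happens anywhere in this sketch; the recovery comes purely from the redundancy across columns, which is the property a narrow, hand-curated feature set gives up.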

The moment you introduce manual cleaning, you create a human bottleneck. Because we cannot manually clean 10,000 variables, we are forced to drop 9,900 of them. By artificially restricting the predictor space to make it "clean enough to model," we harm the data architecture's inherent potential to triangulate those latent drivers. We sacrifice the model's actual predictive ceiling just to satisfy the GIGO heuristic.

Ultimately, this suggests DEs should focus mostly on extracting, loading, and increasing observational fidelity with automated tools, but that, in contexts characterized by latent drivers, we should stop letting manual cleaning bottlenecks restrict the scale of our ML models.

Thoughts? Have you run into situations where your data science teams actually got better predictive results by bypassing the manually cleaned tables and pulling massive dimensionality straight from the raw ELT layers?

I'd love to hear your experiences or thoughts. Happy to discuss all serious comments or questions.

Full disclosure: the preprint is a 120-page beast. It’s long because it doesn't just pitch the core theory with a qualitative argument; it gives everything a full mathematical treatment, which takes space. We also dig into edge cases (what happens when assumptions like Local Independence are violated, e.g., when systematic errors exist), broader implications (like a link to Benign Overfitting, and efficient feature selection strategies that make this high-dimensional strategy practical with finite compute), a deep-dive simulation, failure modes, and a substantial agenda for future research (because we do not claim the paper is the final word on the matter). It's a major commitment upfront but may save you time long term in practice.

u/Chocolate_Milk_Son — 9 days ago

Full Paper: https://arxiv.org/abs/2603.12288

Hi r/analytics,

"Garbage In, Garbage Out" is a deeply entrenched mindset. We spend up to 80% of our time cleaning tabular data because GIGO is obviously true. But... what if this idea is sometimes holding our models back?

It's not unheard of. I'm sure many of you have noticed your models sometimes perform surprisingly well on raw, uncurated data.

To help explain this, my co-authors and I recently released a preprint called From Garbage to Gold (G2G) that basically says that sometimes GIGO is wrong. The paper discusses when and why error-prone data can actually be used to create SOTA prediction models.

In the context of big data driven by latent causes, it turns out that aggressively cleaning your data can actually blind your models to the exact signals they need to see.

The core of the paper is about how we define "noisy" data. Usually, we just lump all noise into one big bucket. But if you split that noise into two specific categories, the math changes completely:

  • Category 1: Predictor Error. This is the classic garbage. Typos, sensor glitches, reporting delays, or just weird recording errors.
  • Category 2: Structural Uncertainty. This is the inherent, probabilistic gap between a predictor and the actual hidden force driving the system. Basically, even a "perfectly" measured variable is still just a limited, imperfect proxy for reality.

Here’s the catch: traditional cleaning only fixes Category 1. You can spend six months making a dataset "flawless," but your model is still going to hit a performance ceiling because you did nothing to solve for Category 2.
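A quick numerical sketch of that ceiling (my own toy numbers, not results from the paper): fixing Category 1 on a single proxy buys a little R², but the Category 2 gap baked into the proxy caps it far below what the hidden driver itself would deliver.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

z = rng.normal(size=n)                         # hidden driver
y = z + rng.normal(scale=0.3, size=n)          # outcome

proxy = z + rng.normal(scale=1.0, size=n)      # Category 2 only: structural gap
dirty = proxy + rng.normal(scale=0.5, size=n)  # plus Category 1: predictor error

def r2_vs_y(x):
    # R^2 of the best single-predictor linear fit of y on x.
    return np.corrcoef(x, y)[0, 1] ** 2

print(f"dirty proxy            R^2: {r2_vs_y(dirty):.2f}")  # garbage in
print(f"'flawless' proxy       R^2: {r2_vs_y(proxy):.2f}")  # Category 1 fixed
print(f"hidden driver (ideal)  R^2: {r2_vs_y(z):.2f}")      # unreachable ceiling
```

Months of scrubbing moves you from the first number to the second; the much larger gap to the third is Structural Uncertainty, which no amount of cleaning of this one proxy can close.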

Our paper shows that if you use a broad, high-dimensional architecture, a flexible model can actually triangulate the hidden truth: when you keep a massive number of messy, highly correlated variables (even if error-prone), the sheer volume of redundant signals allows the model to drown out individual errors (bypassing cleaning) and simultaneously overcome Structural Uncertainty.

Ultimately, this redefines "data quality." It's not only about how accurately the variables are measured. It's also about how the portfolio of variables comprehensively and redundantly covers the latent drivers of the system.

Full disclosure: the preprint is a 120-page beast. It’s long because it doesn't just pitch the core theory; it gives everything a full mathematical treatment, which takes space. We also dig into edge cases (what happens when assumptions like Local Independence are violated), broader implications (like a link to Benign Overfitting and efficient feature selection strategies), a deep-dive simulation, failure modes, and a substantial agenda for future research (because we do not claim the paper is the final word on the matter).

Would love to get your thoughts on this.

Happy to discuss or answer any serious questions.

u/Chocolate_Milk_Son — 10 days ago