u/Dry_Roof_1382

Insufficient data but suspiciously good metrics?

My research center is running a battery development project, and I've been tasked with using ML to regress battery capacity on a set of variables. I started out experimenting with my own custom models, but was then told to first try replicating the methodology of a published research paper.

The thing is that the paper reports using only 90 samples collected from different labs, and 22 of them contain missing values (?). That's a severe data shortage, yet the authors somehow report R^(2) = 0.83 and pretty good RMSEs / MAEs with gradient boosting models.

What do you think about this? I personally suspect the authors cherry-picked a seed that happened to give good metrics. Or are GBMs really powerful enough to work with only a few dozen samples?

u/Dry_Roof_1382 — 4 days ago