u/ImpressionAccurate21


I've been hearing about synthetic data for a while but only recently started actually building with it instead of relying on clean Kaggle datasets that don't behave like real production data.

I built a grocery delivery dataset from scratch. 50,000 users, 5.8 million sessions, realistic behavioral parameters. The goal was to practice proper funnel analysis, cohort retention, and RFM segmentation on data that actually has noise and drop-off.

I ran the first funnel query. Every stage showed identical session counts: zero drop-off, 100% conversion from app open to order placed.

Spent time assuming it was a SQL error. It wasn't. The generator only created sessions when an order was placed. Every session in the dataset was already converted. Users who browsed and left without ordering simply didn't exist in the data.
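The failure mode is easier to see in code. This is a hypothetical sketch, not the actual generator; the stage names are made up, but it shows why coupled generation makes drop-off impossible:

```python
FUNNEL = ["app_open", "browse", "add_to_cart", "checkout", "order_placed"]

def buggy_sessions(n_orders):
    # Coupled generation: a session row is only created BECAUSE an order
    # exists, so every session carries the full funnel by construction.
    # Users who browsed and left never produce a row at all.
    return [{"events": list(FUNNEL)} for _ in range(n_orders)]

sessions = buggy_sessions(1000)
converted = sum("order_placed" in s["events"] for s in sessions)
print(converted / len(sessions))  # 1.0 — every session converts
```

Any funnel query over this data is guaranteed to show identical counts at every stage, no matter how correct the SQL is.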

Technically the data was correct. Analytically it was useless.

Had to rebuild the generator from scratch. Separated session generation from order generation completely. Added depth-based funnel logic where each session gets assigned a termination point before any events are generated. Added 50 behavioral signals weighted by user propensity, device, acquisition channel, and time of day.
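A minimal sketch of the depth-based idea. The stage names and continuation probabilities here are made up placeholders; in the real generator those probabilities would be weighted by user propensity, device, channel, and time of day rather than hard-coded:

```python
import random

FUNNEL = ["app_open", "browse", "add_to_cart", "checkout", "order_placed"]

# Hypothetical per-stage continuation probabilities (assumption, not the
# actual weights): the chance a session proceeds past each stage.
CONTINUE_P = [1.0, 0.80, 0.45, 0.55, 0.60]

def assign_termination(rng):
    # Decide the session's exit stage BEFORE generating any events,
    # so drop-off exists independently of whether an order happens.
    depth = 0
    for p in CONTINUE_P:
        if rng.random() > p:
            break
        depth += 1
    return depth  # number of funnel stages this session reaches

def generate_session(rng):
    return FUNNEL[:assign_termination(rng)]

rng = random.Random(42)
sessions = [generate_session(rng) for _ in range(100_000)]
conversion = sum(s[-1] == "order_placed" for s in sessions) / len(sessions)
print(f"{conversion:.1%}")  # product of the continuation probabilities, ~12%
```

Because the termination point is sampled first, orders fall out of session depth instead of the other way around, and every upstream stage keeps the sessions that never reached it.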

After the rebuild: 7.1% overall conversion. Android users drop off earlier in the funnel than iOS. Instagram has the lowest overall conversion but the highest checkout completion rate once users reach that stage.

The thing I didn't expect: auditing whether your data is valid and auditing whether your data is fit for analysis are two completely different steps. Most people only do the first one.
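A rough illustration of why the two audits are separate checks (hypothetical helper names and stage labels, not from the actual pipeline):

```python
FUNNEL = ["app_open", "browse", "add_to_cart", "checkout", "order_placed"]

def audit_validity(sessions):
    # Validity: integrity checks. Events are non-empty, known, and in
    # funnel order. The original broken dataset passes this easily.
    rank = {e: i for i, e in enumerate(FUNNEL)}
    return all(
        s and all(e in rank for e in s)
        and [rank[e] for e in s] == sorted(rank[e] for e in s)
        for s in sessions
    )

def audit_fitness(sessions):
    # Fitness: does the data exhibit the phenomenon the analysis needs?
    # A funnel with 0% or 100% conversion is valid but unanalyzable.
    converted = sum(s[-1] == "order_placed" for s in sessions) / len(sessions)
    return 0.0 < converted < 1.0

all_converted = [list(FUNNEL)] * 1000            # shape of the buggy dataset
with_dropoff = [FUNNEL[:3], list(FUNNEL)] * 500  # shape after the rebuild

print(audit_validity(all_converted), audit_fitness(all_converted))  # True False
print(audit_validity(with_dropoff), audit_fitness(with_dropoff))    # True True
```

The buggy dataset sails through the validity check and only fails the fitness one, which is exactly why running the first audit alone felt like enough.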

Full writeup on Medium if anyone wants the architecture details.

u/ImpressionAccurate21 — 12 days ago