u/balazshimself

Estimating crowd size by eye is notoriously hard. I found a CNN called P2PNet that detects people's heads, and built a custom pipeline on top of it to predict occluded people and reconstruct an approximate 3D scene.

Pipeline overview

1. P2PNet detection gives 2D head points.
2. Depth Pro (Apple's metric monocular depth model) gives metric depth (Z) per pixel.
3. Head points are back-projected to world-space XYZ using depth and focal length.
4. RANSAC fits the dominant ground plane from the head point cloud.
5. World scale is corrected using a maximum real-world crowd density of 6.5 people/m².
6. Shadow-offset DBSCAN clusters the crowd: offset centers are computed per person by projecting their occlusion shadow forward, which bridges the gaps that appear between rows of people at depth due to sparse data and the low camera angle.
7. Alpha shapes (Delaunay + circumradius threshold) trace concave hulls around each crowd cluster; interior voids naturally emerge as obstacle holes.
8. A heatmap is built from the DBSCAN per-point densities, densities in missing regions are interpolated, and occluded people are populated using Poisson sampling.
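
To make the back-projection and plane-fitting steps concrete, here is a minimal pure-Python sketch, not the actual pipeline code. It assumes a simple pinhole camera with a known principal point `(cx, cy)`; the function names, the 0.15 m inlier threshold, and the iteration count are all made up for illustration. The returned inlier ratio is the number the 60% check under "Known limitations" would look at.

```python
import math
import random


def back_project(u, v, z, f, cx, cy):
    """Pinhole back-projection: pixel (u, v) with metric depth z -> camera-space XYZ.

    f is the focal length in pixels; (cx, cy) is the principal point (assumed known).
    """
    return ((u - cx) * z / f, (v - cy) * z / f, z)


def plane_from_points(p, q, r):
    """Plane through three points as (unit normal n, offset d) with n.x + d = 0.

    Returns None when the points are (nearly) collinear.
    """
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm < 1e-9:
        return None
    n = (nx / norm, ny / norm, nz / norm)
    d = -(n[0] * p[0] + n[1] * p[1] + n[2] * p[2])
    return n, d


def ransac_plane(points, threshold=0.15, iterations=200, seed=0):
    """Fit the dominant plane; returns ((normal, d), inlier_ratio).

    threshold is the max point-to-plane distance in metres (illustrative value).
    """
    rng = random.Random(seed)
    best_plane, best_inliers = None, 0
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = sum(
            1 for p in points
            if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d) <= threshold
        )
        if inliers > best_inliers:
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers / len(points)
```

Usage would be something like `plane, ratio = ransac_plane(points)` followed by flagging the fit for review when `ratio < 0.6`.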

The shadow-offset trick (step 6) is the part I haven't seen elsewhere. Plain DBSCAN breaks crowd clusters at depth because row-to-row gaps exceed the search radius. My original idea was a pill-shaped search area, but it turned out to be faster to shift each person's search center to the midpoint between their actual position and their shadow tip and scale the search radius linearly with depth, which also reconnects those rows.
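
Here is one possible reading of that idea as a small pure-Python sketch, not the actual implementation. The assumptions are mine: the camera sits above the origin of a 2D ground plane, the shadow tip comes from similar triangles using a hypothetical camera height and person height, the search radius is a base value plus half the shadow length (which grows linearly with depth), and the neighbour relation is symmetrised so the result doesn't depend on point order. All names and default values are illustrative.

```python
import math


def search_disk(pos, cam_height, person_height, base_eps):
    """Offset search centre and radius for one person at ground position (x, z).

    The line of sight over a head at distance d lands on the ground at
    d * cam_height / (cam_height - person_height), so the occlusion shadow has
    length d * ratio with ratio = person_height / (cam_height - person_height).
    Requires cam_height > person_height.
    """
    ratio = person_height / (cam_height - person_height)
    depth = math.hypot(pos[0], pos[1])
    shadow_len = depth * ratio
    # Midpoint between the person and their shadow tip (tip = pos * (1 + ratio)).
    centre = (pos[0] * (1 + 0.5 * ratio), pos[1] * (1 + 0.5 * ratio))
    return centre, base_eps + 0.5 * shadow_len


def shadow_offset_dbscan(positions, min_pts=3, cam_height=10.0,
                         person_height=1.7, base_eps=0.8):
    """DBSCAN-style clustering with per-point offset disks; -1 marks noise."""
    n = len(positions)
    neighbours = [set() for _ in range(n)]
    for i, pos in enumerate(positions):
        centre, eps = search_disk(pos, cam_height, person_height, base_eps)
        for j, other in enumerate(positions):
            if i != j and math.dist(centre, other) <= eps:
                neighbours[i].add(j)
                neighbours[j].add(i)  # symmetrised (an assumption of this sketch)

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) + 1 < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) + 1 >= min_pts:
                frontier.extend(neighbours[j])
    return labels
```

Setting `person_height=0.0` removes the shadow and falls back to ordinary DBSCAN with radius `base_eps`, which makes it easy to compare the two on the same points: two rows of people 3 m apart split into two clusters without the offset and merge into one with it.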

Output

The frontend renders a density-zoned map over the image: detected people, auto-generated obstacle polygons (holes in the alpha shape), occlusion shadow zones with predicted counts, and a confidence interval. AI assumptions are editable objects: the analyst can delete clusters and override predicted densities. I'm currently extending this to boundary editing and to placing a POI to adjust the attenuation model. Modifications are logged to an audit trail that ships with the export.

Known limitations

- Ground plane assumption breaks on stairs and tiered seating (RANSAC fit flagged when inlier ratio < 60%)
- Single image only at this stage — video fusion is the next thing I'm building
- My method doesn't model crowd dynamics at the scale of individuals. Calculating real individual positions may require an iterative approach, which goes against optimizing for speed

Resources

- evolving blog post with up-to-date info: https://www.balazshimself.com/blog/crowd-predictor
- MVP tool: https://www.crowdcounting.net

Any feedback is welcome! Thanks for your time!

Combined P2PNet + Apple's Depth Pro to reconstruct crowds in 3D and predict people hidden behind obstructions — from a single image