u/Full_Piano_3448

Experimenting with egocentric video

Hey guys,

With robotics growing so fast, first-person (egocentric) vision is becoming a massive domain in CV on its own. If robots are ever going to help us in the real world, they need to understand how humans handle objects from our own perspective.

I've been deep in experimentation mode, building and testing a CV pipeline on egocentric video from scratch for simple everyday tasks (annotation -> model training -> implementation)!

For this project, I focused on a simple, everyday task: opening and closing a bottle cap. Here is a quick look at the video showing the real-time tracking and state changes in action:

  • Data Annotation: I started by capturing raw egocentric footage. To get clean bounding boxes for the bottle and cap across the sequence, I used Labellerr. It made handling the frame-by-frame labeling smooth and kept the dataset precise.
  • Model Training & Tracking: I paired object detection for the assets (bottle and cap) with hand skeleton tracking to map exactly how the fingers grasp and interact with the objects.
  • State Logic Building: Once the spatial coordinates were tracking properly, I built custom state machine logic on top of them. The system differentiates between IDLE, OPENING THE BOTTLE, and CLOSING THE BOTTLE based on hand-to-object intersections and hand velocity.
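
For anyone curious what that last step can look like, here is a minimal sketch of the idea, not the actual project code. It assumes axis-aligned bounding boxes for the hand and the cap, plus a signed "twist velocity" derived from the hand keypoints. The names `Box`, `boxes_intersect`, `next_state`, the `min_speed` threshold, and the convention that a negative velocity means opening are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    OPENING = "OPENING THE BOTTLE"
    CLOSING = "CLOSING THE BOTTLE"


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x1: float
    y1: float
    x2: float
    y2: float


def boxes_intersect(a: Box, b: Box) -> bool:
    """Return True when the two boxes overlap with non-zero area."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def next_state(hand: Box, cap: Box, twist_velocity: float,
               min_speed: float = 5.0) -> State:
    """Classify one frame from hand/cap overlap and signed hand velocity.

    Assumption: negative velocity means an opening twist and positive
    means a closing twist. A real setup would calibrate this sign and
    the min_speed threshold against the camera and hand model used.
    """
    # No contact with the cap, or the hand is barely moving: nothing happening.
    if not boxes_intersect(hand, cap) or abs(twist_velocity) < min_speed:
        return State.IDLE
    return State.OPENING if twist_velocity < 0 else State.CLOSING


if __name__ == "__main__":
    hand = Box(0, 0, 10, 10)
    cap = Box(5, 5, 15, 15)
    print(next_state(hand, cap, -8.0).value)
    print(next_state(hand, cap, 8.0).value)
    print(next_state(hand, cap, 1.0).value)
```

In practice a per-frame rule like this tends to flicker, so requiring the same label for several consecutive frames before switching state is a common addition.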

This is one of many egocentric video examples I am experimenting with (feel free to suggest ideas for others).

Would love to hear your thoughts! Are any of you working on egocentric datasets or robotics perception pipelines right now? What are the biggest bottlenecks you’re running into with first-person data?

Resources:
- video: link
- code: link

u/Full_Piano_3448 — 3 days ago

Hey everyone, following up on my earlier comparison of top depth estimation models on Hugging Face: several of you asked how they perform in complex outdoor environments. To explore that further, I’m sharing this video showing how these models handle complex real-world scenes.

------------------------

Also check out my video + code here:

Video: https://www.youtube.com/watch?v=WQTadQi0MCg
Notebook: https://github.com/Labellerr/Hands-On-Learning-in-Computer-Vision/blob/main/Model%20Notebooks/Depth_Estimation/depth-estimation-model-comparison.ipynb

u/Full_Piano_3448 — 17 days ago

Recently I was working on a computer vision task that relied heavily on depth estimation. If you've scrolled through Hugging Face lately, you know there are dozens of models out there all claiming to be state-of-the-art. Honestly, it was getting overwhelming to figure out which one to actually use in production.

Instead of just guessing, I decided to build a notebook + video and run a side-by-side comparison of the 5 most-downloaded depth estimation models to see how they actually handle complex scenes (like overlapping objects, stacked books, and weird fabric curves).

I compared:

  • Apple's Depth Pro
  • Depth Anything V2 (Large)
  • Depth Anything V1 (Large)
  • Intel's ZoeDepth (NYU/KITTI)
  • Intel's DPT Hybrid Midas
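
One practical note when lining these up: some of these models predict metric depth and others relative depth, so the raw values are not directly comparable. A minimal sketch of one way to handle that for visualization is to min-max scale each map to [0, 1] before plotting. This is an illustration rather than the notebook's code; the helper name `normalize_depth` is hypothetical, and it uses plain nested lists so it runs without extra dependencies.

```python
def normalize_depth(depth_map):
    """Min-max scale a 2D list of depth values to the range [0, 1].

    This only makes maps visually comparable; it does not convert
    relative depth to metric depth.
    """
    flat = [value for row in depth_map for value in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # Constant map: avoid division by zero and return all zeros.
        return [[0.0 for _ in row] for row in depth_map]
    return [[(value - lo) / span for value in row] for row in depth_map]


if __name__ == "__main__":
    raw = [[1.0, 3.0], [5.0, 9.0]]
    print(normalize_depth(raw))
```

Whether larger values mean nearer or farther can also differ between models, so it is worth checking each model card before comparing the normalized maps.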

Hopefully, this saves some of you the headache of running all these experiments yourselves! Let me know if you guys have a go-to depth model that I missed.
------------------------------------------------------------------------

Video: https://www.youtube.com/watch?v=WQTadQi0MCg
Notebook: https://github.com/Labellerr/Hands-On-Learning-in-Computer-Vision/blob/main/Model%20Notebooks/Depth_Estimation/depth-estimation-model-comparison.ipynb

u/Full_Piano_3448 — 19 days ago