u/Straight_Stable_6095

Robot perception just became a $249 commodity. What does that actually change?

Something quietly shifted in the last year that I don't think has gotten enough attention in discussions about robotics timelines.

Capable, real-time, multi-model robot vision now runs on a $249 device. Fully on-device. No cloud dependency.

I know because I built it.

OpenEyes runs on a Jetson Orin Nano 8GB:

  • Object detection + distance estimation
  • Depth mapping
  • Face detection
  • Gesture recognition
  • Full body pose estimation + activity inference

30-40 FPS. $249 hardware. MIT license.

Why this is a meaningful data point:

The cost and accessibility of robot perception has historically been a hard ceiling on who could build capable robots and what those robots could do. That ceiling just moved significantly.

Consider the trajectory:

  • 2018: capable robot vision = $10k+ compute, cloud dependent
  • 2021: capable robot vision = $500-1k, still largely cloud dependent
  • 2024: capable robot vision = $249, fully on-device

What the commoditization of perception unlocks:

Independent builders can now ship robots with real situational awareness. Not research labs. Not funded startups. Individual builders with $249 and a GitHub account.

The remaining gaps: manipulation, locomotion, reasoning. Perception was arguably the first domino.

The open question:

Commoditized perception + open-source LLMs for reasoning + increasingly affordable actuators. What's the realistic timeline to a capable general-purpose home robot built entirely from open-source components?

I'd genuinely argue we're closer than most non-roboticists think.

Full project if curious about the perception piece: github.com/mandarwagh9/openeyes

u/Straight_Stable_6095 — 23 hours ago
[Project] Vision pipeline for robots using OpenCV + YOLO + MiDaS + MediaPipe - architecture + code

Built a robot vision system where OpenCV handles the capture and display layer while the heavy lifting is split across YOLO, MiDaS, and MediaPipe. Sharing the pipeline architecture since I couldn't find a clean reference implementation when I started.

Pipeline overview:

python

import cv2
import threading
from ultralytics import YOLO
import mediapipe as mp

# Capture
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Full res path
    detections = yolo_model(frame)
    depth_map = midas_model(frame)

    # Downscaled path for MediaPipe
    frame_small = cv2.resize(frame, (640, 480))
    pose_results = pose.process(
        cv2.cvtColor(frame_small, cv2.COLOR_BGR2RGB)
    )

    # Annotate + display
    annotated = draw_results(frame, detections, depth_map, pose_results)
    cv2.imshow('OpenEyes', annotated)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

The coordinate remapping piece:

When MediaPipe runs on 640x480 but you need results on 1920x1080:

python

def remap_landmark(landmark, dst_size):
    # MediaPipe landmarks are normalized to [0, 1], so scaling by the
    # destination resolution is all that's needed - the source size cancels out
    x = landmark.x * dst_size[0]
    y = landmark.y * dst_size[1]
    return x, y

MediaPipe landmarks are normalized (0-1) so the remapping is straightforward.

Depth sampling from detection:

python

def get_distance(bbox, depth_map):
    cx = int((bbox[0] + bbox[2]) / 2)
    cy = int((bbox[1] + bbox[3]) / 2)
    depth_val = depth_map[cy, cx]
    
    # MiDaS gives relative (not metric) depth; assumes the map has been
    # normalized to [0, 1] per frame before sampling
    if depth_val > 0.7: return "~40cm"
    if depth_val > 0.4: return "~1m"
    return "~2m+"

Not metric depth, but accurate enough for navigation context.
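One caveat worth spelling out: raw MiDaS output isn't bounded to [0, 1], so you need a per-frame normalization step before the bucketing above, and a small median patch is less noisy than a single center pixel. A rough sketch (`normalize_depth` and `sample_depth` are illustrative names, not functions from the repo):

```python
import numpy as np

def normalize_depth(depth_map):
    """Scale MiDaS relative output to [0, 1] per frame."""
    d_min, d_max = depth_map.min(), depth_map.max()
    return (depth_map - d_min) / (d_max - d_min + 1e-6)

def sample_depth(bbox, depth_map, patch=5):
    """Median over a small patch at the bbox center - less noisy than one pixel."""
    cx = int((bbox[0] + bbox[2]) / 2)
    cy = int((bbox[1] + bbox[3]) / 2)
    h, w = depth_map.shape
    y0, y1 = max(0, cy - patch), min(h, cy + patch + 1)
    x0, x1 = max(0, cx - patch), min(w, cx + patch + 1)
    return float(np.median(depth_map[y0:y1, x0:x1]))
```

Per-frame normalization does mean the bucket thresholds shift with scene content, which is another reason the output is only good for coarse navigation context.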

Person following with OpenCV tracking:

python

# Note: in OpenCV >= 4.5 with contrib, this lives at cv2.legacy.TrackerCSRT_create()
tracker = cv2.TrackerCSRT_create()
# Initialize on owner bbox (x, y, w, h)
tracker.init(frame, owner_bbox)

# Update each frame
success, bbox = tracker.update(frame)
if success:
    navigate_toward(bbox)

CSRT tracker handles short-term occlusion better than bbox height ratio alone.
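One pattern that helps in practice is periodically re-seeding the tracker from a fresh YOLO person detection so drift doesn't accumulate. A minimal sketch of the selection logic (`should_reseed` and `pick_reseed_box` are hypothetical helpers, not code from the repo):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def should_reseed(frame_idx, tracker_ok, interval=30):
    """Re-init the tracker when it fails or on a fixed cadence."""
    return (not tracker_ok) or frame_idx % interval == 0

def pick_reseed_box(last_bbox, person_boxes, min_iou=0.3):
    """Choose the YOLO person box that best overlaps the last tracked box."""
    best = max(person_boxes, key=lambda b: iou(last_bbox, b), default=None)
    if best is not None and iou(last_bbox, best) >= min_iou:
        return best
    return None
```

On a re-seed hit you'd rebuild the CSRT tracker with the chosen box; if `pick_reseed_box` returns None, the owner is likely occluded and it's safer to stop than to lock onto a stranger.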

Hardware: Jetson Orin Nano 8GB, Waveshare IMX219 1080p

Full project: github.com/mandarwagh9/openeyes

Curious how others handle the sync problem between slow depth estimation and fast detection in OpenCV pipelines.

u/Straight_Stable_6095 — 23 hours ago

Multi-model inference optimization on Jetson Orin Nano - TensorRT INT8, parallel threading, resolution splitting

Sharing the optimization journey for a robot vision system running 5 models concurrently on constrained hardware. Some of this took longer to figure out than it should have.

Models:

  • YOLO11n (detection)
  • MiDaS small (depth)
  • MediaPipe Face, Hands, Pose

Hardware: Jetson Orin Nano 8GB, JetPack 6.2.2

Optimization 1: Resolution splitting

MediaPipe's models are tuned for 640x480 input. Running them at 1080p doesn't just slow things down - accuracy degrades too. The fix:

python

# Full res for YOLO + MiDaS
frame_full = capture(1920, 1080)

# Downscaled for MediaPipe
frame_small = cv2.resize(frame_full, (640, 480))

# Remap coordinates back after inference
detections_remapped = remap_coords(mediapipe_output, 
                                    src=(640,480), 
                                    dst=(1920,1080))

Coordinate remapping overhead: ~1ms. Worth it.

Optimization 2: TensorRT INT8

Biggest single performance gain. Pipeline:

bash

# Step 1: ONNX export
yolo export model=yolo11n.pt format=onnx

# Step 2: TensorRT INT8 conversion
# Note: trtexec's --calib flag expects an INT8 calibration cache file,
# not a directory of images - build the cache from the 150 deployment
# frames first, then point --calib at it
trtexec --onnx=yolo11n.onnx \
        --int8 \
        --calib=calibration.cache \
        --saveEngine=yolo11n_int8.engine

Calibration dataset: 150 frames from actual deployment environment. Indoor scenes, mixed lighting, cluttered surfaces.

Accuracy impact:

  • Large objects: negligible
  • Objects under ~30px: noticeable degradation
  • For navigation use case: acceptable

Speed: FP32 ~10 FPS → INT8 ~30-40 FPS

Optimization 3: Parallel threading

python

import threading
import queue

# Bounded queues so a slow worker drops frames instead of backing up
frame_q = queue.Queue(maxsize=1)
result_q = queue.Queue(maxsize=1)

def mediapipe_worker(frame_queue, result_queue):
    while True:
        frame = frame_queue.get()
        result = run_mediapipe(frame)
        result_queue.put(result)

mp_thread = threading.Thread(target=mediapipe_worker,
                             args=(frame_q, result_q),
                             daemon=True)
mp_thread.start()

Main thread never blocks on MediaPipe. Uses latest available result with a staleness flag.
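The "latest available result with a staleness flag" pattern can be sketched as a small standalone class (a minimal illustrative version, not the repo's actual implementation):

```python
import queue
import time

class LatestResult:
    """Keep only the newest worker result; reads never block."""

    def __init__(self, max_age=0.2):
        self.q = queue.Queue(maxsize=1)
        self.max_age = max_age   # seconds before a result counts as stale
        self._last = None
        self._stamp = 0.0

    def put(self, result):
        # Drop the unconsumed item, if any, so the queue holds only the newest
        try:
            self.q.get_nowait()
        except queue.Empty:
            pass
        self.q.put((result, time.monotonic()))

    def get(self):
        # Returns (result, is_stale); falls back to the last seen result
        try:
            self._last, self._stamp = self.q.get_nowait()
        except queue.Empty:
            pass
        stale = (time.monotonic() - self._stamp) > self.max_age
        return self._last, stale
```

The main loop calls `get()` every frame and can choose to skip pose-dependent logic when the staleness flag is set.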

Open problem:

Depth + detection sync. MiDaS runs slower than YOLO. Currently pairing each detection frame with the latest available depth map. This introduces a temporal mismatch on fast-moving objects.

Options I've considered:

  • Optical flow to compensate for motion between depth frames
  • Reduce MiDaS input resolution further
  • Replace MiDaS with a faster lightweight depth model
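The cheap version of the first option is to warp the stale depth map forward along dense optical flow before sampling it. A numpy-only sketch of the warp step - the flow field itself would come from something like cv2.calcOpticalFlowFarneback on consecutive gray frames, and `warp_depth` is a hypothetical helper, not repo code:

```python
import numpy as np

def warp_depth(depth, flow):
    """Warp a stale depth map forward along dense optical flow.

    flow[y, x] = (dx, dy) pixel motion since the depth frame was captured.
    Nearest-neighbour gather - cheap enough for constrained hardware.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # A pixel now at (x, y) came from (x - dx, y - dy) in the old frame
    src_x = np.clip((xs - flow[..., 0]).round().astype(int), 0, w - 1)
    src_y = np.clip((ys - flow[..., 1]).round().astype(int), 0, h - 1)
    return depth[src_y, src_x]
```

Whether the flow computation pays for itself depends on how fast the scene moves; for mostly-static indoor navigation it may cost more than the temporal mismatch it fixes.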

Anyone tackled this on constrained hardware?

Full project: github.com/mandarwagh9/openeyes

u/Straight_Stable_6095 — 23 hours ago
Moved my robot's vision from ESP32-CAM to Jetson Orin Nano - here's what changed

Started like most people do - ESP32-CAM for basic vision tasks. Face detection, simple object detection, cloud inference for anything heavier.

Hit the ceiling fast.

Moved to Jetson Orin Nano 8GB for the main vision compute. The gap is significant enough that it's worth writing up.

What ESP32-CAM handles fine:

  • Simple presence detection
  • Basic face detection (if you're okay with cloud)
  • Streaming video to a host machine

What it can't do:

  • On-device inference beyond the most basic models
  • Multi-model concurrent inference
  • Anything requiring depth or pose estimation
  • Real-time tracking without cloud dependency

What Jetson Orin Nano unlocks:

  • YOLO11n at 25-30 FPS on-device
  • MiDaS depth estimation concurrently
  • Full MediaPipe stack (face + hands + pose) in parallel
  • TensorRT INT8 optimization: 30-40 FPS full stack
  • ROS2 native integration

The ESP32 still lives in my robot stack - handling motor control, sensor reading, low-level I/O. Jetson handles vision exclusively. Clean separation.
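The Jetson-to-ESP32 link in this kind of split is typically just a line-based serial protocol. A hedged sketch of the command framing - the `V <linear> <angular>` format and limits here are assumptions for illustration, not the actual protocol, and the real send would go over a serial library like pyserial:

```python
def format_cmd(linear, angular):
    """Frame a velocity command for the motor-control MCU.

    Clamps to conservative limits so a vision glitch can't
    command full speed. Limits are illustrative.
    """
    linear = max(-0.5, min(0.5, linear))
    angular = max(-1.0, min(1.0, angular))
    return f"V {linear:.2f} {angular:.2f}\n"

def parse_cmd(line):
    """MCU-side counterpart (shown in Python for symmetry)."""
    parts = line.strip().split()
    if len(parts) != 3 or parts[0] != "V":
        return None
    return float(parts[1]), float(parts[2])
```

Keeping the protocol this dumb is part of the clean separation: the ESP32 never needs to know what a bounding box is.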

If you're building anything that needs real perception and you're hitting ESP32 limits, the Orin Nano at $249 is the honest next step. It's a full Linux computer rather than a microcontroller, but the jump is worth it.

Full vision stack open source: github.com/mandarwagh9/openeyes

What's everyone using for vision on more capable robot builds?

u/Straight_Stable_6095 — 23 hours ago
Running 5 CV models simultaneously on a $249 edge device - architecture breakdown

Been working on a vision system that runs the following concurrently on a single Jetson Orin Nano 8GB:

  • YOLO11n - object detection
  • MiDaS - monocular depth estimation
  • MediaPipe Face - face detection + landmarks
  • MediaPipe Hands - gesture recognition (owner selection via open palm)
  • MediaPipe Pose - full-body pose estimation + activity inference

Performance:

  • All models active: 10-15 FPS
  • Minimal mode (detection only): 25-30 FPS
  • INT8 quantized: 30-40 FPS

The hard parts:

MediaPipe at high resolution was the first wall. It's optimized for 640x480 and degrades badly above that. Solution: run MediaPipe on a downscaled stream in parallel, fuse results back to the full-res frame using coordinate remapping.

Depth + detection fusion: MiDaS gives relative depth, not metric. Used bbox center coordinates to sample the depth map and output approximate distance strings ("~40cm") - good enough for navigation, not for manipulation.

Person following logic: instead of a dedicated re-ID model (too heavy for the hardware), tracks by bbox height ratio. Taller bbox = closer. Simple, fast, surprisingly robust for indoor following.
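The height-ratio approach reduces to a one-line proportional controller. A sketch with hypothetical gains (`follow_cmd` and all constants are illustrative, not from the repo - tune per base):

```python
def follow_cmd(bbox_h, frame_h, target_ratio=0.55, k=1.5, deadband=0.05):
    """Proportional forward/back speed from bbox height ratio.

    ratio > target -> person too close, back up (negative speed)
    ratio < target -> person far away, move forward
    """
    ratio = bbox_h / frame_h
    err = target_ratio - ratio
    if abs(err) < deadband:
        return 0.0               # close enough - don't oscillate
    return max(-0.5, min(0.5, k * err))
```

The deadband is what makes it feel robust indoors: without it the base hunts back and forth as the bbox jitters frame to frame.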

Currently using a Waveshare IMX219 at 1920x1080. Planning to test stereo next for metric depth.

Full code: github.com/mandarwagh9/openeyes

Curious how others are handling model fusion pipelines on constrained hardware - specifically depth + detection synchronization.

u/Straight_Stable_6095 — 24 hours ago
OpenEyes - ROS2 native vision system for humanoid robots | YOLO11n + MiDaS + MediaPipe, all on Jetson Orin Nano

Built a ROS2-integrated vision stack for humanoid robots that publishes detection, depth, pose, and gesture data as native ROS2 topics.

What it publishes:

  • /openeyes/detections - YOLO11n bounding boxes + class labels
  • /openeyes/depth - MiDaS relative depth map
  • /openeyes/pose - MediaPipe full-body pose keypoints
  • /openeyes/gesture - recognized hand gestures
  • /openeyes/tracking - persistent object IDs across frames

Run it with:

python src/main.py --ros2

Tested on Jetson Orin Nano 8GB with JetPack 6.2. Everything runs on-device, no cloud dependency.

The person-following mode uses bbox height ratio to estimate proximity and publishes velocity commands directly - works out of the box with most differential drive bases.

Would love feedback from people building nav stacks on top of vision pipelines. Specifically: what topic conventions are you using for perception output? Trying to make this more plug-and-play with existing robot stacks.

GitHub: github.com/mandarwagh9/openeyes

u/Straight_Stable_6095 — 24 hours ago