r/computervision

How to get rejected by IEEE T-PAMI with 'Excellent' scores? [D]
▲ 22 r/computervision+1 crossposts


Hello everyone. I am keeping my identity anonymous today to protect my professional career. I am a researcher in Computer Vision, and I am sharing this story because I have hit a devastating deadlock with IEEE T-PAMI and the IEEE Ethics Office.

Our Situation

https://preview.redd.it/ipxwj6eus32h1.jpg?width=960&format=pjpg&auto=webp&s=1f58700644683be640f6bb057c74011649f59219

In the decision letter, there were three highly positive reviews (two EXCELLENT, one GOOD). However, the AE (who is one of T-PAMI's associate EICs) rejected the paper by quoting comments from a "4th" reviewer.

>The most staggering part: We later accidentally met the actual 4th reviewer. He CONFIRMED having submitted a POSITIVE review, which was strangely withdrawn by the editor in the backend before the final decision was made.

The AE lied by saying: "... received 3 sets of comments, and one on the way ... ".

We have formally asked the IEEE (and the Computer Society) to thoroughly investigate this issue, specifically asking them to check the AE's backend activity logs in the submission system.

However, half a year has passed, and we have received no direct response.

Has anyone experienced something similar with IEEE or other top venues? Any advice or help bringing visibility to this would be greatly appreciated.

Evidence:

Below is the report to IEEE Ethics (identifying information has been covered):

https://preview.redd.it/e41vt2rsn02h1.png?width=3508&format=png&auto=webp&s=b2ee2d3f092dad5e20b45b9daeea7fa7b6f01d20

https://preview.redd.it/t29n03rsn02h1.png?width=3508&format=png&auto=webp&s=67aa6bc36aed76617af34e7913a203f9236bc536

https://preview.redd.it/6v5ys2rsn02h1.png?width=3508&format=png&auto=webp&s=f2452998f57f1b157d71b569dd5ff87e4d3d0b6c

https://preview.redd.it/epdxv2rsn02h1.png?width=3508&format=png&auto=webp&s=d01da8cdf9e3f6cd5be53f884b02b154f86d0b48

https://preview.redd.it/fuw3k3rsn02h1.png?width=3508&format=png&auto=webp&s=03e75f763a54429758102da4933af53511642e7d

https://preview.redd.it/xn0ze3rsn02h1.png?width=3508&format=png&auto=webp&s=9f00e88f186c0afa349d4a46439216ae57642d98

u/cussealin — 4 hours ago

Combined P2PNet + Apple's Depth Pro to reconstruct crowds in 3D and predict people hidden behind obstructions — from a single image

Estimating crowd size by eye is notoriously hard. I used a CNN called P2PNet to detect people's heads and built a custom pipeline to detect occluded people and reconstruct an approximate 3D scene.

Pipeline overview

  1. P2PNet detection gives 2D head points
  2. Depth Pro (Apple's metric monocular depth model) gives metric Z per pixel
  3. Head points are back-projected to world-space XYZ using depth + focal length
  4. RANSAC fits the dominant ground plane from the head point cloud
  5. World scale is corrected based on a maximum real-world crowd density of 6.5 people/m²
  6. Shadow-offset DBSCAN clusters the crowd — offset centers are computed per-person by projecting their occlusion shadow forward, which bridges the gaps that appear between rows of people at depth due to sparse data and the low camera angle.
  7. Alpha shapes (Delaunay + circumradius threshold) trace concave hulls around each crowd cluster; interior voids naturally emerge as obstacle holes
  8. A heatmap is built from the DBSCAN per-point densities, densities in missing regions are interpolated, and occluded people are populated using Poisson sampling
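
Step 3 above can be sketched with a plain pinhole camera model. This is only an illustration, not the actual pipeline code: the `Intrinsics` class, the `back_project` helper, the principal point at the image center, and all numbers are assumptions made up for the example, and the result is in camera-centered coordinates before any ground-plane alignment.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Hypothetical camera intrinsics; not taken from the original pipeline."""
    focal_px: float  # focal length in pixels (assumed equal in x and y)
    cx: float        # principal point x, in pixels
    cy: float        # principal point y, in pixels


def back_project(u, v, depth_m, K):
    """Pinhole back-projection of pixel (u, v) with metric depth Z to XYZ in meters."""
    x = (u - K.cx) * depth_m / K.focal_px
    y = (v - K.cy) * depth_m / K.focal_px
    return (x, y, depth_m)


# Made-up example: a 1920x1080 image with a 1000 px focal length.
K = Intrinsics(focal_px=1000.0, cx=960.0, cy=540.0)

# Each head is (u, v, depth in meters), e.g. a 2D head point plus the depth map value there.
heads = [(960.0, 540.0, 10.0), (1160.0, 540.0, 10.0)]
points = [back_project(u, v, z, K) for u, v, z in heads]
```

A head 200 px to the right of the image center at 10 m depth lands 2 m to the right of the optical axis, which shows how errors in depth or focal length scale the whole scene.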

The shadow-offset trick (step 6) is the part I haven't seen elsewhere. DBSCAN breaks crowd clusters at depth because row-to-row gaps exceed the search radius. My original idea was a pill-shaped search area, but shifting each person's search center to the midpoint between their actual position and their shadow tip, with a search radius that scales linearly with depth, is faster and also reconnects those rows.
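
A rough sketch of the offset-center idea follows. It rests on assumptions that are not from the pipeline itself: positions are 2D ground-plane coordinates with the camera foot at the origin, the "shadow" is the ground strip hidden behind a person of fixed height (length from similar triangles), and the function names, radius constants, and camera height are all invented for the example. It only shows the neighborhood query a DBSCAN-style clustering would use, not the clustering itself.

```python
import math


def shadow_offset_center(pos, cam_height_m, person_height_m=1.7):
    """Midpoint between a person's ground position and their occlusion shadow tip.

    Assumption: the shadow extends away from the camera foot (the origin) with
    length d * h / (H - h), where d is ground distance, h person height, H camera height.
    """
    x, z = pos
    d = math.hypot(x, z)
    if d == 0 or cam_height_m <= person_height_m:
        raise ValueError("need a camera above head height and a person away from the camera foot")
    shadow_len = d * person_height_m / (cam_height_m - person_height_m)
    ux, uz = x / d, z / d  # unit direction pointing away from the camera
    tip = (x + ux * shadow_len, z + uz * shadow_len)
    return ((x + tip[0]) / 2, (z + tip[1]) / 2)


def search_radius(depth_m, base_m=0.5, gain=0.06):
    """Search radius growing linearly with depth (both constants are made up)."""
    return base_m + gain * depth_m


def is_neighbor(center, radius, other):
    return math.dist(center, other) <= radius


# Two rows of people 2 m apart in depth, seen from a camera 10.2 m high (made-up numbers).
front, back = (0.0, 10.0), (0.0, 12.0)
radius = search_radius(depth_m=10.0)

linked_plain = is_neighbor(front, radius, back)    # search centered on the person
center = shadow_offset_center(front, cam_height_m=10.2)
linked_offset = is_neighbor(center, radius, back)  # search centered on the offset midpoint
```

With these numbers the same radius fails to link the two rows when centered on the person, but links them once the center is shifted toward the shadow tip.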

Output

The frontend renders a density-zoned map over the image: detected people, auto-generated obstacle polygons (holes in the alpha shape), occlusion shadow zones with predicted counts, and a confidence interval. AI assumptions are editable objects: the analyst can delete clusters or override predicted densities. I'm currently working on extending this to boundary editing and to placing a POI to adjust the attenuation model. Modifications are logged to an audit trail that ships with the export.

Known limitations

- Ground plane assumption breaks on stairs and tiered seating (RANSAC fit flagged when inlier ratio < 60%)
- Single image only at this stage — video fusion is the next thing I'm building
- My method doesn't model crowd dynamics at the scale of individuals. Calculating real individual positions may need an iterative approach, which goes against optimizing for speed

Resources

- evolving blog post with up-to-date info: https://www.balazshimself.com/blog/crowd-predictor
- MVP tool: https://www.crowdcounting.net

Any feedback is welcome! Thanks for your time!

u/balazshimself — 12 hours ago
▲ 1 r/computervision+2 crossposts

Finding the speed of a kick from a video

Somebody suggested this subreddit for this problem; I hope it is relevant.

https://imgur.com/a/RHUmoFz

I want to calculate the speed of this man's kick by measuring the distance his foot travels and dividing it by the time it takes. I know that the speed is written on the video, but I want to confirm it because it vastly exceeds the speeds from studies I've read. The maximum speed in those studies is under 20 m/s. This guy is kicking at over 60 m/s.

Finding the time it takes for him to kick is easy enough.

Finding the distance is very difficult for me. I know that the guy's height is about 180 cm from an interview, and I think I can somehow use that information to solve my problem. I'm not sure though, and I don't want to waste my time on something that can't be done. So, is it doable?

If it isn't possible you can ignore the rest of the post.

Is there software for doing this? Either free or cheap.

My idea (obviously can be wrong): use one of the first frames where he is standing to find what 180 cm looks like in a part of the frame. His knees are bent, so I have to first find how it would look in the frame if he was standing. I think this can be done with geometry. Since the camera is steady, I can copy the 180 cm line to the other frames. Then I approximate the arc of the kick by measuring a few small straight distances that the kick travels, frame by frame, and adding them.
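
The idea above can be sketched as a small calculation. Everything here is made up for illustration: the pixel coordinates, the frame rate, and the helper names. It also assumes the foot moves roughly parallel to the image plane and at about the same distance from the camera as the 180 cm reference; motion toward or away from the camera would be underestimated.

```python
import math


def meters_per_pixel(ref_len_m, ref_start_px, ref_end_px):
    """Scale factor from a reference of known real length drawn in the frame."""
    return ref_len_m / math.dist(ref_start_px, ref_end_px)


def path_speed(foot_px, fps, scale_m_per_px):
    """Average speed along a polyline of per-frame foot positions (one point per frame)."""
    path_px = sum(math.dist(a, b) for a, b in zip(foot_px, foot_px[1:]))
    duration_s = (len(foot_px) - 1) / fps
    return path_px * scale_m_per_px / duration_s


# Hypothetical numbers: the 1.80 m body height spans 600 px in a standing frame.
scale = meters_per_pixel(1.80, (400, 100), (400, 700))

# Hypothetical foot positions in four consecutive frames of a 30 fps video.
foot = [(300, 650), (360, 570), (440, 510), (540, 510)]
speed = path_speed(foot, fps=30.0, scale_m_per_px=scale)
print(f"{speed:.1f} m/s")
```

The result is very sensitive to the frame rate: if the clip is slow motion, the true capture rate has to be used rather than the playback rate.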

I tried to do this for a few hours and didn't make any progress. So I kindly ask for help on how to solve this problem.

Alternatively, has AI gotten good enough to solve this kind of problem? Which AI could I use in that case?

u/JustNormalRedditUser — 12 hours ago

Built a local AI video analytics PoC for scene-level event analysis (YOLO26)

I built a local AI video analytics PoC that analyzes uploaded videos and generates structured reports from the scene.

The system focuses on scene-level understanding rather than only basic object detection. It can report signals such as people density, movement patterns, zone activity, crossing behavior, forgotten-item candidates, and safety-event candidates like fall or lying-still behavior.

The goal was to create a review-oriented workflow where the system highlights possible events, generates a risk score, and produces visual/report-based outputs for human review.

It does not make final security decisions. The detected events are treated as candidate signals that should be reviewed by an operator.

For the test workflow, I intentionally used mixed video scenes to evaluate how the system handles pedestrian flow, object-related events, safety-event candidates, and scene transitions.

Optional portfolio link: www.linkedin.com/in/brkndc

u/OldAnywhere3060 — 20 hours ago

vggt-omega takes videos and creates a point cloud. fast, and good quality generations for pcd and depth

ofc meta would drop a dope model on a friday afternoon and have me scrambling to integrate it over my birthday weekend

you can quickly get started with the model in fiftyone by following the steps in this repo: https://github.com/harpreetsahota204/vggt_omega

u/datascienceharp — 1 day ago
▲ 12 r/computervision+1 crossposts

Synthetic DMS Training Data Generation with Video Models

I like spending my free time testing new AI tools and seeing where they might fit into real computer vision workflows. This time I experimented with synthetic training data generation for Driver Monitoring Systems using Seedance 2.0.

The inspiration came from Vision Banana: https://vision-banana.github.io/

The idea that really caught my attention is simple but powerful: many vision tasks can be represented as RGB outputs. A segmentation mask, an instance mask, a depth map, or another dense prediction target can all be treated as an image-like output.

So I tried to apply this thinking to video.

The workflow:

  1. Generate a realistic synthetic driver monitoring video
  2. Use the same video to generate a semantic segmentation mask
  3. Use the same video to generate an instance segmentation mask
  4. Combine the outputs into a dataset-like structure

The mosaic video shows the result:

RGB video + semantic mask + instance mask, aligned frame by frame.

The scene is a fictional driver gradually becoming drowsy behind the wheel. This kind of scenario is useful for DMS development, but difficult to collect and annotate at scale with real-world data.

Of course, generated annotations still need QA. They are not perfect ground truth.

But for prototyping, rare-case simulation, and early dataset generation, this feels like a very promising direction.

The interesting part is that the final output is not just a nice synthetic video. It can become structured training data:

  • RGB frames from the generated video
  • semantic classes from the semantic mask
  • object regions and bounding boxes from the instance mask
  • YOLO / COCO-style annotations after post-processing
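
As a rough illustration of that post-processing step, the sketch below turns an instance mask into YOLO-style normalized boxes. It assumes the mask colors have already been mapped to integer instance IDs (0 for background) and uses a tiny hand-written mask; the function names are hypothetical and not from the blog post.

```python
def boxes_from_instance_mask(mask, background=0):
    """mask: 2D list of integer instance IDs. Returns {id: (x_min, y_min, x_max, y_max)} in pixels."""
    boxes = {}
    for y, row in enumerate(mask):
        for x, inst in enumerate(row):
            if inst == background:
                continue
            if inst not in boxes:
                boxes[inst] = [x, y, x, y]
            else:
                b = boxes[inst]
                b[0], b[1] = min(b[0], x), min(b[1], y)
                b[2], b[3] = max(b[2], x), max(b[3], y)
    return {inst: tuple(b) for inst, b in boxes.items()}


def to_yolo(box, width, height):
    """Convert an inclusive pixel box to YOLO (cx, cy, w, h), normalized to [0, 1]."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0 + 1, y1 - y0 + 1
    return ((x0 + w / 2) / width, (y0 + h / 2) / height, w / width, h / height)


# Tiny made-up mask: instance 1 is a 2x2 blob, instance 2 is a 1x2 blob.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 0, 0, 0],
]
boxes = boxes_from_instance_mask(mask)
yolo = {inst: to_yolo(b, width=6, height=4) for inst, b in boxes.items()}
```

Generated masks are rarely pixel-clean, so in practice a step that drops very small regions before box extraction would likely be needed as part of the QA mentioned above.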

I wrote a more detailed blog post about the experiment here:

https://www.antal.ai/blog/synthetic_dms_training_data.html

u/Gloomy_Recognition_4 — 21 hours ago

Marlin2B: a tiny video language model to extract structured information from videos

Hi all!

Shubham and Aryan here, putting out our first open source video language model release.

Story time: we were building video editing agents for social-media content and were using Gemini-2.5-Flash to analyse IG reels and find events in them. It works, but at around a thousand clips/day the cost adds up, and at our scale we kept hitting the content policy on perfectly fine social media clips.

We had a couple of H100s sitting around, so we put them on solving this as a side project. We kept the scope deliberately narrow: not a general VLM you can chat with, just two operations we needed in production. We're releasing it because it seems generally useful for anyone building structured-video pipelines.

The interesting work wasn't the training loop, it was the data curation. We expected to ride the public video-annotated corpora (Tarsier-Recap, ActivityNet, Charades-Ego, LSMDC, etc.) but were disappointed. In practice most of them have one-line captions and rough timestamps, and aren't really annotated event-by-event at second-level precision.

So we wrote a teacher + pooling + human-review pipeline with Gemini-3-Flash in thinking mode and re-annotated ~400K clips from publicly available dataset mixes with fine-grained temporal captions. We then ran SFT + SimPO post-training to make the model really good at dense captioning and temporal grounding. Honestly, most of the project was making sure this data pipeline was high-quality and free of hallucinations.

The result: Marlin is a 2B video VLM tuned for the two questions developers actually want to ask of their videos: what is happening, and when? It produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to span-grounded (start, end) ranges in the video. At 2B params, it's the strongest open model in its weight class on dense captioning (DREAM-1K, CaReBench) and natural-language temporal grounding (TimeLens-Bench), and competitive with Gemini-2.5 at a fraction of the cost. We'll also release our training recipe and a new benchmark for video captioning and grounding soon.

Marlin-2B is open-sourced and comes with vLLM inference and two modes:

  • marlin.caption() gives a structured output of scene description and time-grounded events from a video.
  • marlin.find() gives (start, end) timestamps for a natural-language query over a video.

Weights are open and free to use on HF. If you find it useful, or have ideas on what capabilities we should improve next for real-world use cases, we would love to hear them!!

We want to build more small, task-specific video language models like this to enable more open-ended video analytics use cases.

This is what our results look like:

https://preview.redd.it/nowpwlotyy1h1.jpg?width=1170&format=pjpg&auto=webp&s=aa68fdde3886b8a4dfd895b6f0e0e1e1d397a282

https://preview.redd.it/stfnnkotyy1h1.jpg?width=3370&format=pjpg&auto=webp&s=2323f4dc7c4a79e54db85bf1fd940a54e353d103

https://preview.redd.it/7ifpzjotyy1h1.jpg?width=1170&format=pjpg&auto=webp&s=c721ce9e253ef628e21b0a254798a0149e6444b7

u/AndromedaGambler — 1 day ago

I made a QGIS plugin called "AI Edit" to detect features from aerial images

Put your reference image (what you want)
Type your prompt
Run

My next step is to turn those pixels into vectors. I'm already working on it; if anyone has advice, I'm all ears.

u/Lilien_rig — 1 day ago
▲ 6 r/computervision+1 crossposts

Seeking Advice for MASc Thesis Topic in Computer Vision (Goal: AMD / Google / ADAS / Industry Internship)

Hi,

I’ve recently been accepted into a MASc in Electrical and Computer Engineering under a professor whose research focuses on automotive sensor systems, intelligent transportation, computer vision, and driver assistance technologies.

My long-term goal is to work at companies like AMD, Google, or similar high-performance computing / autonomous systems companies, ideally in roles related to:

  • Computer Vision
  • Embedded AI
  • Firmware / Systems Engineering
  • ADAS / Autonomous Driving
  • GPU acceleration / AI deployment

Currently, I have beginner experience with:

  • Training custom object detection datasets (YOLO/OpenCV)
  • Real-time object detection
  • Thermal + depth sensing applications
  • ROS + Linux
  • Monocular SLAM concepts (ORB-SLAM3)
  • Embedded systems / real-time sensing projects

I’m trying to choose a thesis direction that would:

  1. Align well with my supervisor’s automotive/computer vision research
  2. Build industry-relevant skills
  3. Improve my chances for internships or future roles at AMD, Google, or similar companies
  4. Potentially connect with industry-sponsored projects in Ontario or Canada

I’m very open-minded and willing to learn new technologies (C++, GPU acceleration, ROS2, embedded systems, etc.).

  • What thesis topics would best position me for AMD / Google-level roles?
  • Should I focus more on:
    • ADAS / perception
    • Visual SLAM
    • Edge AI / deployment
    • GPU optimization
    • Embedded vision systems
  • Are there companies in Ontario/Canada that sponsor MASc students or work with industry research projects in these areas?
  • What technical skills would be most sought after?

Any advice from industry professionals, grad students, or researchers would be greatly appreciated. Thank you.


Fine tuning yolo to find people in industrial environment

I am a student and I am trying to fine-tune YOLO to find people in my very high resolution industrial pictures. Without fine-tuning, I get a lot of false positives because of tubes and pipes (and if I raise the confidence threshold, I don't find the people). So I fine-tuned YOLO. The problem is that I have very few images with people (just 20 tiles with humans, out of 750 high-res pictures that I slice into tiles).

I used my 20 tiles with humans to train/val YOLO, plus about 2000 empty tiles. When I test again on all my HR images, I get fewer false positives and detect almost all humans. But I guess it's overfitting, because the test set includes the tiles with humans that were used to train YOLO.
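
One way to check the overfitting concern, sketched here under assumptions, is to hold out whole source pictures so that no tile from a validation picture is ever seen in training. The tile and picture names and the `split_by_source` helper are hypothetical; with only 20 positive tiles, repeating this split with several seeds would probably give a more trustworthy estimate than a single split.

```python
import random
from collections import defaultdict


def split_by_source(tiles, val_fraction=0.25, seed=0):
    """tiles: list of (tile_name, source_image). All tiles of a source image stay on one side."""
    by_source = defaultdict(list)
    for name, src in tiles:
        by_source[src].append(name)

    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)
    n_val = max(1, round(len(sources) * val_fraction))
    val_sources = set(sources[:n_val])

    train = [n for s in sources if s not in val_sources for n in by_source[s]]
    val = [n for s in sources if s in val_sources for n in by_source[s]]
    return train, val, val_sources


# Made-up example: 8 high-res pictures, each sliced into 3 tiles.
tiles = [(f"img{i}_tile{j}.jpg", f"img{i}.jpg") for i in range(8) for j in range(3)]
train, val, val_sources = split_by_source(tiles)
```

Metrics computed only on `val` then reflect pictures the model has never seen, which is closer to the real use case than testing on all 750 pictures.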

What would you do?

Thanks 

u/Cold-Act1693 — 2 days ago

YoloLiteV2 now pip installable

I posted last week about an upgrade to my repo YoloLite. I have now decided to launch V2 directly via PyPI! You can test it out right now with a simple pip install yololite and help me find bugs and benchmark the models.

Everything is Apache 2.0, and the weights are automatically downloaded from GitHub on demand.

You can either use the API directly via Python or run everything via the CLI:

yololite mode=predict model=yololite_cs3_m.pt source=test.jpg conf=0.4 save=True
yololite mode=train model=yololite_mnv4_s.pt data="data.yaml" epochs=30 workers=4

I have pretrained a total of 9 models across 3 different lightweight backbones:

  • CS3Darknet backbone: yololite_cs3_n.pt | yololite_cs3_s.pt | yololite_cs3_m.pt
  • MobileNetV4 backbone: yololite_mnv4_n.pt | yololite_mnv4_s.pt | yololite_mnv4_m.pt
  • HGNetV2 backbone: yololite_hg2_n.pt | yololite_hg2_s.pt | yololite_hg2_m.pt

The models have been pretrained on the official COCO-minitrain_25k dataset. (Check out their official repo for more info on the Pearson correlation coefficients between full COCO and minitrain).

Currently supported export formats include ONNX and TensorRT. The framework also supports post-export validation to ensure stability and mAP consistency after deployment.

Would love to get your feedback and bug reports!

PyPI: pip install yololite

u/ConferenceSavings238 — 3 days ago
▲ 113 r/computervision+1 crossposts

TrafficLab 3D: Digital-twin with just Mp4 and Google Maps

I built an open-source traffic digital twin tool that works from just:

  • CCTV footage
  • Google Maps imagery

Project:
https://github.com/duy-phamduc68/TrafficLab-3D

It includes:

  • staged camera calibration
  • object detection/tracking
  • speed + orientation estimation
  • synchronized CCTV + satellite visualization with 3D/floor boxes

Still has a lot of limitations (planar assumptions, occlusion problems, manual calibration workload), but I wanted to release it openly anyway and iterate from feedback.

u/zaclord68 — 3 days ago

Best tools for annotation?

I'm a beginner in computer vision and I have a project on lane marking detection from dashcam videos. So far I have only looked at Label Studio. What should I use, given that each video will have so many frames? Note: there is a separate, large enough intern team to work on annotations.

u/doIores_haze — 2 days ago
▲ 12 r/computervision+1 crossposts

FFGear: A Multi-threaded, High-performance FFmpeg Decoder API in Pure Python

FFGear provides direct, transparent access to the full FFmpeg Decoder feature-set, including:

  • Hardware-Accelerated Decoding — GPU-powered decoding with CUDA/CUVID and other hardware-accelerated backends 
  • Flexible Pixel Formats — support for any FFmpeg pixel format (e.g., bgr24, yuv420p, gray) with optional OpenCV compatibility patches for YUV/NV layouts.
  • Per-Frame Metadata Extraction — asynchronous frame metadata extraction through the showinfo filter.
  • Live Complex Filtergraphs — support for live simple and complex FFmpeg filter pipelines.
  • Wide Source Support — capture USB, virtual, and IP camera feeds by index similar to OpenCV, along with support for multimedia files, image sequences, desktop screen capture, and network streams (HTTP(s), RTSP/RTP, etc.).

Get Started here: https://abhitronix.github.io/vidgear/latest/gears/ffgear/

u/abhi_uno — 2 days ago

Computer Vision Task

I'm currently working on a computer vision project that includes an object detection module. When I scan a supermarket shelf, it has to show each product's name below it. Is that possible? If so, please suggest an architecture. There are around 20k product classes to detect, and some look very similar (the same product in different variants).

u/Vijay-Data-Science — 3 days ago

Experimenting with egocentric video

Hey guys,

With robotics growing so fast, first-person (egocentric) vision is becoming a massive domain in CV on its own. If robots are ever going to help us in the real world, they need to understand how humans handle objects from our own perspective.

I've been deep in experimentation mode, running tests with a CV model on egocentric video of simple everyday tasks, built from scratch (annotation -> model training -> implementation)!

For this project, I focused on a simple, everyday task: opening and closing a bottle cap. Here is a quick look at the video showing the real-time tracking and state changes in action:

  • Data Annotation: I started by capturing raw egocentric footage. To get clean bounding boxes for the bottle and cap across the sequence, I used Labellerr. It made handling the frame-by-frame labeling smooth and kept the dataset precise.
  • Model Training & Tracking: I paired object detection for the assets (bottle and cap) with hand skeleton tracking to map exactly how the fingers grasp and interact with the objects.
  • State Logic Building: Once the spatial coordinates were tracking properly, I built a custom state machine logic on top of it. The system actively differentiates between IDLE, OPENING THE BOTTLE, and CLOSING THE BOTTLE based on hand-to-object intersections and hand velocity.
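
To make the state logic concrete, here is a minimal sketch of what such a state machine could look like. It is not the project's code: the box format, the signed `twist_velocity` input, the threshold, and the rule that a positive sign means opening are all assumptions chosen for the example.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    OPENING = "OPENING THE BOTTLE"
    CLOSING = "CLOSING THE BOTTLE"


def boxes_intersect(a, b):
    """Axis-aligned overlap test for boxes given as (x0, y0, x1, y1)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def next_state(hand_box, cap_box, twist_velocity, min_speed=0.5):
    """Classify one frame from hand-to-cap overlap and a signed hand velocity.

    Assumption: positive velocity means an opening twist, negative means closing.
    """
    if not boxes_intersect(hand_box, cap_box) or abs(twist_velocity) < min_speed:
        return State.IDLE
    return State.OPENING if twist_velocity > 0 else State.CLOSING


# Made-up frames: (hand box, cap box, signed twist velocity).
cap = (50, 50, 60, 60)
frames = [
    ((0, 0, 10, 10), cap, 2.0),     # hand far from the cap
    ((45, 45, 58, 58), cap, 1.5),   # hand on the cap, positive twist
    ((45, 45, 58, 58), cap, 0.1),   # hand on the cap, barely moving
    ((45, 45, 58, 58), cap, -1.2),  # hand on the cap, negative twist
]
states = [next_state(h, c, v) for h, c, v in frames]
```

A per-frame rule like this tends to flicker on noisy detections, so smoothing over a few frames before switching state would be a natural extension.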

This is one of many examples I am experimenting with on egocentric video (feel free to suggest ideas).

Would love to hear your thoughts! Are any of you working on egocentric datasets or robotics perception pipelines right now? What are the biggest bottlenecks you’re running into with first-person data?

Resources:
- video: link
- code: link

u/Full_Piano_3448 — 3 days ago

ECCV 2026 Rebuttal Visibility for Reviewers

Dear ECCV 2026 reviewers,

As a reviewer myself, I currently cannot see the rebuttals for two papers assigned to me. Since the rebuttal PDFs are invisible on my side, I initially assumed the authors had not submitted rebuttals.

However, one AC later commented and recommended that I reconsider/update my initial rating based on the rebuttal, which makes me suspect this may be an OpenReview visibility issue rather than missing submissions.

Is anyone else experiencing this? Can you normally view the rebuttal PDFs?

u/Forsaken_Ad_9749 — 2 days ago