u/Important_Priority76

I have been integrating PaddleOCR-VL-1.5 and PP-DocLayoutV3 into X-AnyLabeling, and I think this stack is interesting for the local/self-hosted document AI crowd.

The main thing it reminded me of: document parsing is not just text extraction.

For a clean receipt or a cropped text line, classic OCR may be enough. But once you move to papers, scanned PDFs, photographed documents, contracts, technical reports, tables, equations, charts, seals, headers, footers, and multi-column layouts, the problem starts to look more like a small VLM/document-understanding pipeline:

  1. Find the document elements.
  2. Preserve their geometry.
  3. Recover the reading order.
  4. Route each region to the right recognizer.
  5. Let a human verify and correct the structured result.
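To make that concrete, here is a minimal sketch of the shape such a pipeline tends to take. The names (LayoutRegion, detect_layout, recognize_region) are placeholders I made up for illustration, not X-AnyLabeling or PaddleOCR APIs:

```python
# Toy pipeline skeleton. The detector and recognizer are passed in as
# callables because the real ones depend on your model stack.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LayoutRegion:
    label: str                          # e.g. "text", "table", "formula", "seal"
    polygon: list[tuple[float, float]]  # geometry is kept, not just a crop
    reading_index: int                  # position in the recovered reading order
    content: Optional[str] = None       # filled by the recognizer, editable later

def parse_page(page_image, detect_layout: Callable, recognize_region: Callable):
    regions = detect_layout(page_image)            # 1-2: find elements, keep geometry
    regions.sort(key=lambda r: r.reading_index)    # 3: recover reading order
    for region in regions:
        region.content = recognize_region(page_image, region)  # 4: route per label
    return regions                                 # 5: hand off for human review
```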

That is where the combination of PP-DocLayoutV3 and PaddleOCR-VL-1.5 is interesting.

PP-DocLayoutV3 handles the layout side. Instead of treating a page as a flat OCR canvas, it predicts document regions such as titles, paragraphs, tables, formulas, charts, images, seals, headers, footers, and page numbers. Recent descriptions of the model emphasize robustness to complex layouts and to physical distortions such as skew, curved pages, and uneven lighting, with reading-order prediction built into the layout analysis pipeline.

PaddleOCR-VL-1.5 handles the multimodal recognition side. It is a compact 0.9B multi-task VLM for document parsing, with official support for tasks such as OCR, table recognition, formula recognition, chart recognition, text spotting, and seal recognition. The model page reports strong results on OmniDocBench v1.5 and Real5-OmniDocBench, with particular focus on real-world distortions like scanning artifacts, skew, warping, screen photography, and illumination changes.

What I wanted in X-AnyLabeling was not another “upload PDF, get text” demo. I wanted a workflow where the local model output stays inspectable and editable, because that is where document parsing usually breaks in practice.

The practical workflow is:

| Step | What happens |
| --- | --- |
| Layout detection | PP-DocLayoutV3 identifies page blocks and layout categories |
| Task routing | Labels like `table`, `display_formula`, `chart`, `seal`, and `text` are routed to the matching PaddleOCR-VL-1.5 task |
| Recognition | PaddleOCR-VL-1.5 returns text, Markdown/HTML tables, LaTeX formulas, chart content, seal text, or text-spotting results |
| Review | The source page and parsed blocks are shown side-by-side for correction |
| Export | Results can be copied or saved as Markdown/JSON, with edited blocks tracked locally |
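The task-routing row is essentially a label-to-task lookup. A hedged sketch; the exact label strings and task identifiers here are my assumptions for illustration, not the shipped config:

```python
# Map layout categories from PP-DocLayoutV3 to PaddleOCR-VL-1.5 tasks.
# Label and task names are illustrative, not the project's real keys.
LABEL_TO_TASK = {
    "text": "ocr",
    "title": "ocr",
    "table": "table_recognition",
    "display_formula": "formula_recognition",
    "chart": "chart_recognition",
    "seal": "seal_recognition",
}

def route(region_label: str) -> str:
    # Fall back to plain OCR for labels without a dedicated recognizer.
    return LABEL_TO_TASK.get(region_label, "ocr")
```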

This matters because the annoying part of local document AI is often not the first model output. It is the correction loop:

  • Did the model split the table correctly?
  • Did the formula become usable LaTeX?
  • Did a header/footer get mixed into the body text?
  • Is the reading order still correct in a multi-column PDF?
  • Can I fix one block without losing the rest of the parse?

The new panel in X-AnyLabeling is built around that loop. You can import images or PDFs, view layout polygons over the source page, click between source regions and parsed blocks, edit normal text with a rich-text editor, edit formulas as LaTeX with preview, edit tables at the cell level, and inspect the saved JSON directly.
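For context, a single parsed block might look roughly like this in the saved JSON. This is a hypothetical shape for illustration; the real X-AnyLabeling schema may differ:

```python
# Hypothetical per-block record: enough structure that one block can be
# edited (and flagged as edited) without touching the rest of the parse.
import json

block = {
    "id": 3,
    "label": "display_formula",
    "polygon": [[120, 340], [480, 340], [480, 390], [120, 390]],
    "reading_index": 3,
    "content": r"E = mc^2",  # LaTeX for formulas, Markdown/HTML for tables
    "edited": True,          # tracked locally so corrections survive export
}

print(json.dumps(block, indent=2))
```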

There are two deployment paths:

  • Use the official PaddleOCR API for quick testing.
  • Use X-AnyLabeling-Server for a self-hosted/private deployment of the PP-DocLayoutV3 + PaddleOCR-VL-1.5 workflow.

For this subreddit, I think the self-hosted path is the more interesting one. PaddleOCR-VL-1.5 is small enough to be a practical document VLM candidate, while PP-DocLayoutV3 gives the pipeline a structured layout front end. The result is a hybrid setup: not just one giant VLM prompt over the whole page, and not just a traditional OCR pipeline either.
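If you go the self-hosted route, the call pattern is roughly "send a page, get back structured blocks". The endpoint path, payload fields, and response keys below are assumptions for illustration, not the documented X-AnyLabeling-Server API:

```python
# Hypothetical request against a self-hosted parsing endpoint.
import base64
import requests

with open("page_001.png", "rb") as f:
    payload = {
        "image": base64.b64encode(f.read()).decode("utf-8"),
        "pipeline": "pp-doclayoutv3+paddleocr-vl-1.5",  # made-up identifier
    }

resp = requests.post("http://localhost:8000/api/parse", json=payload, timeout=120)
resp.raise_for_status()
for block in resp.json().get("blocks", []):
    print(block.get("label"), (block.get("content") or "")[:60])
```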

What I like about this direction is that it treats OCR as a human-in-the-loop document parsing problem, not only a benchmark number. The model needs to be good, but the UI also needs to make mistakes visible, local, and cheap to fix.

Links:

For people here running local document AI pipelines: do you prefer a VLM-first document parser, a modular layout -> OCR/table/formula pipeline, or some hybrid of the two?

u/Important_Priority76 — 16 days ago

Hi r/computervision,

I have been looking more closely at few-shot object counting recently, and one thing that keeps standing out is how awkward the task becomes once the image has both dense small objects and large scale variation.

In many counting pipelines, small dense instances push you toward image upscaling or tiling. That helps recall, but it also makes the system heavier, introduces boundary effects, and can become painful when the same image contains objects at very different sizes. Merging multi-resolution backbone features sounds natural, but the hard part is still how to keep the query representation aware of the exemplars while preserving enough spatial detail for detection.
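For a sense of the cost, here is a generic sliding-window tiler, nothing model-specific: a 4000x3000 image at 1024 px tiles with 128 px overlap already means 20 forward passes, plus stitching detections across tile boundaries afterwards.

```python
# Generic tiler: more tiles means more inference passes, and objects cut at
# tile edges are exactly the boundary effects mentioned above.
def tile_image(width, height, tile=1024, overlap=128):
    step = tile - overlap
    tiles = []
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            tiles.append((x, y, min(x + tile, width), min(y + tile, height)))
    return tiles

print(len(tile_image(4000, 3000)))  # 20 tiles for one image
```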

This also changes how I think about general segmentation models like SAM 3. SAM 3 is very impressive as a unified promptable segmentation model: it can use text or visual prompts, detect/segment open-vocabulary concepts, and even extend the idea to video tracking. For many annotation tasks, that is exactly what you want: type a concept, click a box or point, get masks, refine, move on.

But for counting-heavy scenarios, I still see two obvious gaps:

  • Tiny dense instances are fragile. When the target objects are very small, visually repetitive, and packed together, a general concept segmentation model can miss instances, merge neighbors, or become sensitive to thresholds.
  • Latency matters. SAM-style foundation models are powerful, but the full pipeline can be heavy, especially when you need to run it over many images or repeatedly tune prompts inside an annotation loop.

That is why GeCo2 caught my attention. It is an AAAI 2026 few-shot counting/detection model that tries to handle the scale problem more directly. Instead of treating tiling/upscaling as the main path to high-resolution localization, GeCo2 builds a generalized-scale dense query map through gradual cross-scale query aggregation. In simpler terms, exemplar-specific information is injected and refined across multiple backbone resolutions, then fused into a high-resolution query map that can support both small crowded objects and larger instances.
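To illustrate only the aggregation idea, not GeCo2's actual architecture, here is a toy coarse-to-fine fusion in plain PyTorch: each scale's exemplar-aware feature map is upsampled and mixed into the next finer one, ending with a single high-resolution query map.

```python
# Toy cross-scale fusion. This is NOT the GeCo2 implementation, just a
# generic sketch of aggregating coarse-to-fine feature maps into one
# high-resolution dense query map.
import torch
import torch.nn.functional as F

def fuse_query_maps(per_scale_queries):
    """per_scale_queries: list of (B, C, Hi, Wi) tensors, coarse to fine."""
    fused = per_scale_queries[0]
    for finer in per_scale_queries[1:]:
        # Upsample the running aggregate to the finer resolution, then add the
        # finer-scale features so small, densely packed objects stay resolvable.
        fused = F.interpolate(fused, size=finer.shape[-2:], mode="bilinear",
                              align_corners=False)
        fused = fused + finer
    return fused  # (B, C, H_max, W_max)

maps = [torch.randn(1, 64, 32, 32),
        torch.randn(1, 64, 64, 64),
        torch.randn(1, 64, 128, 128)]
print(fuse_query_maps(maps).shape)  # torch.Size([1, 64, 128, 128])
```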

The parts I find especially interesting:

  • Detection-based counting: the output is not just a scalar count. You get object locations, which makes the result inspectable and editable.
  • Few-shot prompting: the target category is specified by a few exemplar boxes at test time, which is useful for categories that are too specific or too rare to justify training a dedicated detector.
  • Scale-aware query construction: the method focuses on the multi-scale matching problem instead of relying mainly on external image preprocessing tricks.
  • Practical efficiency: the paper reports better counting/detection accuracy while running faster and using less GPU memory than previous state-of-the-art few-shot counters.

I recently integrated GeCo2 into X-AnyLabeling through the remote inference workflow, mainly because counting is often only half of the real problem. In dataset work, I usually want the model to propose boxes, let a human inspect them, fix mistakes, and then export the annotations in a normal dataset format.

The current workflow is:

  1. Load an image.
  2. Select Remote-Server -> GECO2 in the auto-labeling panel.
  3. Draw one or more exemplar boxes around the target object.
  4. Run rectangle-prompt inference.
  5. Review the returned boxes/counts and adjust the confidence threshold if needed.

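Under the hood this is a remote inference call plus a confidence filter. A hedged sketch; the endpoint, payload fields, and response keys are my assumptions for illustration, not the documented X-AnyLabeling-Server API:

```python
# Hypothetical few-shot counting request: an image plus user-drawn exemplar
# boxes, with the count derived from thresholded detections.
import base64
import requests

with open("shelves.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "geco2",
    "image": image_b64,
    # Exemplar boxes drawn by the user, in pixel xyxy coordinates.
    "exemplar_boxes": [[412, 300, 468, 352], [510, 298, 566, 351]],
}

resp = requests.post("http://localhost:8000/api/infer", json=payload, timeout=60)
resp.raise_for_status()
detections = resp.json().get("boxes", [])

# Counting reduces to filtering detections by the confidence threshold.
threshold = 0.35
kept = [d for d in detections if d.get("score", 0.0) >= threshold]
print(f"count = {len(kept)}")
```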
So the model becomes less of a black-box counter and more of an annotation assistant: it proposes dense detections from a few examples, and the user keeps control over the final labels.

Links:

X-AnyLabeling at a glance:

| Area | Current coverage |
| --- | --- |
| Detection | YOLOv5/6/7/8/9/10/11/12/26, YOLOX, RT-DETR, RF-DETR, D-FINE, DEIMv2, and more |
| Segmentation | SAM 1/2/3, SAM-HQ, SAM-Med2D, EfficientViT-SAM, MobileSAM, YOLO-Seg variants |
| Grounding / open-vocabulary | Grounding DINO, YOLO-World, YOLOE |
| Object counting | CountGD, GeCo, GeCo2 |
| Other supported tasks | Pose, tracking, rotated boxes, OCR, document layout, depth, matting, anomaly detection, VLM-assisted labeling, video segmentation |
| Inference options | Local ONNX inference, TensorRT support for YOLO models, remote PyTorch inference through X-AnyLabeling-Server |
| Data formats | COCO, VOC, YOLO, DOTA, MOT, MASK, PPOCR, VLM-R1, ShareGPT, and more |

If you work on counting, dense detection, or annotation tooling, I would love feedback on the GeCo2 integration and on what other counting models/workflows would be worth supporting next.

u/Important_Priority76 — 17 days ago