u/Important_Priority76

I have been integrating PaddleOCR-VL-1.5 and PP-DocLayoutV3 into X-AnyLabeling, and I think this stack is interesting for the local/self-hosted document AI crowd.

The main thing it reminded me of: document parsing is not just text extraction.

For a clean receipt or a cropped text line, classic OCR may be enough. But once you move to papers, scanned PDFs, photographed documents, contracts, technical reports, tables, equations, charts, seals, headers, footers, and multi-column layouts, the problem starts to look more like a small VLM/document-understanding pipeline:

  1. Find the document elements.
  2. Preserve their geometry.
  3. Recover the reading order.
  4. Route each region to the right recognizer.
  5. Let a human verify and correct the structured result.
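To make that concrete, here is a minimal sketch of the shape such a pipeline tends to take. The names (LayoutRegion, detect_layout, recognize_region) are placeholders I made up for illustration, not X-AnyLabeling or PaddleOCR APIs:

```python
# Toy pipeline skeleton. The detector and recognizer are passed in as
# callables because the real ones depend on your model stack.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LayoutRegion:
    label: str                          # e.g. "text", "table", "formula", "seal"
    polygon: list[tuple[float, float]]  # geometry is kept, not just a crop
    reading_index: int                  # position in the recovered reading order
    content: Optional[str] = None       # filled by the recognizer, editable later

def parse_page(page_image, detect_layout: Callable, recognize_region: Callable):
    regions = detect_layout(page_image)            # 1-2: find elements, keep geometry
    regions.sort(key=lambda r: r.reading_index)    # 3: recover reading order
    for region in regions:
        region.content = recognize_region(page_image, region)  # 4: route per label
    return regions                                 # 5: hand off for human review
```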

That is where the combination of PP-DocLayoutV3 and PaddleOCR-VL-1.5 is interesting.

PP-DocLayoutV3 handles the layout side. Instead of treating a page as a flat OCR canvas, it predicts document regions such as titles, paragraphs, tables, formulas, charts, images, seals, headers, footers, and page numbers. Recent descriptions of the model emphasize robustness to complex layouts and to physical distortions such as skew, curved pages, and uneven lighting, with reading-order prediction built into the layout analysis pipeline.

PaddleOCR-VL-1.5 handles the multimodal recognition side. It is a compact 0.9B multi-task VLM for document parsing, with official support for tasks such as OCR, table recognition, formula recognition, chart recognition, text spotting, and seal recognition. The model page reports strong results on OmniDocBench v1.5 and Real5-OmniDocBench, with particular focus on real-world distortions like scanning artifacts, skew, warping, screen photography, and illumination changes.

What I wanted in X-AnyLabeling was not another “upload PDF, get text” demo. I wanted a workflow where the local model output stays inspectable and editable, because that is where document parsing usually breaks in practice.

The practical workflow is:

| Step | What happens |
| --- | --- |
| Layout detection | PP-DocLayoutV3 identifies page blocks and layout categories |
| Task routing | Labels like `table`, `display_formula`, `chart`, `seal`, and `text` are routed to the matching PaddleOCR-VL-1.5 task |
| Recognition | PaddleOCR-VL-1.5 returns text, Markdown/HTML tables, LaTeX formulas, chart content, seal text, or text-spotting results |
| Review | The source page and parsed blocks are shown side-by-side for correction |
| Export | Results can be copied or saved as Markdown/JSON, with edited blocks tracked locally |
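The task-routing row is essentially a label-to-task lookup. A hedged sketch; the exact label strings and task identifiers here are my assumptions for illustration, not the shipped config:

```python
# Map layout categories from PP-DocLayoutV3 to PaddleOCR-VL-1.5 tasks.
# Label and task names are illustrative, not the project's real keys.
LABEL_TO_TASK = {
    "text": "ocr",
    "title": "ocr",
    "table": "table_recognition",
    "display_formula": "formula_recognition",
    "chart": "chart_recognition",
    "seal": "seal_recognition",
}

def route(region_label: str) -> str:
    # Fall back to plain OCR for labels without a dedicated recognizer.
    return LABEL_TO_TASK.get(region_label, "ocr")
```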

This matters because the annoying part of local document AI is often not the first model output. It is the correction loop:

  • Did the model split the table correctly?
  • Did the formula become usable LaTeX?
  • Did a header/footer get mixed into the body text?
  • Is the reading order still correct in a multi-column PDF?
  • Can I fix one block without losing the rest of the parse?

The new panel in X-AnyLabeling is built around that loop. You can import images or PDFs, view layout polygons over the source page, click between source regions and parsed blocks, edit normal text with a rich-text editor, edit formulas as LaTeX with preview, edit tables at the cell level, and inspect the saved JSON directly.
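For context, a single parsed block might look roughly like this in the saved JSON. This is a hypothetical shape for illustration; the real X-AnyLabeling schema may differ:

```python
# Hypothetical per-block record: enough structure that one block can be
# edited (and flagged as edited) without touching the rest of the parse.
import json

block = {
    "id": 3,
    "label": "display_formula",
    "polygon": [[120, 340], [480, 340], [480, 390], [120, 390]],
    "reading_index": 3,
    "content": r"E = mc^2",  # LaTeX for formulas, Markdown/HTML for tables
    "edited": True,          # tracked locally so corrections survive export
}

print(json.dumps(block, indent=2))
```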

There are two deployment paths:

  • Use the official PaddleOCR API for quick testing.
  • Use X-AnyLabeling-Server for a self-hosted/private deployment of the PP-DocLayoutV3 + PaddleOCR-VL-1.5 workflow.

For this subreddit, I think the self-hosted path is the more interesting one. PaddleOCR-VL-1.5 is small enough to be a practical document VLM candidate, while PP-DocLayoutV3 gives the pipeline a structured layout front end. The result is a hybrid setup: not just one giant VLM prompt over the whole page, and not just a traditional OCR pipeline either.
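If you go the self-hosted route, the call pattern is roughly "send a page, get back structured blocks". The endpoint path, payload fields, and response keys below are assumptions for illustration, not the documented X-AnyLabeling-Server API:

```python
# Hypothetical request against a self-hosted parsing endpoint.
import base64
import requests

with open("page_001.png", "rb") as f:
    payload = {
        "image": base64.b64encode(f.read()).decode("utf-8"),
        "pipeline": "pp-doclayoutv3+paddleocr-vl-1.5",  # made-up identifier
    }

resp = requests.post("http://localhost:8000/api/parse", json=payload, timeout=120)
resp.raise_for_status()
for block in resp.json().get("blocks", []):
    print(block.get("label"), (block.get("content") or "")[:60])
```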

What I like about this direction is that it treats OCR as a human-in-the-loop document parsing problem, not only a benchmark number. The model needs to be good, but the UI also needs to make mistakes visible, local, and cheap to fix.

Links:

For people here running local document AI pipelines: do you prefer a VLM-first document parser, a modular layout -> OCR/table/formula pipeline, or some hybrid of the two?

u/Important_Priority76 — 16 days ago

Hi r/computervision,

I have been looking more closely at few-shot object counting recently, and one thing that keeps standing out is how awkward the task becomes once the image has both dense small objects and large scale variation.

In many counting pipelines, small dense instances push you toward image upscaling or tiling. That helps recall, but it also makes the system heavier, introduces boundary effects, and can become painful when the same image contains objects at very different sizes. Merging multi-resolution backbone features sounds natural, but the hard part is still how to keep the query representation aware of the exemplars while preserving enough spatial detail for detection.
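For a sense of the cost, here is a generic sliding-window tiler, nothing model-specific: a 4000x3000 image at 1024 px tiles with 128 px overlap already means 20 forward passes, plus stitching detections across tile boundaries afterwards.

```python
# Generic tiler: more tiles means more inference passes, and objects cut at
# tile edges are exactly the boundary effects mentioned above.
def tile_image(width, height, tile=1024, overlap=128):
    step = tile - overlap
    tiles = []
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            tiles.append((x, y, min(x + tile, width), min(y + tile, height)))
    return tiles

print(len(tile_image(4000, 3000)))  # 20 tiles for one image
```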

This also changes how I think about general segmentation models like SAM 3. SAM 3 is very impressive as a unified promptable segmentation model: it can use text or visual prompts, detect/segment open-vocabulary concepts, and even extend the idea to video tracking. For many annotation tasks, that is exactly what you want: type a concept, click a box or point, get masks, refine, move on.

But for counting-heavy scenarios, I still see two obvious gaps:

  • Tiny dense instances are fragile. When the target objects are very small, visually repetitive, and packed together, a general concept segmentation model can miss instances, merge neighbors, or become sensitive to thresholds.
  • Latency matters. SAM-style foundation models are powerful, but the full pipeline can be heavy, especially when you need to run it over many images or repeatedly tune prompts inside an annotation loop.

That is why GeCo2 caught my attention. It is an AAAI 2026 few-shot counting/detection model that tries to handle the scale problem more directly. Instead of treating tiling/upscaling as the main path to high-resolution localization, GeCo2 builds a generalized-scale dense query map through gradual cross-scale query aggregation. In simpler terms, exemplar-specific information is injected and refined across multiple backbone resolutions, then fused into a high-resolution query map that can support both small crowded objects and larger instances.
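To illustrate only the aggregation idea, not GeCo2's actual architecture, here is a toy coarse-to-fine fusion in plain PyTorch: each scale's exemplar-aware feature map is upsampled and mixed into the next finer one, ending with a single high-resolution query map.

```python
# Toy cross-scale fusion. This is NOT the GeCo2 implementation, just a
# generic sketch of aggregating coarse-to-fine feature maps into one
# high-resolution dense query map.
import torch
import torch.nn.functional as F

def fuse_query_maps(per_scale_queries):
    """per_scale_queries: list of (B, C, Hi, Wi) tensors, coarse to fine."""
    fused = per_scale_queries[0]
    for finer in per_scale_queries[1:]:
        # Upsample the running aggregate to the finer resolution, then add the
        # finer-scale features so small, densely packed objects stay resolvable.
        fused = F.interpolate(fused, size=finer.shape[-2:], mode="bilinear",
                              align_corners=False)
        fused = fused + finer
    return fused  # (B, C, H_max, W_max)

maps = [torch.randn(1, 64, 32, 32),
        torch.randn(1, 64, 64, 64),
        torch.randn(1, 64, 128, 128)]
print(fuse_query_maps(maps).shape)  # torch.Size([1, 64, 128, 128])
```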

The parts I find especially interesting:

  • Detection-based counting: the output is not just a scalar count. You get object locations, which makes the result inspectable and editable.
  • Few-shot prompting: the target category is specified by a few exemplar boxes at test time, which is useful for categories that are too specific or too rare to justify training a dedicated detector.
  • Scale-aware query construction: the method focuses on the multi-scale matching problem instead of relying mainly on external image preprocessing tricks.
  • Practical efficiency: the paper reports better counting/detection accuracy while running faster and using less GPU memory than previous state-of-the-art few-shot counters.

I recently integrated GeCo2 into X-AnyLabeling through the remote inference workflow, mainly because counting is often only half of the real problem. In dataset work, I usually want the model to propose boxes, let a human inspect them, fix mistakes, and then export the annotations in a normal dataset format.

The current workflow is:

  1. Load an image.
  2. Select Remote-Server -> GECO2 in the auto-labeling panel.
  3. Draw one or more exemplar boxes around the target object.
  4. Run rectangle-prompt inference.
  5. Review the returned boxes/counts and adjust the confidence threshold if needed.

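Under the hood this is a remote inference call plus a confidence filter. A hedged sketch; the endpoint, payload fields, and response keys are my assumptions for illustration, not the documented X-AnyLabeling-Server API:

```python
# Hypothetical few-shot counting request: an image plus user-drawn exemplar
# boxes, with the count derived from thresholded detections.
import base64
import requests

with open("shelves.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "geco2",
    "image": image_b64,
    # Exemplar boxes drawn by the user, in pixel xyxy coordinates.
    "exemplar_boxes": [[412, 300, 468, 352], [510, 298, 566, 351]],
}

resp = requests.post("http://localhost:8000/api/infer", json=payload, timeout=60)
resp.raise_for_status()
detections = resp.json().get("boxes", [])

# Counting reduces to filtering detections by the confidence threshold.
threshold = 0.35
kept = [d for d in detections if d.get("score", 0.0) >= threshold]
print(f"count = {len(kept)}")
```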
So the model becomes less of a black-box counter and more of an annotation assistant: it proposes dense detections from a few examples, and the user keeps control over the final labels.

Links:

X-AnyLabeling at a glance:

| Area | Current coverage |
| --- | --- |
| Detection | YOLOv5/6/7/8/9/10/11/12/26, YOLOX, RT-DETR, RF-DETR, D-FINE, DEIMv2, and more |
| Segmentation | SAM 1/2/3, SAM-HQ, SAM-Med2D, EfficientViT-SAM, MobileSAM, YOLO-Seg variants |
| Grounding / open-vocabulary | Grounding DINO, YOLO-World, YOLOE |
| Object counting | CountGD, GeCo, GeCo2 |
| Other supported tasks | Pose, tracking, rotated boxes, OCR, document layout, depth, matting, anomaly detection, VLM-assisted labeling, video segmentation |
| Inference options | Local ONNX inference, TensorRT support for YOLO models, remote PyTorch inference through X-AnyLabeling-Server |
| Data formats | COCO, VOC, YOLO, DOTA, MOT, MASK, PPOCR, VLM-R1, ShareGPT, and more |

If you work on counting, dense detection, or annotation tooling, I would love feedback on the GeCo2 integration and on what other counting models/workflows would be worth supporting next.

u/Important_Priority76 — 17 days ago