I have been integrating PaddleOCR-VL-1.5 and PP-DocLayoutV3 into X-AnyLabeling, and I think this stack is interesting for the local/self-hosted document AI crowd.
The main thing it reminded me of: document parsing is not just text extraction.
For a clean receipt or a cropped text line, classic OCR may be enough. But once you move to papers, scanned PDFs, photographed documents, contracts, technical reports, tables, equations, charts, seals, headers, footers, and multi-column layouts, the problem starts to look more like a small VLM/document-understanding pipeline:
- Find the document elements.
- Preserve their geometry.
- Recover the reading order.
- Route each region to the right recognizer.
- Let a human verify and correct the structured result.
That is where the combination of PP-DocLayoutV3 and PaddleOCR-VL-1.5 is interesting.
PP-DocLayoutV3 handles the layout side. Instead of treating a page as a flat OCR canvas, it predicts document regions such as titles, paragraphs, tables, formulas, charts, images, seals, headers, footers, and page numbers. Recent descriptions of the model emphasize robustness to complex layouts and to physical distortions such as skew, curved pages, and uneven lighting, with reading-order prediction built into the layout analysis pipeline.
PaddleOCR-VL-1.5 handles the multimodal recognition side. It is a compact 0.9B multi-task VLM for document parsing, with official support for tasks such as OCR, table recognition, formula recognition, chart recognition, text spotting, and seal recognition. The model page reports strong results on OmniDocBench v1.5 and Real5-OmniDocBench, with particular focus on real-world distortions like scanning artifacts, skew, warping, screen photography, and illumination changes.
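To make that concrete, here is the minimal per-region record a pipeline like this has to carry around: geometry, a layout label, a reading-order index, and whatever the recognizer returned. A sketch in plain Python; the field names are mine, not the schema of either project:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """One detected document region. Illustrative schema only,
    not X-AnyLabeling's or PaddleOCR's actual data model."""
    polygon: list[tuple[float, float]]  # region geometry on the page, in pixels
    label: str                          # layout category, e.g. "table", "display_formula"
    order: int                          # reading-order index from layout analysis
    content: str = ""                   # recognizer output: text, HTML table, LaTeX, ...
    edited: bool = False                # set once a human has corrected this block

@dataclass
class Page:
    blocks: list[Block] = field(default_factory=list)

    def in_reading_order(self) -> list[Block]:
        return sorted(self.blocks, key=lambda b: b.order)
```

Everything downstream (routing, review, export) is then just a function over records like these.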
What I wanted in X-AnyLabeling was not another “upload PDF, get text” demo. I wanted a workflow where the local model output stays inspectable and editable, because that is where document parsing usually breaks in practice.
The practical workflow is:
| Step | What happens |
|---|---|
| Layout detection | PP-DocLayoutV3 identifies page blocks and layout categories |
| Task routing | Labels like `table`, `display_formula`, `chart`, `seal`, and `text` are routed to the matching PaddleOCR-VL-1.5 task |
| Recognition | PaddleOCR-VL-1.5 returns text, Markdown/HTML tables, LaTeX formulas, chart content, seal text, or text-spotting results |
| Review | The source page and parsed blocks are shown side-by-side for correction |
| Export | Results can be copied or saved as Markdown/JSON, with edited blocks tracked locally |
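The task-routing step is basically a lookup table over the layout labels. A minimal sketch of the idea, with label and task names along the lines of the table above; this is not X-AnyLabeling's actual routing code, and the real label set should be checked against the PP-DocLayoutV3 docs:

```python
# Hypothetical label -> task mapping, mirroring the workflow table above.
LABEL_TO_TASK = {
    "text": "ocr",
    "title": "ocr",
    "table": "table_recognition",
    "display_formula": "formula_recognition",
    "chart": "chart_recognition",
    "seal": "seal_recognition",
}

# Detected but kept out of the body text, so they don't pollute the parse.
SKIP_LABELS = {"header", "footer", "page_number"}

def route(label: str) -> str | None:
    """Return the PaddleOCR-VL-1.5 task for a layout label, or None to skip."""
    if label in SKIP_LABELS:
        return None
    return LABEL_TO_TASK.get(label, "ocr")  # unknown regions fall back to plain OCR
```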
This matters because the annoying part of local document AI is often not the first model output. It is the correction loop:
- Did the model split the table correctly?
- Did the formula become usable LaTeX?
- Did a header/footer get mixed into the body text?
- Is the reading order still correct in a multi-column PDF?
- Can I fix one block without losing the rest of the parse?
The new panel in X-AnyLabeling is built around that loop. You can:
- Import images or PDFs.
- View layout polygons over the source page.
- Click between source regions and parsed blocks.
- Edit normal text with a rich-text editor.
- Edit formulas as LaTeX, with preview.
- Edit tables at the cell level.
- Inspect the saved JSON directly.
There are two deployment paths:
- Use the official PaddleOCR API for quick testing.
- Use X-AnyLabeling-Server for a self-hosted/private deployment of the PP-DocLayoutV3 + PaddleOCR-VL-1.5 workflow.
For this subreddit, I think the self-hosted path is the more interesting one. PaddleOCR-VL-1.5 is small enough to be a practical document VLM candidate, while PP-DocLayoutV3 gives the pipeline a structured layout front end. The result is a hybrid setup: not just one giant VLM prompt over the whole page, and not just a traditional OCR pipeline either.
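If you want to sanity-check the recognizer before standing up a server, the PaddleOCR 3.x Python package exposes a PaddleOCR-VL pipeline. A minimal sketch, assuming PaddleOCR 3.3+; whether the default weights resolve to the 1.5 release or need an explicit model name may depend on your installed version:

```python
# pip install "paddleocr>=3.3" plus a paddlepaddle build for your hardware
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()                  # layout analysis + VL recognition in one pipeline
results = pipeline.predict("sample_page.png")

for res in results:
    res.print()                           # inspect the parsed blocks in the console
    res.save_to_json(save_path="out/")    # structured blocks, the input to the review loop
    res.save_to_markdown(save_path="out/")
```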
What I like about this direction is that it treats OCR as a human-in-the-loop document parsing problem, not only a benchmark number. The model needs to be good, but the UI also needs to make mistakes visible, local, and cheap to fix.
Links:
- X-AnyLabeling: https://github.com/CVHub520/X-AnyLabeling
- X-AnyLabeling docs: https://github.com/CVHub520/X-AnyLabeling/blob/main/docs/en/paddle_ocr.md
- PaddleOCR-VL-1.5 model: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
- PP-DocLayoutV3 docs: https://huggingface.co/docs/transformers/model_doc/pp_doclayout_v3
- X-AnyLabeling-Server: https://github.com/CVHub520/X-AnyLabeling-Server
For people here running local document AI pipelines: do you prefer a VLM-first document parser, a modular layout -> OCR/table/formula pipeline, or some hybrid of the two?