u/vroemboem — 5 hours ago

Web scraping -> entity resolution -> normalized model -> API serving layer pipeline

Data is fetched from a variety of sources: XML files from an FTP server, a public JSON API, web-scraped HTML pages, downloaded PDF pages that need OCR, ...

These sources contain data about private companies and their shareholders.

Entities need to be resolved: link two address observations if they refer to the same address, link two people observations if they refer to the same person, ...
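At its simplest, that kind of linking can be sketched as normalization plus a blocking key: observations whose normalized keys collide are grouped into one entity. This is a minimal hypothetical sketch (names and fields are made up); real resolution would layer fuzzy matching on top.

```typescript
// Hypothetical address observation; real sources would have more fields.
interface AddressObservation { source: string; street: string; city: string; }

// Collapse case, punctuation, and whitespace so trivial variants collide.
function addressKey(a: AddressObservation): string {
  const norm = (s: string) =>
    s.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, " ").trim();
  return `${norm(a.street)}|${norm(a.city)}`;
}

// Group observations by key; each group becomes one resolved entity.
function resolve(obs: AddressObservation[]): Map<string, AddressObservation[]> {
  const groups = new Map<string, AddressObservation[]>();
  for (const o of obs) {
    const k = addressKey(o);
    const g = groups.get(k) ?? [];
    g.push(o);
    groups.set(k, g);
  }
  return groups;
}
```

Exact-key blocking like this is also what makes the step tractable at 10M rows: fuzzy comparison then only needs to run within each block, not across all pairs.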

This needs to be brought together into one combined model.

This is followed by a very fast serving layer that powers my own API, consumed directly by users, an app, and an MCP server.

There is an initial load of about 10 million company and people rows, as well as 50 million PDF pages that need OCR. Every day about 10k elements are added.

Currently I'm doing this in PostgreSQL hosted on Railway, with DuckDB to perform the entity resolution. I have 260 GB of data in total.

I have a cron job for each source. These are the schemas: raw (separate schema for each source), xref (entity resolution), core (normalized) and mart (serving layer).
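The raw → xref → core → mart layering amounts to an ordered pipeline, which can be sketched as a simple runner with per-stage timing logs (a hypothetical sketch, not the actual code; stage bodies are placeholders):

```typescript
// Each stage is an async step that moves data one schema layer forward.
type StageFn = () => Promise<void>;

// Run stages in insertion order, logging the duration of each for
// basic observability over the pipeline.
async function runPipeline(
  stages: Record<string, StageFn>,
  log: (line: string) => void = console.log,
): Promise<void> {
  for (const [name, fn] of Object.entries(stages)) {
    const t0 = Date.now();
    await fn();
    log(`${name}: ${Date.now() - t0} ms`);
  }
}

// Usage (bodies elided):
// await runPipeline({
//   raw:  async () => { /* fetch each source into its raw.<source> schema */ },
//   xref: async () => { /* entity resolution */ },
//   core: async () => { /* normalized model */ },
//   mart: async () => { /* build serving tables */ },
// });
```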

I have one monorepo with all of the code; most of it is TypeScript with Bun.

My problem is that it has become hard to manage. Things feel a bit duct-taped, as I have little observability and no clear overview of the data pipeline. Additionally, initial loads can take many hours.

I was thinking Databricks could be a unified data platform to manage this from. One thing I'm not sure about is the scraping, as I don't think Databricks is really built for that.

Has anyone worked on a similar problem? How would you solve it?

reddit.com
u/vroemboem — 5 hours ago
▲ 2 r/pdf

Fast & cheap OCR on 50M PDF pages to build PDF search engine

I need to OCR 50M PDF pages in Dutch, French, and German. Most are typed text that was printed out and scanned in. Sometimes there's a stamp or a little handwriting, but capturing that information isn't important.

The aim would be to build a search engine on top of those PDFs. Not necessarily for AI, but just for humans to search PDFs based on the text in the PDFs.

I have a limited budget of less than 1k and would like to finish the job in under 4 days. I think most VLMs are probably too expensive to run at this scale with this budget?
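The budget and deadline imply hard per-page limits. A back-of-envelope check (assuming "1k" means 1,000 currency units):

```typescript
// What 50M pages, a <1k budget, and a 4-day deadline imply per page.
const pages = 50_000_000;
const seconds = 4 * 24 * 3600;               // 345,600 s in 4 days
const requiredPagesPerSec = pages / seconds; // ~145 pages/s sustained
const maxCostPerPage = 1000 / pages;         // 2e-5 per page
const maxCostPer1kPages = maxCostPerPage * 1000; // 0.02 per 1,000 pages
```

Roughly 145 pages/second sustained and at most 0.02 per 1,000 pages, which is indeed well below typical hosted-VLM pricing and points toward fast local OCR engines run in parallel.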

Options I'm looking at: Tesseract, Paddle OCR, Surya OCR, Mindee DocTR, Rapid OCR, ...

So far I'm leaning toward Rapid OCR with PP-OCRv5, but it seems optimized for Chinese, so I'm not sure it will work well for my languages.

Some VLMs I'm looking at, but they will probably be too slow and expensive: LightOnOCR 2 1B, SmolVLM-256M, HunyuanOCR 1B, Docling Granite, ...

Do I run these models natively, or is it better to go through something like Docling, PyMuPDF4LLM, Marker, ...? Or do those add a lot of overhead?

Any recommendations on how to run this in parallel?
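Whatever the engine, the parallelism I have in mind is a bounded worker pool over the page list. A minimal sketch (the per-page OCR call, e.g. shelling out to a CLI, is a placeholder):

```typescript
// Map `fn` over `items` with at most `limit` tasks in flight at once.
// Safe without locks: JS is single-threaded, so `next++` between awaits
// cannot race.
async function mapLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}

// Hypothetical usage, with ocrPage shelling out to an OCR CLI per page:
// const texts = await mapLimit(pdfPagePaths, 32, ocrPage);
```

The same pattern scales across machines by sharding the page list first and running one pool per box.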

Am I missing anything? Tips on how to build the search engine afterward?

reddit.com
u/vroemboem — 1 day ago
