Web scraping -> entity resolution -> normalized model -> API serving layer pipeline
Data is fetched from a variety of sources: XML files from an FTP server, a public JSON API, scraped HTML pages, downloaded PDFs that need OCR, ...
These sources contain data about private companies and their shareholders.
Entities need to be resolved: link two address observations if they refer to the same address, link two people observations if they refer to the same person, ...
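To make the xref step concrete, conceptually it does something like this (heavily simplified sketch; the field names are made up and the real matching runs in DuckDB, not in application code):

```typescript
// Illustrative address resolution via a crude blocking key.
type AddressObs = { id: string; street: string; city: string; zip: string };

// Normalize and combine fields into a blocking key so that
// trivially-different spellings of the same address collide.
function blockKey(a: AddressObs): string {
  const norm = (s: string) => s.toLowerCase().replace(/[^a-z0-9]/g, "");
  return `${norm(a.zip)}|${norm(a.street)}`;
}

// Group observations sharing a key into the same resolved entity.
function resolve(obs: AddressObs[]): Map<string, string[]> {
  const clusters = new Map<string, string[]>();
  for (const o of obs) {
    const k = blockKey(o);
    clusters.set(k, [...(clusters.get(k) ?? []), o.id]);
  }
  return clusters;
}
```

The real thing obviously needs fuzzier matching than exact keys, which is part of why it's done in DuckDB over the full dataset.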
All of this needs to be brought together into one combined model.
This feeds a very fast serving layer that powers my own API, consumed directly by users, an app, and an MCP server.
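The serving layer itself is thin: route a request, read a precomputed row from the mart schema, return JSON. Roughly this shape (routing only; entity names are illustrative, the real handler queries Postgres and plugs into Bun.serve):

```typescript
// Tiny router for the read-only mart API.
type Route = { entity: "companies" | "people"; id: string };

// Map a pathname like /companies/123 to a mart lookup.
function route(pathname: string): Route | null {
  const m = pathname.match(/^\/(companies|people)\/([^/]+)$/);
  if (!m) return null;
  return { entity: m[1] as Route["entity"], id: m[2] };
}

// Wired into Bun's HTTP server, roughly:
// Bun.serve({
//   port: 3000,
//   fetch(req) {
//     const r = route(new URL(req.url).pathname);
//     return r ? Response.json(r) : new Response("not found", { status: 404 });
//   },
// });
```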
There is an initial load of about 10 million company and people rows, as well as 50 million PDF pages that need OCR. Every day about 10k elements are added.
Currently I'm doing this in PostgreSQL hosted on Railway, with DuckDB to perform the entity resolution. I have 260 GB of data in total.
I have a cron job for each source. These are the schemas: raw (separate schema for each source), xref (entity resolution), core (normalized) and mart (serving layer).
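For a sense of how the per-source jobs are shaped, each cron entry is essentially an independent fetcher that lands rows in its own raw schema, and the downstream stages run after (sketch only; the interfaces and names here are hypothetical, not my actual code):

```typescript
// Illustrative shape of the per-source ingestion jobs.
interface SourceJob {
  name: string;                // e.g. "ftp_xml", "public_api"
  rawSchema: string;           // each source writes to its own raw schema
  run: () => Promise<number>;  // fetch + load, returns rows ingested
}

// Run every source fetcher, then the downstream stages would follow:
// raw -> xref (entity resolution in DuckDB) -> core (normalized) -> mart (serving).
async function runPipeline(jobs: SourceJob[]): Promise<number> {
  let total = 0;
  for (const job of jobs) {
    const rows = await job.run();
    console.log(`${job.name}: ${rows} rows -> ${job.rawSchema}`);
    total += rows;
  }
  return total;
}
```

In reality each source is its own cron job rather than one orchestrated loop, which is part of why I have no single overview of the pipeline.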
I have one monorepo with all of the code, most of it TypeScript with Bun.
My problem is that it has become hard to manage. Things feel a bit duct-taped since I have little observability and no clear overview of the data pipeline. On top of that, initial loads can take many hours.
Has anyone worked on a similar problem? How would you solve this?