u/mynameisyahiabakour

Hey r/rag,

I've been working on a lot of RAG / agent workflows lately and kept running into the same issue:

getting clean website data into the context window is way harder than it should be.

Most sites either:

  • return noisy HTML
  • block scrapers
  • convert to markdown badly
  • or require building a whole crawling pipeline just to ingest docs

So I ended up building an API for this, used by a few hundred companies in production today.

You can:

  • scrape any page as clean markdown
  • crawl an entire website
  • pull sitemaps
  • extract images/html
  • basically turn a website into LLM-ready context in one call

One thing I focused on heavily was making the markdown actually usable for RAG instead of just dumping raw DOM content.
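To illustrate what "usable for RAG instead of raw DOM" means, here's a minimal stdlib-only sketch (not the actual API's implementation, just the general idea): strip boilerplate tags like nav, script, and footer before converting, so the context window gets content rather than markup noise.

```python
# Hedged sketch: strip boilerplate tags and emit simple markdown.
# A production pipeline would handle many more cases (tables, links,
# nested lists, encoding issues); this only shows the core idea.
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer", "header", "aside"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

class MarkdownExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []
        self.skip_depth = 0   # >0 while inside a boilerplate tag
        self.prefix = ""      # markdown prefix for the current block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in HEADINGS:
            self.prefix = HEADINGS[tag]
        elif tag == "li":
            self.prefix = "- "

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in HEADINGS or tag in ("p", "li"):
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.lines.append(self.prefix + text)
            self.prefix = ""

def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    return "\n\n".join(parser.lines)

page = """<html><body>
<nav><a href='/'>Home</a></nav>
<h1>Docs</h1>
<p>Install with pip.</p>
<script>track();</script>
</body></html>"""
print(html_to_markdown(page))  # "# Docs" and the paragraph; nav/script dropped
```

The nav link and the tracking script never reach the output, which is the difference between "LLM-ready context" and dumping the DOM.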

Curious what everyone else here is using for live web ingestion / crawling in production right now.

API is here if anyone wants to try it.

Would genuinely love feedback from people building agent/RAG systems.

PS: I read the subreddit rules; it seems this is allowed at least once, since I've never posted here and usually just lurk :)
