u/Shot-Neighborhood332

Built a Fetch API that returns page labels, not just markdown
r/Rag

I'm working on a Fetch API for RAG, agents, and web ingestion workflows.

Think Firecrawl/Jina Reader-style URL-to-markdown or clean-text API, but with one extra signal layer: page labels for content category and page structure.

The pain point: fetching is only the first step. You still need to decide whether a page is useful, relevant, and worth sending into indexing, embedding, or an LLM pipeline.

Examples of labels we return:

  • dead link / main content missing → skip low-value pages early
  • homepage / index page vs content page → avoid mixing navigation/listing pages with real content
  • content category → keep vertical pipelines from indexing out-of-scope pages, e.g. a finance workflow pulling in random entertainment/forum pages

Our category labels cover broad areas like Finance, Health, News, Ecommerce, Education, Jobs, Travel, and more.
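To make the idea concrete, here is a rough sketch of what downstream filtering on these labels could look like. The field names (`page_status`, `page_type`, `category`) and their values are placeholders I made up for illustration, not the actual response schema, and `should_index` is a hypothetical helper:

```python
from dataclasses import dataclass


@dataclass
class FetchResult:
    """Hypothetical shape of one fetched page; field names are assumptions."""
    url: str
    markdown: str
    page_status: str  # assumed values: "ok", "dead_link", "main_content_missing"
    page_type: str    # assumed values: "content", "homepage", "index"
    category: str     # e.g. "Finance", "Health", "News"


def should_index(result: FetchResult, allowed_categories: set[str]) -> bool:
    """Return True only for live content pages in an allowed category."""
    if result.page_status != "ok":
        return False  # dead link or missing main content: skip early
    if result.page_type != "content":
        return False  # homepage / index pages are navigation, not content
    return result.category in allowed_categories


# Made-up sample data for a finance-only pipeline.
pages = [
    FetchResult("https://example.com/earnings-report", "# Q3 earnings", "ok", "content", "Finance"),
    FetchResult("https://example.com/", "# Home", "ok", "homepage", "Finance"),
    FetchResult("https://example.com/gone", "", "dead_link", "content", "Finance"),
    FetchResult("https://example.com/movie-thread", "# Best movies", "ok", "content", "Entertainment"),
]

kept = [p.url for p in pages if should_index(p, {"Finance"})]
print(kept)
```

The point is that the skip decision happens before any embedding or LLM call, so rejected pages cost nothing downstream.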

A couple of open questions:

  1. If you've already built filtering logic on top of a fetch API (skipping listing pages, filtering by topic, dropping dead links), what does that look like in your pipeline? Does moving this upstream actually save work, or does it just add a layer you'd rather control yourself?

  2. Beyond category and page structure, what other fields or labels would actually be useful in a fetch API response? Author, publish date, sentiment, product pricing, freshness signals...? Curious what's missing from current fetch tools for your pipeline.

Happy to share access if you want to try it. New signups get $5 in credit, which covers around 5k pages.