
Raw HTML in your prompts is probably costing you 3x in tokens and hurting output quality
Something I noticed after building a lot of LLM pipelines that fetch web content: most people pipe raw HTML directly into the prompt and wonder why the output is noisy or the costs are high.
A typical article page is 4,000 to 6,000 tokens as raw HTML. The actual content, the thing you want the model to reason over, is 1,200 to 1,800 tokens. Everything else is script tags, nav menus, cookie banners, footer links, ad containers. The model reads all of it. It affects output quality and you pay for every token.
I tested this on a set of news and documentation pages. Raw HTML averaged 5,200 tokens. After extraction, the same content averaged 1,590 tokens. That is roughly a 69% reduction with no meaningful information loss. On a pipeline running a few thousand fetches per day, the difference is significant.
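If you want to sanity-check the ratio on your own pages, a rough chars/4 heuristic is close enough to see the effect (use a real tokenizer like tiktoken for exact counts). The snippet and its numbers are illustrative, not my benchmark data:

```python
def approx_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# Synthetic page: a nav block full of links plus one real paragraph.
raw_html = (
    "<div class='nav'>" + "<a href='/x'>link</a>" * 200 + "</div>"
    + "<p>" + "actual article text. " * 80 + "</p>"
)
extracted = "actual article text. " * 80  # what an extractor would keep

before, after = approx_tokens(raw_html), approx_tokens(extracted)
print(before, after, f"{1 - after / before:.0%} reduction")
```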
The extraction logic scores each DOM node by text density, semantic tag weight and link ratio. Nodes that look like navigation or boilerplate score low and get stripped. What remains goes out as clean markdown that the model can parse without fighting HTML structure.
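As a rough sketch of that kind of scoring — the tag weights and heuristics below are made up for illustration, not webclaw's actual values, and a real extractor scores individual nodes rather than tag types:

```python
from html.parser import HTMLParser

# Hypothetical weights: content-bearing tags score high, boilerplate low.
TAG_WEIGHTS = {"article": 3.0, "p": 2.0, "section": 1.5, "div": 1.0,
               "header": 0.3, "aside": 0.2, "nav": 0.1, "footer": 0.1}

class NodeScorer(HTMLParser):
    """Collect per-tag-type stats: total text chars and link text chars."""
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open container tags
        self.stats = {}   # tag -> [text_chars, link_chars]
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)
            self.stats.setdefault(tag, [0, 0])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            self.stats[self.stack[-1]][0] += n
            if self.in_link:
                self.stats[self.stack[-1]][1] += n

def score(tag, text_chars, link_chars):
    """Text density x tag weight, penalised by link ratio: a node whose
    text is all anchor text (a menu) scores zero and gets stripped."""
    if text_chars == 0:
        return 0.0
    return TAG_WEIGHTS.get(tag, 1.0) * text_chars * (1 - link_chars / text_chars)

p = NodeScorer()
p.feed("""
<article><p>Long meaningful paragraph of actual article text that we
want to keep for the model to reason over.</p></article>
<nav><a href="/a">Home</a><a href="/b">About</a><a href="/c">Contact</a></nav>
""")
for tag, (t, l) in p.stats.items():
    print(tag, round(score(tag, t, l), 1))
```

The nav block's text is 100% link text, so it scores zero; the paragraph's link ratio is zero, so its full text density survives.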
There is a secondary issue with web fetching that is less obvious. If you are using requests or any standard HTTP library to fetch pages before putting content into a prompt, a lot of sites reject those requests at the TLS handshake. Not because of your IP, but because the TLS fingerprint looks nothing like a browser's. Cloudflare and similar systems check the cipher suite order and TLS extensions before reading your request. This means your pipeline silently fetches error pages or challenge redirects, and you end up prompting the model with garbage content. Rotating proxies does not fix this, because the fingerprint is generated client-side.
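You can see the client-side nature of this with nothing but the stdlib: the cipher list your TLS stack offers is fixed before any request is sent, and hashing it gives a toy JA3-style fingerprint. (Real JA3/JA4 fingerprints also cover TLS version, extensions, and curves; this just shows the principle.)

```python
import hashlib
import ssl

def client_fingerprint(ctx: ssl.SSLContext) -> str:
    """Toy JA3-style hash: digest of the cipher names this client would
    offer, in order. Determined entirely client-side, before any bytes
    leave the machine -- which is why proxies cannot change it."""
    ciphers = ",".join(c["name"] for c in ctx.get_ciphers())
    return hashlib.md5(ciphers.encode()).hexdigest()

default = ssl.create_default_context()
print(client_fingerprint(default))

# Trimming or reordering the cipher list changes the fingerprint,
# which is why a stock Python client and Chrome never hash the same.
trimmed = ssl.create_default_context()
trimmed.set_ciphers("ECDHE+AESGCM")
print(client_fingerprint(trimmed))
```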
I built a tool to handle both of these problems: it presents a browser-level TLS fingerprint without launching a browser and outputs clean markdown optimised for LLM context. I am the author, so disclosing that. It is open source (AGPL-3.0) and runs locally as a CLI or REST API: github.com/0xMassi/webclaw
Posting here because the token efficiency side feels directly relevant to prompt work, especially for RAG pipelines and agent loops where web content is part of the context.
Curious if others have run into the noisy HTML problem and how you handled it. Are you pre-processing web content before it hits the prompt, or passing raw content and relying on the model to filter?