The most common way developers add web knowledge to an LLM is also the worst way. They fetch a URL, grab the HTML, and dump it into the prompt as context. It works, technically. But you end up paying your model to read thousands of tokens of navigation menus, cookie consent banners, social sharing buttons, footer links, and inline ad scripts before it ever reaches the paragraph you actually cared about.

We built Neureil because we hit this problem ourselves. Here is everything we learned about why raw HTML is so bad for LLMs and what you should do instead.

What raw HTML actually looks like to a language model

Take a standard news article. The visible content, the actual story, might be 600 words. That is roughly 800 tokens. But the full HTML of a modern news page is something else entirely.

We tested a BBC article in April 2026. The article text itself was roughly 800 tokens, but the full HTML came out at 11,200 tokens, with the bulk of it going to navigation, scripts, and ad markup.

So roughly 7% of the tokens you are paying for are the content you need. The other 93% is scaffolding the model has to read through first.

On GPT-4o, input tokens cost $2.50 per million. At 11,200 tokens per page, 1,000 web page lookups per day with raw HTML works out to roughly $28 in input tokens, around $840 a month, and about 93% of that spend goes to tokens that provide zero signal to your model.
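
The arithmetic is simple enough to sanity-check yourself. This snippet just plugs in the figures above; the page size and pricing come from the BBC example, not from any universal constant.

// back-of-the-envelope cost of raw-HTML lookups, using the figures above
const TOKENS_PER_PAGE = 11200      // raw HTML, BBC example
const USEFUL_TOKENS = 800          // the article text itself
const PRICE_PER_MILLION = 2.5      // GPT-4o input pricing, USD
const LOOKUPS_PER_DAY = 1000

const dailyCost = (TOKENS_PER_PAGE * LOOKUPS_PER_DAY / 1e6) * PRICE_PER_MILLION
const noiseShare = 1 - USEFUL_TOKENS / TOKENS_PER_PAGE

console.log(`$${dailyCost.toFixed(2)} per day`)                  // ~$28.00
console.log(`${(noiseShare * 100).toFixed(0)}% spent on noise`)  // ~93%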

It is not just the cost

The token waste is the obvious problem, but there is a subtler issue. Language models have a finite context window. When you fill a large portion of it with navigation and scripts, you leave less room for what actually matters.

For a simple summarisation task this is annoying but manageable. For an agent that needs to reference multiple web pages in a single reasoning step, it becomes a hard blocker. You run out of context before you can give the model enough information to do the job.

There is also a quality problem. LLMs are not immune to noise. When you stuff a prompt with irrelevant content, the model's attention is genuinely split. We have seen cases where a model summarises the cookie consent notice instead of the article because the cookie text appeared first and had clear sentence structure. It sounds absurd but it happens.

The naive fix and why it is not enough

The first thing most developers try is stripping the HTML tags and passing plain text. This is better but still not good. Tag-stripped HTML still contains all the text content of nav menus, footers, and sidebars. You just removed the markup without removing the noise.

naive approach — still noisy
const html = await fetch(url).then(r => r.text())
const text = html.replace(/<[^>]*>/g, '') // removes tags, not noise
// still contains nav text, footer text, sidebar links, ad copy

The right solution is content extraction, not tag stripping. You need something that understands which parts of a page represent the main article or primary content versus the surrounding chrome. Mozilla's Readability library does this well for articles. For product pages, documentation, and structured content you need additional logic.
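
As a rough sketch of the open-source route, here is what article extraction looks like in Node with the @mozilla/readability and jsdom packages. This is not Neureil's internals, just one way to get from raw HTML to main content.

// readability extraction, a sketch using @mozilla/readability and jsdom
import { Readability } from '@mozilla/readability'
import { JSDOM } from 'jsdom'

const url = 'https://example.com/article'
const html = await fetch(url).then(r => r.text())

const dom = new JSDOM(html, { url })
const article = new Readability(dom.window.document).parse()

// article is null when no main content container is found; otherwise it
// carries title, byline, content (HTML) and textContent (plain text)
console.log(article?.title)
console.log(article?.textContent.slice(0, 200))

Readability returns null when it cannot identify an article container, which is exactly where the extra logic for product pages and documentation comes in.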

What a proper extraction pipeline looks like

A production-ready pipeline for feeding web data to an LLM needs to handle a few things well.

Content identification. It needs to find the main body of the page, not just strip noise. This means understanding document structure, identifying article containers, and ignoring peripheral content like related posts sections and sidebar widgets.

JavaScript rendering. A huge portion of the web today renders its content client-side. If you just fetch HTML, sites built with React, Vue, or Next.js return near-empty pages. You need a rendering step for those.
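
Puppeteer is one way to handle that rendering step; the sketch below assumes it, but any headless browser that can return post-render HTML works the same way.

// render a client-side page before extraction, a sketch with puppeteer
import puppeteer from 'puppeteer'

const browser = await puppeteer.launch()
const page = await browser.newPage()

// wait for the network to go quiet so lazy-loaded content has a chance to arrive
await page.goto('https://example.com/spa-article', { waitUntil: 'networkidle0' })
const renderedHtml = await page.content()

await browser.close()
// renderedHtml now goes through the same extraction step as server-rendered pages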

Structured output. Plain text is better than raw HTML but an LLM can use your data more efficiently when it is structured. Title, author, published date, main content as separate fields. This also makes it easier to pass only what the model needs for a specific task.
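
The field names below mirror the API response shown later in this post, with illustrative values; the point is that separate fields let you hand a task only what it needs.

// structured extraction result, illustrative values
const page = {
  title: 'Example headline',
  author: 'Jane Doe',
  published: '2026-04-02',
  content: 'Main article text only, no navigation or footer copy.'
}

// a summarisation prompt can take just the title and body, nothing else
const prompt = `Summarise this article.\n\n${page.title}\n\n${page.content}`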

Caching. If your agent or pipeline might request the same URL more than once, caching saves both time and money. Most web content does not change within a 24-hour window.
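
A minimal in-memory version of that idea looks like the sketch below; a real pipeline would more likely sit this on Redis or an HTTP cache, and the 24-hour TTL is simply the window mentioned above.

// simple TTL cache keyed by URL, assuming content is stable for 24 hours
const cache = new Map()
const TTL_MS = 24 * 60 * 60 * 1000

async function cachedExtract(url, extract) {
  const hit = cache.get(url)
  if (hit && Date.now() - hit.at < TTL_MS) return hit.value  // still fresh, skip the fetch

  const value = await extract(url)           // whichever extraction step you use
  cache.set(url, { value, at: Date.now() })
  return value
}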

Using an API versus building it yourself

You can build this yourself. The core is not complicated. But there are a lot of edge cases that will consume your time. Sites that serve bot-detection challenges. Pages with lazy-loaded content. Sites that actively block headless browsers. Paywalled content. Malformed HTML that breaks parsers.

We spent six weeks ironing out those edge cases before we felt comfortable making Neureil public. If your core product is the AI logic rather than the web fetching, it probably makes sense to use an API for this layer.

with neureil — one call, clean output
const response = await fetch('https://api.neureil.com/extract', {
method: 'POST',
headers: {
'Authorization': 'Bearer YOUR_API_KEY',
'Content-Type': 'application/json'
},
body: JSON.stringify({ url: 'https://example.com/article' })
})
 
const { title, content, author, published } = await response.json()
 
// content is now clean, structured, and LLM-ready
// average 92% fewer tokens than raw HTML

The actual numbers

Raw HTML average: 8,420 tokens per page
After extraction: 674 tokens per page

These numbers come from our benchmark across 200 URLs covering news sites, developer documentation, e-commerce product pages, and blog posts. The 92% figure is an average. Documentation pages tend to be cleaner to begin with and see smaller reductions. News and e-commerce pages with heavy advertising often see reductions above 95%.

One practical thing you can do right now

If you are feeding web pages to an LLM today and you have not measured your token counts, do that first. Pull up a few of the URLs your agent visits most often and count the tokens in the raw HTML versus the clean extracted text. The number is almost always shocking the first time you see it.
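
If you do not have a tokenizer wired up, even a crude estimate shows the gap. The sketch below uses the rough four-characters-per-token heuristic rather than a model-specific tokenizer, so treat the numbers as ballpark.

// ballpark token counts for one URL: raw HTML vs tag-stripped text
const estimateTokens = (text) => Math.ceil(text.length / 4)  // rough heuristic

const rawHtml = await fetch('https://example.com/article').then(r => r.text())
const stripped = rawHtml.replace(/<[^>]*>/g, '')   // the naive strip from earlier

console.log('raw html :', estimateTokens(rawHtml))
console.log('stripped :', estimateTokens(stripped))
// run your real extractor on the same URL to get the third number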

Once you see the waste, the fix is straightforward. The extraction logic is not complex, and whether you build it or use an API, the payoff in both cost and quality is immediate.

Try Neureil free

500 free extractions, no card required. See the token difference on your own URLs in under a minute.
