Crawl budget: how to help Googlebot crawl your site faster

Crawl budget is the amount of resources Googlebot allocates to crawling the pages on your site. Google sets a time and request limit per site based on authority, server speed and content quality. For small sites crawl budget is not a problem, because Google crawls everything quickly. For large sites with 10,000+ pages it becomes a serious strategic question.

What crawl budget is and how it is set

Google calculates crawl budget automatically. The main factors are: site authority and quality (more authority means more budget); server speed and reliability (fast responses and few errors allow more requests); past crawl results (consistently high quality builds Google's trust); and Google's own overall resources.

Crawl budget has two parts: crawl rate limit (the maximum request speed that avoids overloading your server) and crawl demand (how often Google wants to recrawl your pages). Together they form the total budget. You can influence both through technical SEO and content quality.

When crawl budget becomes a problem

For small sites (up to 1,000 pages) it is usually not a problem: Google crawls everything in 1-2 weeks. It becomes a problem for e-commerce sites (10,000+ products, filtered views), large blog and news sites (new pages daily), forums and UGC sites (millions of user pages), and large enterprise sites (catalogue, sections, multiple languages).

In these cases Googlebot cannot crawl everything, so indexing of new or important pages slows down. A new product may appear in Google only after 1-2 months; a blog post may wait weeks.

What wastes crawl budget

First, low-quality or unnecessary pages. Old blog posts that are never refreshed and bring no traffic, test pages and archive pages all eat Googlebot's time.

Second, duplicate content and URL parameters. With URLs like site.com/products?color=red&sort=price, each parameter combination is a separate URL that Googlebot may crawl, even though the content is almost the same.
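To see how many crawlable URLs collapse into the same page, you can normalise a URL list and group the variants. The following is a minimal Python sketch, not a definitive implementation. The set IGNORED_PARAMS and the function names are assumptions for illustration; which parameters really leave the content unchanged depends on your site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these parameters only change presentation or tracking, not content.
IGNORED_PARAMS = {"color", "sort", "utm_source", "utm_medium", "utm_campaign"}

def canonical_url(url: str) -> str:
    """Drop ignored parameters and sort the rest so variants map to one URL."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def group_variants(urls):
    """Map each canonical URL to the list of crawled variants that share it."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return groups

crawled = [
    "https://site.com/products?color=red&sort=price",
    "https://site.com/products?sort=name&color=blue",
    "https://site.com/products",
]
for canonical, variants in group_variants(crawled).items():
    print(canonical, len(variants))
```

Groups with many variants are the first candidates for canonical tags or robots.txt rules.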

Third, site structure. If important pages are deep (5+ clicks from home), Googlebot reaches them late. Faceted navigation is a common issue.
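Click depth can be measured with a breadth-first search over the internal link graph. Here is a small Python sketch under stated assumptions: the links dictionary is a hypothetical example, and on a real site it would come from a crawler export.

```python
from collections import deque

def click_depths(links, home):
    """Breadth-first search from the home page; depth is the minimum number of clicks."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/catalog", "/blog"],
    "/catalog": ["/catalog/shoes"],
    "/catalog/shoes": ["/catalog/shoes/page-2"],
    "/catalog/shoes/page-2": ["/catalog/shoes/page-3"],
    "/catalog/shoes/page-3": ["/product/123"],
}

depths = click_depths(links, "/")
# Pages at 5+ clicks from home are the ones Googlebot is likely to reach late.
deep_pages = sorted(page for page, depth in depths.items() if depth >= 5)
print(deep_pages)
```

Pages that never appear in the result are not reachable through internal links at all, which is worth checking separately.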

First step of optimisation

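Start with an audit. In Google Search Console (GSC), open the "Pages" report and look at the excluded (not indexed) URLs. "Crawled - currently not indexed" marks pages that used Googlebot's time without being indexed; improve their quality or remove them. "Discovered - currently not indexed" marks URLs Google knows about but has not crawled yet, which on a large site often signals that crawl budget is running short.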

Next, server log analysis. Tools such as Screaming Frog Log File Analyser, Botify and JetOctopus show which pages Googlebot requests and when, so you can see where crawl time is going.
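If you want a quick look before buying a tool, a short script gives a rough picture. The Python sketch below assumes access logs in the common "combined" format and counts Googlebot requests per top-level section. The sample lines are invented for illustration, and matching on the user-agent string alone is an assumption: the string can be spoofed, so a serious audit should also verify the requesting IP.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user agent of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count requests per top-level section for lines whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        path = match.group("path").split("?", 1)[0]
        section = "/" + path.strip("/").split("/", 1)[0]
        hits[section] += 1
    return hits

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    f'66.249.66.1 - - [10/Mar/2025:06:25:14 +0000] "GET /products?color=red HTTP/1.1" 200 5123 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Mar/2025:06:25:15 +0000] "GET /products?color=blue HTTP/1.1" 200 5110 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Mar/2025:06:25:19 +0000] "GET /blog/crawl-budget HTTP/1.1" 200 8040 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.7 - - [10/Mar/2025:06:26:01 +0000] "GET /blog/crawl-budget HTTP/1.1" 200 8040 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for section, count in googlebot_hits(SAMPLE_LINES).most_common():
    print(section, count)
```

A section that receives many Googlebot requests but brings little traffic is where crawl budget is leaking.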

Managing via robots.txt and noindex

Use a robots.txt "Disallow" rule for pages Googlebot should not crawl at all: user profiles, cart, internal search results, login. Googlebot then spends no requests on them.
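Before deploying such rules it helps to test them against real paths. The Python sketch below uses the standard library's urllib.robotparser for this. The paths in ROBOTS_TXT and the site.com host are assumptions for illustration, and the standard-library parser does simple prefix matching, so it may not reproduce every detail of how Googlebot reads wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for the page types named above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profile/
Disallow: /cart
Disallow: /search
Disallow: /login
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_crawlable(path, agent="Googlebot"):
    """Return True if the rules above allow the agent to fetch the path."""
    return parser.can_fetch(agent, "https://site.com" + path)

for path in ["/products/shoes", "/cart", "/search?q=shoes", "/profile/42"]:
    print(path, is_crawlable(path))
```

Running a list of important URLs through such a check is a cheap guard against accidentally blocking pages you want indexed.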

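The noindex meta tag lets Googlebot crawl a page but keeps it out of the index. Use it for old pages. Note: noindex does not save crawl budget, because the page still has to be fetched; robots.txt does. Do not combine the two on the same URL, since Googlebot cannot see a noindex tag on a page it is not allowed to crawl.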

XML sitemap strategy

The sitemap is a list of important pages for Google. Include only indexable, quality, current pages. Remove old or weak ones.

Segmenting helps: products-sitemap.xml, blog-sitemap.xml, categories-sitemap.xml. Google gets a clear structure; you get per-segment analytics.
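One possible way to produce segmented sitemaps is sketched below in Python with the standard library's ElementTree. The path prefixes, the site.com URLs and the function names are assumptions for illustration; a real generator would read URLs from your CMS or database and write each segment to its own file.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def segment_urls(urls, names=("products", "blog", "categories")):
    """Assign each URL to a segment by its first path component (assumed layout)."""
    segments = {name: [] for name in names}
    for url in urls:
        first = urlsplit(url).path.strip("/").split("/", 1)[0]
        if first in segments:
            segments[first].append(url)
    return segments

def build_sitemap(root_tag, item_tag, locations):
    """Serialise a <urlset> or <sitemapindex> document listing the given locations."""
    root = ET.Element(root_tag, xmlns=SITEMAP_NS)
    for loc in locations:
        ET.SubElement(ET.SubElement(root, item_tag), "loc").text = loc
    return ET.tostring(root, encoding="unicode")

urls = [
    "https://site.com/products/red-shoes",
    "https://site.com/blog/crawl-budget",
    "https://site.com/categories/shoes",
    "https://site.com/products/blue-shoes",
]
segments = segment_urls(urls)
files = {f"{name}-sitemap.xml": build_sitemap("urlset", "url", locs)
         for name, locs in segments.items()}
index = build_sitemap("sitemapindex", "sitemap",
                      [f"https://site.com/{filename}" for filename in files])
print(index)
```

The index file is the one you would submit in GSC; each segment then reports its own indexed count.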

Internal links and structure

Internal links are Googlebot's signposts. If an important page is 1-2 clicks from home and many pages link to it, Google treats it as important and crawls it more often.

Strategy: pillar/cluster structure, breadcrumbs, related posts, and links from categories to topical content. This improves UX and uses crawl budget efficiently at the same time.

Sayt.uz experience

Sayt.uz currently has ~200 pages, so crawl budget isn't a problem. But we are planning 500+ blog posts and 1,000+ offer pages. At that point the strategy will matter.

Right now: /admin, /cabinet, /api and /tmp are closed in robots.txt. The XML sitemap is generated dynamically and lists only active, quality pages. Internal links follow the pillar/cluster model. This is the foundation, so that crawl budget should not become a problem at 1,000+ pages.