robots.txt Complete Guide: Syntax, Directives, and Examples for Site Owners

robots.txt is a plain text file at the root of your domain that tells search engine crawlers which sections of the site they may fetch and which they should leave alone. Martijn Koster proposed the standard in 1994, and virtually every major search engine has honored it since, although it was only formally codified as RFC 9309 in 2022. A robots.txt file is not required for indexing, but a well-crafted one is an effective tool for managing crawl budget and steering crawlers away from low-value pages. It controls crawling rather than indexing, so on its own it is not a reliable way to keep a URL out of search results.

File location and technical requirements

The file must sit at the root of the host and be reachable at https://example.com/robots.txt. Crawlers do not look for it anywhere else, so a file placed in a subdirectory is ignored and the site is crawled without restrictions. It should be encoded in UTF-8 and stay under 500 kilobytes, because Googlebot stops reading at that limit and ignores any rules beyond it. A robots.txt file applies only to the protocol, host, and port it is served from, so each subdomain needs its own file: rules written for example.com do not apply to shop.example.com.
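
As a minimal sketch of the location and size rules above, the Python helpers below derive the robots.txt URL for any page and check a fetched body against the size limit. The function names and the constant are hypothetical, and the limit is taken here as 500 * 1024 bytes, which is an assumption about how the 500 kilobyte figure is counted.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the 500 kilobyte limit is counted as 500 * 1024 bytes.
MAX_ROBOTS_BYTES = 500 * 1024


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL.

    Only the scheme and host (with port, if any) are kept, so every
    subdomain maps to its own robots.txt file.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def within_size_limit(body: bytes) -> bool:
    """Return True if the fetched robots.txt body fits under the limit."""
    return len(body) <= MAX_ROBOTS_BYTES
```

For instance, robots_url("https://shop.example.com/cart") yields https://shop.example.com/robots.txt rather than the file on example.com.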

Core directives and syntax rules

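The User-agent directive specifies which crawler the following rules apply to. An asterisk targets all bots, while a specific name such as Googlebot or Bingbot scopes the rules to that particular agent, and a crawler that finds a group with its own name follows that group instead of the asterisk group. Disallow blocks access to the specified path, while Allow acts as an exception that re-opens a subpath inside an otherwise blocked directory. When both match a URL, Google and RFC 9309 apply the most specific rule, meaning the one with the longest path. The Sitemap directive provides the full URL of your XML sitemap and is independent of any User-agent group. Crawl-delay defines the minimum number of seconds between requests, but support is uneven: Google does not honor it (it was never a documented Googlebot rule, and in 2019 Google retired its handling of all unsupported rules), Bing still respects it, and Yandex dropped support in 2018 in favor of a crawl rate setting in its webmaster tools.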
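
The directives above can be tried out with Python's standard urllib.robotparser module. The sketch below is illustrative only: the file content, paths, and bot name "SomeBot" are hypothetical. Note that Python's parser applies the first matching rule in a group rather than the longest one, which is why the Allow line is listed before the Disallow line here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; paths and sitemap URL are illustrative.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit/
Disallow: /private/
Crawl-delay: 5

User-agent: Bingbot
Disallow: /search/

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
base = "https://example.com"

# The "*" group: /private/ is blocked, the press kit is re-opened by Allow.
blocked = rules.can_fetch("SomeBot", base + "/private/report.html")
reopened = rules.can_fetch("SomeBot", base + "/private/press-kit/logo.png")

# Bingbot has its own group, so the "*" rules do not apply to it.
bing_private = rules.can_fetch("Bingbot", base + "/private/report.html")
bing_search = rules.can_fetch("Bingbot", base + "/search/results")

delay = rules.crawl_delay("SomeBot")  # seconds from the "*" group
sitemaps = rules.site_maps()          # list of Sitemap URLs (Python 3.8+)
```

Here blocked and bing_search come out False, reopened and bing_private come out True, and delay is 5.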

robots.txt versus meta robots versus X-Robots-Tag

Confusing these three instruments is a common conceptual mistake among site owners. A Disallow rule in robots.txt prevents the crawler from fetching the page at all, yet the URL can still appear in search results, usually without a snippet, if other sites link to it. The meta robots tag and the X-Robots-Tag HTTP header operate at the page level, and a noindex directive in either one removes the page from the index, but only if the crawler is allowed to fetch the page and see the directive. Therefore, if your goal is to keep a page out of search results, use noindex rather than Disallow, and do not combine the two on the same URL, because a blocked page never delivers its noindex.
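
To make the page-level signals concrete, here is a rough Python sketch that checks a response for noindex in either the X-Robots-Tag header or a meta robots tag. The function and class names are hypothetical, and the header parsing is simplified: it ignores crawler-specific prefixes such as "googlebot: noindex" and crawler-specific meta names.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(headers: dict, html: str) -> bool:
    """Return True if the header or the HTML carries a noindex directive."""
    header_tokens = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_tokens.update(t.strip().lower() for t in value.split(","))

    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in header_tokens or "noindex" in parser.directives
```

A crawler can only run a check like this on pages it is permitted to fetch, which is why a Disallow rule on the same URL defeats the noindex.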

Common mistakes and consequences

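The most dangerous error is leaving a stray Disallow: / line in production after migrating from a staging environment. It blocks crawling of the entire site, and pages then lose their snippets and drop out of search results over time. Equally common is blocking directories that contain CSS or JavaScript files, which prevents Google from rendering the page properly. Many owners also try to hide admin panels via robots.txt, not realizing that the file itself is publicly readable, so listing a sensitive path there advertises it rather than protecting it. Access control belongs on the server, not in robots.txt.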
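
One way to catch the stray Disallow: / problem before it reaches production is a small pre-deployment check. The sketch below uses Python's urllib.robotparser; the function name and the generic bot name "AnyBot" are hypothetical, and the check only covers the homepage for crawlers that fall under the given user agent.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "AnyBot") -> bool:
    """Return True if the given crawler may not even fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # The host is irrelevant here; only the path "/" is matched.
    return not parser.can_fetch(user_agent, "https://example.com/")
```

A deployment script could call blocks_entire_site on the file about to be published and abort when it returns True.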

WordPress and e-commerce configurations

For a typical WordPress installation, it is customary to block the /wp-admin/ directory while keeping /wp-admin/admin-ajax.php open with an Allow rule. Older guides also close off /wp-includes/ and /wp-content/plugins/, but those directories hold CSS and JavaScript that Google needs for rendering, so blocking them wholesale is no longer advisable. /wp-content/uploads/ should remain open because that is where images live. The biggest pain point in e-commerce SEO is the explosion of duplicate URLs caused by filters and sorting parameters. The solution is to block parametric URLs with rules like Disallow: /*?filter=, adding Disallow: /*&filter= to cover URLs where the parameter is not first in the query string, and to close off the cart, checkout, and account areas.
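
Python's built-in robots.txt parser does not understand the * wildcard, so the sketch below uses a small hand-written matcher to show how rules like these behave. It is a simplified model of the longest-match precedence described in RFC 9309, not a full implementation. The function names are hypothetical, and the /cart/, /checkout/, and /account/ paths as well as the /*&filter= rule are assumptions that will differ between shops.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' and trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """Apply longest-match precedence; on a tie, the Allow rule wins.

    rules is a list of ("allow" | "disallow", pattern) pairs for one group,
    and path is the URL path including any query string.
    """
    best_len, verdict = -1, True  # no matching rule means the URL is allowed
    for kind, pattern in rules:
        if not pattern or not pattern_to_regex(pattern).match(path):
            continue
        allow = kind == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and allow):
            best_len, verdict = len(pattern), allow
    return verdict


# Hypothetical rule set for a WordPress shop.
RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
    ("disallow", "/*?filter="),
    ("disallow", "/*&filter="),
    ("disallow", "/cart/"),
    ("disallow", "/checkout/"),
    ("disallow", "/account/"),
]
```

With these rules, /wp-admin/admin-ajax.php and /wp-content/uploads/logo.png stay crawlable, while /wp-admin/options.php, /shoes?filter=red, and /shoes?sort=price&filter=red are blocked.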

Validation through Google Search Console

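After any change, the file should be checked in Search Console. The standalone robots.txt Tester was retired in 2023; its replacement is the robots.txt report, which shows the files Google fetched for your site along with any parsing warnings or errors, and the URL Inspection tool reports whether a given URL is blocked. Google has also added new user agent tokens, including Google-Extended, introduced in 2023, which controls whether your content is used to train Google's generative AI models. It is an opt-out control: if you do not want your content used for AI training, you must explicitly add a block with User-agent: Google-Extended and Disallow: / to your robots.txt. Doing so does not affect how Googlebot crawls the site or how pages appear in Google Search.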
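
As a quick sanity check of such an opt-out, the Python sketch below parses a hypothetical robots.txt with urllib.robotparser and confirms that the Google-Extended block does not spill over to Googlebot. The file content and the article URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: opt out of AI training, keep normal crawling open.
AI_OPT_OUT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(AI_OPT_OUT.splitlines())

article = "https://example.com/blog/post"
extended_ok = parser.can_fetch("Google-Extended", article)  # AI training token
googlebot_ok = parser.can_fetch("Googlebot", article)       # regular search crawler
```

Here extended_ok is False while googlebot_ok is True, which is the intended split between AI training and ordinary search crawling.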