What you will learn
- Robots.txt configuration, common mistakes, testing, and how to control what search engines can access.
- Practical understanding of robots.txt and how it applies to real websites
- Key concepts of robots.txt syntax and its role in SEO
Quick Answer
Robots.txt is a plain text file placed at the root of your website that tells search engine crawlers which pages or sections they are allowed or not allowed to crawl. It does not prevent indexing directly, but it controls crawler access to your server resources and helps manage your crawl budget.
What is Robots.txt?
The robots.txt file is the first thing any well-behaved search engine crawler checks before accessing your site. It lives at yourdomain.com/robots.txt and follows the Robots Exclusion Protocol, a convention in use since 1994. In 2022, the IETF (with Google's involvement) formalized it as an internet standard, RFC 9309.
According to Ahrefs, 72.6% of websites have a robots.txt file, but nearly 26% of those contain at least one error that could impact crawling (Ahrefs, 2024). Getting this file right is one of the simplest yet most impactful technical SEO tasks.
Robots.txt Syntax
The file uses a simple syntax with just a few directives. Here are the key ones:
```
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /search?
Allow: /admin/public/

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
Key Directives
| Directive | Purpose | Example |
|---|---|---|
| User-agent | Specifies which crawler the rules apply to | User-agent: Googlebot |
| Disallow | Blocks a path from being crawled | Disallow: /private/ |
| Allow | Overrides a Disallow for a specific path | Allow: /private/public-page |
| Sitemap | Points crawlers to your XML sitemap | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Requests a delay between crawl requests (not supported by Google) | Crawl-delay: 10 |
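Disallow rules can also be checked programmatically. Python's standard-library `urllib.robotparser` implements the basic protocol (note that it applies rules in file order rather than RFC 9309's longest-match precedence, and it does not support the `*`/`$` wildcard extensions), which makes it handy for quick sanity checks:

```python
from urllib import robotparser

# Parse an in-memory robots.txt rather than fetching one over HTTP
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /cart/
""".splitlines())

# A path covered by a Disallow rule is blocked
print(rp.can_fetch("*", "https://example.com/cart/checkout"))  # False
# No rule applies, so crawling is allowed by default
print(rp.can_fetch("*", "https://example.com/blog/post"))      # True
```

For live sites, `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` fetches and parses the real file.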
Wildcard Patterns
Robots.txt supports two wildcard characters: the asterisk (*) matches any sequence of characters, and the dollar sign ($) marks the end of a URL. These are powerful for managing complex URL patterns.
```
# Block all PDF files
Disallow: /*.pdf$

# Block URLs with parameters
Disallow: /*?*sort=
Disallow: /*?*filter=

# Block a specific pattern
Disallow: /products/*/reviews
```
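Python's standard-library robots.txt parser does not understand these wildcard extensions, so here is a minimal, illustrative sketch (the function names are my own, not a standard API) of how `*` and `$` rules can be translated into regular expressions for testing:

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Convert a robots.txt path rule with * and $ into an anchored regex."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        # A trailing $ anchors the rule to the end of the URL
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

def rule_matches(rule: str, path: str) -> bool:
    # robots.txt rules match from the start of the path
    return rule_to_regex(rule).match(path) is not None

print(rule_matches("/*.pdf$", "/files/report.pdf"))                  # True
print(rule_matches("/*.pdf$", "/files/report.pdf?v=2"))              # False
print(rule_matches("/*?*sort=", "/products?color=red&sort=price"))   # True
```

Real crawlers differ in edge cases (percent-encoding, rule precedence), so always confirm behavior with Google's own tools before relying on a pattern.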
Common Robots.txt Mistakes
Quick Answer
The most dangerous robots.txt mistake is accidentally blocking important pages or entire sections of your site. A single misplaced Disallow rule can de-index thousands of pages. Other common errors include blocking CSS and JavaScript files that Google needs for rendering, and confusing Disallow with noindex.
Mistake 1: Blocking CSS and JS Files
Google needs to render your pages to understand their content and layout. If you block CSS or JavaScript files in robots.txt, Google cannot properly render your pages. A Google study found that 28% of pages with blocked resources had rendering issues that affected their rankings (Google, 2023).
Mistake 2: Using Disallow as Noindex
Disallow prevents crawling, not indexing. If other sites link to a page you have blocked in robots.txt, Google may still index the URL (showing a title and snippet from external signals) without ever visiting it. According to a Moz analysis, 12% of URLs blocked by robots.txt still appear in Google search results (Moz, 2024). To truly prevent indexing, use the noindex meta tag or X-Robots-Tag HTTP header.
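For reference, the two noindex mechanisms look like this (the page must remain crawlable, otherwise Google never sees either signal):

```
# HTTP response header (works for any content type, including PDFs):
X-Robots-Tag: noindex

# HTML meta tag (HTML pages only):
<meta name="robots" content="noindex">
```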
Mistake 3: Blocking the Entire Site
The directive Disallow: / blocks the entire site from crawling. This is appropriate only for staging or development environments. Accidentally deploying this to production will cause your entire site to be de-indexed within days. Semrush found that 2.3% of websites they audited had overly restrictive robots.txt rules blocking important content (Semrush, 2024).
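A staging environment's robots.txt is typically just:

```
# Staging only: never deploy this file to production
User-agent: *
Disallow: /
```

Guarding your deploy pipeline so this file cannot ship to production (for example, by generating robots.txt per environment) is cheap insurance against the de-indexing scenario above.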
Mistake 4: Wrong File Location
Robots.txt must be at the root domain level. A file at example.com/blog/robots.txt will be ignored. It must be at example.com/robots.txt. Each subdomain needs its own robots.txt file.
Testing Your Robots.txt
Always test robots.txt changes before deploying them. Google Search Console's robots.txt report (which replaced the standalone robots.txt Tester tool in late 2023) shows which robots.txt files Google has found and flags parsing errors, and the URL Inspection tool reveals whether a specific URL is blocked. You can also use command-line tools or online validators.
Best practices for testing:
- Test individual URLs against your rules before deploying changes
- Check your robots.txt for syntax errors using GSC or online validators
- Monitor the Page indexing (formerly Coverage) report in GSC after changes for unexpected drops
- Keep a changelog of robots.txt modifications
- Review server logs to confirm crawlers are respecting your rules
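For the last point, a small script can scan an access log for crawler hits on disallowed paths. This is an illustrative sketch (the prefixes and sample log lines are hypothetical; adjust the regex to your server's log format):

```python
import re

# Paths your robots.txt disallows (hypothetical example values)
DISALLOWED_PREFIXES = ["/admin/", "/cart/"]

# Matches the request path and user-agent in a combined-format access log line
LOG_LINE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+" \d+ \S+ "[^"]*" "([^"]*)"')

def violations(log_lines):
    """Return paths that Googlebot requested despite a Disallow rule."""
    hits = []
    for line in log_lines:
        m = LOG_LINE.search(line)
        if not m:
            continue
        path, user_agent = m.groups()
        if "Googlebot" in user_agent and any(path.startswith(p) for p in DISALLOWED_PREFIXES):
            hits.append(path)
    return hits

sample = [
    '66.249.66.1 - - [01/Jan/2025:00:00:00 +0000] "GET /admin/login HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [01/Jan/2025:00:00:01 +0000] "GET /blog/post HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]
print(violations(sample))  # ['/admin/login']
```

Note that anyone can send a Googlebot user-agent string, so for a rigorous audit you would also verify the requesting IP via reverse DNS.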
Key Takeaways
- Robots.txt controls crawler access but does not prevent indexing. Use noindex for that.
- 72.6% of websites have robots.txt, but 26% contain errors (Ahrefs, 2024).
- Never block CSS or JS files that Google needs for rendering your pages.
- Disallow: / blocks the entire site. Only use it on staging environments.
- Always test changes in Google Search Console before deploying to production.