Meta robots tags and the robots.txt file are essential tools that give you control over how search engines crawl and index your website. Whether you need to block private pages, prevent duplicate content, or optimize crawl budget, understanding how these directives work will help you manage SEO visibility effectively.
What Is a Meta Robots Tag?
The meta robots tag is an HTML element placed in the <head> section of a page. It tells search engines whether to index that page and whether to follow the links on it.
Example:
<meta name="robots" content="noindex, nofollow" />
This line instructs all search engines not to index the page and not to follow any links on it.
Common Values for the content Attribute
Here are the most frequently used values and what they mean:
| Value | Description |
|---|---|
| index | Allow indexing of the page (default). |
| noindex | Prevent indexing; the page will not appear in search results. |
| follow | Allow search engines to follow links on the page (default). |
| nofollow | Prevent search engines from following any links on the page. |
| noarchive | Prevent displaying a cached version of the page in search results. |
| nosnippet | Disable text snippets and previews in search results. |
| noimageindex | Prevent images on the page from appearing in Google Images. |
| notranslate | Block translation links in search results. |
| max-snippet:-1 | No limit on snippet length (you can also set a specific number of characters). |
| max-image-preview:large | Allow large image previews in search results. |
| max-video-preview:-1 | Allow video previews without limitation. |
You can combine multiple directives:
<meta name="robots" content="noindex, follow, noarchive" />
➡️ This means:
Do not index the page, but do follow the links and do not store a cached copy.
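To make the combination rule concrete, here is a minimal Python sketch of how a content value could be read. The function name parse_robots_content and the keys of the returned dictionary are hypothetical, chosen only for illustration; real crawlers apply their own, more detailed logic.

```python
def parse_robots_content(content: str) -> dict:
    """Interpret a meta robots content value (illustrative sketch only)."""
    # Directives are comma-separated; normalize case and whitespace.
    directives = {part.strip().lower() for part in content.split(",") if part.strip()}
    return {
        # index and follow are the defaults, so only the "no..." forms change anything.
        "index": "noindex" not in directives,
        "follow": "nofollow" not in directives,
        "archive": "noarchive" not in directives,
    }


print(parse_robots_content("noindex, follow, noarchive"))
# {'index': False, 'follow': True, 'archive': False}
```

An empty or missing value leaves every default in place, which matches the table above: index and follow apply unless you say otherwise.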
Targeting Specific Search Engines
You can specify which crawler should follow your rules. For example, only block Googlebot:
<meta name="googlebot" content="noindex, nofollow" />
Or only apply rules for Bing:
<meta name="bingbot" content="noindex" />
If you use name="robots", it applies to all crawlers by default.
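The following Python sketch shows one way to see which directives a given crawler would encounter on a page. It assumes, purely for illustration, that the general robots tag and the crawler-specific tag are merged into one set; each search engine documents its own precedence rules. The class MetaRobotsCollector and the function directives_for are hypothetical names.

```python
from html.parser import HTMLParser


class MetaRobotsCollector(HTMLParser):
    """Collect the content of meta tags whose name is in a watched set."""

    def __init__(self, names):
        super().__init__()
        self.names = {name.lower() for name in names}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in self.names:
            self.found[name] = attributes.get("content") or ""


def directives_for(html: str, crawler: str) -> set:
    """Return the directives visible to `crawler`: general plus crawler-specific."""
    collector = MetaRobotsCollector(["robots", crawler])
    collector.feed(html)
    directives = set()
    for content in collector.found.values():
        directives.update(p.strip().lower() for p in content.split(",") if p.strip())
    return directives


PAGE = (
    '<head><meta name="robots" content="noarchive" />'
    '<meta name="googlebot" content="noindex, nofollow" /></head>'
)
print(sorted(directives_for(PAGE, "googlebot")))  # ['noarchive', 'nofollow', 'noindex']
print(sorted(directives_for(PAGE, "bingbot")))    # ['noarchive']
```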
Example: Prevent Indexing of a Thank You Page
A classic case is keeping a post-form-submission page (like a “Thank you” page) out of search results.
<head>
<meta name="robots" content="noindex, nofollow" />
<title>Thank you for contacting us!</title>
</head>
✅ Google and other search engines will not index this page and will not follow its links.
The Role of robots.txt
The robots.txt file is a text file placed in your site’s root directory (e.g. https://example.com/robots.txt). It defines which areas of your website crawlers are allowed or disallowed to visit.
Basic Example:
User-agent: *
Disallow: /admin/
Allow: /
Explanation:
- User-agent: * → applies to all crawlers
- Disallow: /admin/ → prevents crawlers from accessing the /admin/ directory
- Allow: / → allows crawling of the rest of the site
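You can check how a rule set like this is interpreted with Python's standard-library urllib.robotparser. This is a sketch only: the user agent name ExampleBot is made up, and Python's parser applies rules in file order, which may differ in edge cases from how a particular search engine resolves overlapping Allow and Disallow lines.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network request is made.
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under "User-agent: *".
print(parser.can_fetch("ExampleBot", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/blog/"))           # True
```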
Blocking a Single Page
User-agent: *
Disallow: /private-page.html
Compliant crawlers will no longer fetch https://example.com/private-page.html, although the URL can still show up in search results if other pages link to it.
Allowing Specific Crawlers
You can give different instructions for different bots:
User-agent: Googlebot
Disallow: /test/
User-agent: Bingbot
Allow: /
Googlebot will not crawl /test/, while Bingbot can crawl everything.
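The per-crawler behavior can be verified the same way. The sketch below feeds the rules above to Python's urllib.robotparser and asks about one URL on behalf of each bot; the helper name access_by_bot and the test URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Disallow: /test/

User-agent: Bingbot
Allow: /
"""


def access_by_bot(robots_txt: str, url: str, bots) -> dict:
    """Map each bot name to whether the rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


print(access_by_bot(RULES, "https://example.com/test/page.html", ["Googlebot", "Bingbot"]))
# {'Googlebot': False, 'Bingbot': True}
```

A crawler that matches neither group has no rules applied to it here, so it is allowed everywhere unless you add a User-agent: * group.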
Important Difference Between robots.txt and Meta Robots
| Aspect | robots.txt | Meta Robots Tag |
|---|---|---|
| Location | Root directory (/robots.txt) | Inside the HTML <head> |
| Affects | Crawling (access to pages) | Indexing (appearance in results) |
| Use Case | Restrict crawling large sections or files | Control indexing of specific pages |
| Visibility | Public | Page-specific |
| Can prevent indexing? | ❌ Not always (blocked URLs can still be indexed by URL reference) | ✅ Yes, reliable for removing pages from search results |
Best Practice:
If you need to prevent indexing, use the meta robots tag (noindex) and leave the page crawlable so the tag can be read; robots.txt alone is not enough. If you need to prevent crawling, use robots.txt.
Advanced Tip: Combining Both Approaches
You can combine them, but be aware of how the two interact.
robots.txt
-----------
User-agent: *
Disallow: /private/
HTML of /private/page.html
--------------------------
<meta name="robots" content="noindex, nofollow" />
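This gives you two layers, but they do not reinforce each other for well-behaved crawlers:
- Crawlers that obey robots.txt won’t visit /private/, so they never see the noindex tag, and the URL can still be indexed if other sites link to it
- Crawlers that fetch the page anyway will find the noindex tag and should not index it
If the goal is to keep a page out of search results, leave it crawlable and rely on noindex.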
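How the two layers interact can be checked with a small script. The Python sketch below combines the standard-library robots.txt parser with a simple meta tag scan and reports whether a noindex tag is actually reachable. The function name audit and its status strings are hypothetical, and the meta scan only looks at name="robots".

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class NoindexFinder(HTMLParser):
    """Set .noindex to True if a meta robots tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = (attributes.get("content") or "").lower()
            if "noindex" in [part.strip() for part in content.split(",")]:
                self.noindex = True


def audit(robots_txt: str, url: str, html: str, agent: str = "*") -> str:
    """Describe how robots.txt and the page's meta robots tag interact."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    finder = NoindexFinder()
    finder.feed(html)
    crawlable = parser.can_fetch(agent, url)
    if finder.noindex and not crawlable:
        return "noindex present but hidden from crawlers that obey robots.txt"
    if finder.noindex:
        return "noindex visible to crawlers"
    if not crawlable:
        return "crawling blocked, no noindex"
    return "crawlable and indexable"


print(audit(
    "User-agent: *\nDisallow: /private/",
    "https://example.com/private/page.html",
    '<head><meta name="robots" content="noindex, nofollow" /></head>',
))
# noindex present but hidden from crawlers that obey robots.txt
```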
Common Mistakes to Avoid
- Blocking via robots.txt instead of using noindex. ➜ Crawlers can’t read meta tags if they can’t access the page.
- Forgetting to remove “noindex” after launch. ➜ Common when staging or testing.
- Wrong placement of meta tag. ➜ Must be inside <head>, not <body>.
- Assuming all crawlers obey robots.txt. ➜ Some bots ignore it entirely (especially scrapers).
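The placement mistake in the list above is easy to catch automatically. This Python sketch counts meta robots tags that appear outside <head>; the class and function names are hypothetical, and it assumes the page has explicit <head> and </head> tags.

```python
from html.parser import HTMLParser


class MetaPlacementChecker(HTMLParser):
    """Count meta robots tags that appear outside the <head> element."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.misplaced = 0

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and (dict(attrs).get("name") or "").lower() == "robots":
            if not self.in_head:
                self.misplaced += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def count_misplaced_robots_tags(html: str) -> int:
    checker = MetaPlacementChecker()
    checker.feed(html)
    return checker.misplaced


good = '<html><head><meta name="robots" content="noindex" /></head><body></body></html>'
bad = '<html><head></head><body><meta name="robots" content="noindex" /></body></html>'
print(count_misplaced_robots_tags(good))  # 0
print(count_misplaced_robots_tags(bad))   # 1
```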
Best Practices Summary
- Use robots.txt for large-scale crawl management.
- Use meta robots tags for fine-grained indexing control.
- Combine both only when necessary.
- Regularly test your configuration using Google Search Console’s URL Inspection Tool.
- Always verify that important pages are indexable and that pages you want kept out of search results carry noindex.