Web Development Blog — Coding, SEO, Domains & CMS Insights

Control Search Engine Indexing with Meta Robots Tags and robots.txt

Written by Niko Yankovsky | October 15, 2025

Meta robots tags and the robots.txt file are essential tools that give you control over how search engines crawl and index your website. Whether you need to block private pages, prevent duplicate content, or optimize crawl budget, understanding how these directives work will help you manage SEO visibility effectively.

What Is a Meta Robots Tag?

The meta robots tag is an HTML element placed in the <head> section of a page. It tells search engines whether and how they should index and follow links on that page.

Example:

```html
<meta name="robots" content="noindex, nofollow">
```

This line instructs all search engines not to index the page and not to follow any links on it.

Common Values for the content Attribute

Here are the most frequently used values and what they mean:

| Value | Description |
| --- | --- |
| `index` | Allow indexing of the page (default). |
| `noindex` | Prevent indexing — the page will not appear in search results. |
| `follow` | Allow search engines to follow links on the page (default). |
| `nofollow` | Prevent search engines from following any links on the page. |
| `noarchive` | Prevent displaying a cached version of the page in search results. |
| `nosnippet` | Disable text snippets and previews in search results. |
| `noimageindex` | Prevent images on the page from appearing in Google Images. |
| `notranslate` | Block translation links in search results. |
| `max-snippet:-1` | No limit on snippet length (you can also set a specific number of characters). |
| `max-image-preview:large` | Allow large image previews in search results. |
| `max-video-preview:-1` | Allow video previews without limitation. |

You can combine multiple directives:

```html
<meta name="robots" content="noindex, follow, noarchive">
```

➡️ This means:
Do not index the page, but do follow the links, and do not store a cached copy.

Targeting Specific Search Engines

You can specify which crawler should follow your rules. For example, only block Googlebot:

```html
<meta name="googlebot" content="noindex">
```

Or only apply rules for Bing:

```html
<meta name="bingbot" content="noindex">
```

If you use name="robots", it applies to all crawlers by default.

Example: Prevent Indexing of a Thank You Page

A classic case — preventing a post-form submission page (like a “Thank you” page) from appearing in search results:

```html
<meta name="robots" content="noindex, nofollow">
```

✅ Google and other search engines will not index this page and will not follow its links.
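The same directives can also be delivered as an HTTP response header, `X-Robots-Tag`, which is useful for non-HTML resources (such as PDFs) where a meta tag cannot be embedded. A minimal sketch for Apache, assuming `mod_headers` is enabled (the file name here is illustrative):

```
# Keep a downloadable "thank you" PDF out of search results
<Files "thank-you.pdf">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
```

Most servers (Nginx, HubSpot's CDN settings, etc.) offer an equivalent way to set custom response headers.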

The Role of robots.txt

The robots.txt file is a text file placed in your site’s root directory (e.g. https://example.com/robots.txt). It defines which areas of your website crawlers are allowed or disallowed to visit.

Basic Example:

```
User-agent: *
Disallow: /admin/
Allow: /
```

Explanation:

  • User-agent: * → applies to all crawlers
  • Disallow: /admin/ → prevents crawlers from accessing /admin/ directory
  • Allow: / → allows crawling of the rest of the site

Blocking a Single Page

```
User-agent: *
Disallow: /private-page.html
```

Now https://example.com/private-page.html will not be crawled.

Allowing Specific Crawlers

You can give different instructions for different bots:

```
User-agent: Googlebot
Disallow: /test/

User-agent: Bingbot
Disallow:
```

Googlebot will not crawl /test/, while Bingbot can crawl everything (an empty Disallow means nothing is blocked).
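You can sanity-check rules like these before deploying them with Python's standard-library robots.txt parser. A minimal sketch (the rules below mirror the example above):

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example: block Googlebot from /test/, allow everyone else
rules = """\
User-agent: Googlebot
Disallow: /test/

User-agent: *
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot is blocked from the /test/ section
print(rp.can_fetch("Googlebot", "https://example.com/test/page.html"))  # False

# Bingbot falls through to the catch-all group and is allowed
print(rp.can_fetch("Bingbot", "https://example.com/test/page.html"))    # True
```

In production you would point the parser at the live file with `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`.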

Important Difference Between robots.txt and Meta Robots

| Aspect | robots.txt | Meta Robots Tag |
| --- | --- | --- |
| Location | Root directory (/robots.txt) | Inside the HTML `<head>` |
| Affects | Crawling (access to pages) | Indexing (appearance in results) |
| Use Case | Restrict crawling of large sections or files | Control indexing of specific pages |
| Visibility | Public | Page-specific |
| Can prevent indexing? | ❌ Not always (blocked URLs can still be indexed by URL reference) | ✅ Yes, reliable for removing pages from search results |

Best Practice:

If you need to prevent indexing, use the meta robots tag (noindex) — not just robots.txt. If you need to prevent crawling, use robots.txt.

Advanced Tip: Combining Both Approaches

You can combine them for stronger control. In robots.txt:

```
User-agent: *
Disallow: /private/
```

And in the HTML of each page under /private/:

```html
<meta name="robots" content="noindex">
```

This ensures:

  • Crawlers won’t visit /private/
  • Even if they do reach a page, it won’t be indexed

Common Mistakes to Avoid

  1. Blocking via robots.txt instead of using noindex. ➜ Crawlers can’t read meta tags if they can’t access the page.
  2. Forgetting to remove “noindex” after launch. ➜ A common leftover when a staging or test site goes live.
  3. Wrong placement of meta tag. ➜ Must be inside <head>, not <body>.
  4. Assuming all crawlers obey robots.txt. ➜ Some bots ignore it entirely (especially scrapers).
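Mistake #3 (a meta tag outside `<head>`) is easy to catch automatically. A minimal sketch using Python's standard-library HTML parser that only records a robots meta tag when it appears inside `<head>` (the class name is illustrative):

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Records the content of <meta name="robots"> only if it is inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.robots_content = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head and a.get("name", "").lower() == "robots":
            self.robots_content = a.get("content")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

finder = RobotsMetaFinder()
finder.feed('<html><head><meta name="robots" content="noindex, nofollow">'
            '</head><body></body></html>')
print(finder.robots_content)  # noindex, nofollow
```

If the tag sits in `<body>`, `robots_content` stays `None` — a quick signal that crawlers will ignore it.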

Best Practices Summary

  • Use robots.txt for large-scale crawl management.
  • Use meta robots tags for fine-grained indexing control.
  • Combine both only when necessary.
  • Regularly test your configuration using Google Search Console’s URL Inspection Tool.
  • Always verify that important pages are indexable and unimportant ones are hidden.

Related Topics (for future internal linking)

  • How to Optimize Your Crawl Budget in 2025
  • Canonical Tags Explained: Prevent Duplicate Content
  • How to Set Up a Custom 404 Page in HubSpot