Robots.txt Guide


The robots.txt file is all about SEO. What you put in robots.txt acts as a "signpost" for search engines: it is the primary way of telling them where they can and cannot go on your website. Although this sounds simple, you should be careful when creating the file, because mistakes in robots.txt can harm your website.

What Is the Robots.txt File?

The robots.txt file, also called the Robots Exclusion Protocol, is a plain-text file with a strict syntax that is read by search engine spiders. These spiders are also known as robots, which is where robots.txt gets its name.

Robots.txt is the first file a search engine requests when it visits your website in order to get guidelines on how to crawl it. The file tells search engine spiders which parts of the site they may crawl and which they must skip. For this to work, the file has to be placed at the root of your domain and it must be named exactly robots.txt. In practice, it looks like this example from Wikipedia: https://wikipedia.org/robots.txt.
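As a minimal sketch, a basic robots.txt might look like the lines below. The domain example.com, the /private/ directory, and the sitemap location are placeholders for illustration, not values taken from this article:

User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml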

How To Block Bots In Robots.txt

If you are using WordPress, you can easily manage your site's robots.txt file with this plugin: https://wordpress.org/plugins/pc-robotstxt/

A robots.txt file is made up of groups: each group starts with one or more "User-agent" lines naming the crawlers it applies to, followed by the Disallow (or Allow) directives that apply to all of them. For instance, if you want to block bots that check your backlinks, such as Ahrefs, Majestic, or SEMrush, you can use the following group in your robots.txt file; the single Disallow: / at the end applies to every bot listed above it:

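# Backlink checkers and site-download tools we do not want crawling the site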
User-agent: Rogerbot
User-agent: Exabot
User-agent: MJ12bot
User-agent: Dotbot
User-agent: Gigabot
User-agent: AhrefsBot
User-agent: BlackWidow
User-agent: ChinaClaw
User-agent: Custo
User-agent: DISCo
User-agent: Download Demon
User-agent: eCatch
User-agent: EirGrabber
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: Express WebPictures
User-agent: ExtractorPro
User-agent: EyeNetIE
User-agent: FlashGet
User-agent: GetRight
User-agent: GetWeb!
User-agent: Go!Zilla
User-agent: Go-Ahead-Got-It
User-agent: GrabNet
User-agent: Grafula
User-agent: HMView
User-agent: HTTrack
User-agent: Image Stripper
User-agent: Image Sucker
User-agent: Indy Library
User-agent: InterGET
User-agent: Internet Ninja
User-agent: JetCar
User-agent: JOC Web Spider
User-agent: larbin
User-agent: LeechFTP
User-agent: Mass Downloader
User-agent: MIDown tool
User-agent: Mister PiX
User-agent: Navroad
User-agent: NearSite
User-agent: NetAnts
User-agent: NetSpider
User-agent: Net Vampire
User-agent: NetZIP
User-agent: Octopus
User-agent: Offline Explorer
User-agent: Offline Navigator
User-agent: PageGrabber
User-agent: Papa Foto
User-agent: pavuk
User-agent: pcBrowser
User-agent: RealDownload
User-agent: ReGet
User-agent: SiteSnagger
User-agent: SmartDownload
User-agent: SuperBot
User-agent: SuperHTTP
User-agent: Surfbot
User-agent: tAkeOut
User-agent: Teleport Pro
User-agent: VoidEYE
User-agent: Web Image Collector
User-agent: Web Sucker
User-agent: WebAuto
User-agent: WebCopier
User-agent: WebFetch
User-agent: WebGo IS
User-agent: WebLeacher
User-agent: WebReaper
User-agent: WebSauger
User-agent: Website eXtractor
User-agent: Website Quester
User-agent: WebStripper
User-agent: WebWhacker
User-agent: WebZIP
User-agent: Wget
User-agent: Widow
User-agent: WWWOFFLE
User-agent: Xaldon WebSpider
User-agent: Zeus
Disallow: /

You can also block these bots at the server level by adding the following to your .htaccess file:

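# Flag known bad bots by matching the User-Agent header (case-insensitive)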
SetEnvIfNoCase User-Agent .*rogerbot.* bad_bot
SetEnvIfNoCase User-Agent .*exabot.* bad_bot
SetEnvIfNoCase User-Agent .*mj12bot.* bad_bot
SetEnvIfNoCase User-Agent .*dotbot.* bad_bot
SetEnvIfNoCase User-Agent .*gigabot.* bad_bot
SetEnvIfNoCase User-Agent .*ahrefsbot.* bad_bot
SetEnvIfNoCase User-Agent .*sitebot.* bad_bot
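# Deny any request made by a client flagged as bad_bot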
<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
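The Order/Allow/Deny directives above are the older Apache 2.2 syntax. If your server runs Apache 2.4 or later, a sketch of the equivalent rule, reusing the same bad_bot variable set by the SetEnvIfNoCase lines, would be:

# Apache 2.4+: allow everyone except clients flagged as bad_bot
<RequireAll>
Require all granted
Require not env bad_bot
</RequireAll>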

What Are the Pros and Cons of Using Robots.txt?

The main advantage of robots.txt is that you can stop search engines from crawling certain pages or sections of your website. If some sections of your site are not search-engine-optimized yet, you can use robots.txt to "tell" search engines not to go there, so they only read the pages they are allowed to crawl. That is why robots.txt plays an important role in obtaining good SEO results.
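For example, assuming a hypothetical /staging/ section that is not ready for search engines yet (the directory name is only an illustration), the rule could look like this:

User-agent: *
Disallow: /staging/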

On the other hand, even though you can tell a search engine where not to go on your site, it may still show a page it is not allowed to crawl in its search results. Since the search engine cannot read the blocked page, the result appears without a description, with a note along the lines of "No information is available for this page". This can hurt your reputation with users, so blocking pages can also be a disadvantage.
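If you need to keep a page out of search results entirely, a commonly used approach is a noindex rule instead of (not in addition to) a robots.txt block, because the crawler has to be able to read the page to see the rule. As a sketch, the page would include a tag like this in its HTML head:

<meta name="robots" content="noindex">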

Test Your Robots.txt File

The best way to validate and test your robots.txt file is with the robots.txt testing tool in Google Search Console. Make sure everything in the file is correct and double-checked before you publish it. This is essential, because a single mistake can block your entire site from being crawled, and that way you will never achieve any kind of SEO result.
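Before running the tester, you can also confirm the file is actually being served from the root of your domain; as a sketch, with example.com as a placeholder for your own domain:

curl https://example.com/robots.txt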

