Why Your Site Needs a Robots.txt File
The robots.txt file sits at the root level of your website and asks spiders and bots to behave themselves when they’re on your site. You can take a look at it by pointing your browser to http://www.yourDrupalsite.com/robots.txt. Think of it as an electronic No Trespassing sign that tells the search engines not to crawl a certain directory or page of your site. Using wildcards, you can even tell the engines not to crawl file types like .jpg and .pdf, which means none of your JPEG images or PDF files will show up in the search engines. I’m not recommending that you do that, just pointing out that you could.
Having a search engine-optimized robots.txt file is very important for SEO. It clearly establishes rules and expectations for Google’s spiders, which directly affects which web pages rank and which don’t.
The robots.txt file is required
On December 1, 2008, John Mueller, a Google analyst, said that if the Googlebot can’t access the robots.txt file (say the server is unreachable or returns a 5xx error code) then it won’t crawl the website at all. In other words, the robots.txt file must be reachable if you want the website to be crawled and indexed by Google.
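As a rough illustration of the rule above, here is a minimal Python sketch that fetches a site's robots.txt and classifies the response. The function names are hypothetical, and treating anything that is neither 2xx nor 5xx as "other" is an assumption of this sketch, since the statement above only covers unreachable servers and 5xx errors.

```python
import urllib.error
import urllib.request


def classify_robots_status(status):
    """Classify an HTTP status for robots.txt (None means the server was unreachable)."""
    if status is None or 500 <= status <= 599:
        return "blocked"  # unreachable or 5xx: the case described above
    if 200 <= status <= 299:
        return "ok"       # the file was served normally
    return "other"        # anything else is outside the statement above


def fetch_robots_status(site_root, timeout=10):
    """Return the HTTP status of <site_root>/robots.txt, or None if unreachable."""
    url = site_root.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None       # DNS failure, refused connection, timeout, and so on
```

To try it against your own site, call classify_robots_status(fetch_robots_status("http://www.yourDrupalsite.com")) and look for anything other than "ok".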
Drupal 6 Robots.txt
Drupal 6 provides a standard robots.txt file that does an adequate job. This file carries instructions for robots and spiders that may crawl your site. However, what if you want to make changes to that file? Here is what you do...
Editing your robots.txt file
There may be a few times throughout your website’s SEO campaign that you’ll need to make changes to your robots.txt file. This section provides the necessary steps to make each change.
1. Check to see if your robots.txt file is there and available to visiting search bots. Open your browser and visit the following link: http://www.yourDrupalsite.com/robots.txt.
2. Using your FTP program or command line editor, navigate to the top level of your Drupal website and locate the robots.txt file.
3. Make a backup of the file.
4. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor tool.
5. Most directives in the robots.txt file are grouped under a User-agent: line. If you are going to give different instructions to different engines, be sure to place them above the User-agent: * section, as some search engines will only read the directives for * if their specific instructions come after that section.
6. Add the lines you want.
7. Save your robots.txt file, uploading it if necessary, replacing the existing file.
8. Point your browser to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
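If you would rather script the backup and the edit than do them by hand, the steps above can be sketched in Python roughly as follows. The function names, the .bak suffix, and the ExampleBot agent in the usage note are all hypothetical; the sketch only handles the common case of inserting an engine-specific group above the User-agent: * line.

```python
import shutil
from pathlib import Path


def insert_before_wildcard_group(robots_text, new_group):
    """Insert new_group (a string of directives) above the 'User-agent: *' line.

    If the file has no wildcard group, new_group is appended at the end instead.
    """
    lines = robots_text.splitlines()
    block = new_group.strip().splitlines() + [""]  # blank line separates groups
    for index, line in enumerate(lines):
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent" and value.strip() == "*":
            return "\n".join(lines[:index] + block + lines[index:]) + "\n"
    return "\n".join(lines + [""] + block).rstrip("\n") + "\n"


def add_group_to_file(path, new_group):
    """Back up the robots.txt at path, then write the updated version in place."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))  # step 3: backup
    updated = insert_before_wildcard_group(
        path.read_text(encoding="utf-8"), new_group
    )
    path.write_text(updated, encoding="utf-8")
```

For example, add_group_to_file("robots.txt", "User-agent: ExampleBot\nDisallow: /search/") would leave a robots.txt.bak next to the edited file. You still need to upload the result and check it in your browser.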
Problems with the default Drupal robots.txt file
There are several problems with the default Drupal robots.txt file. If you use the robots.txt testing utility in Google Webmaster Tools to test each line of the file, you'll find that a lot of paths that look like they're being blocked will actually be crawled. The reason is that Drupal does not require the trailing slash (/) after the path to show you the content. Because of the way robots.txt files are parsed, Googlebot will avoid the page with the slash but crawl the page without the slash.
For example, /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed. But enter http://www.yourDrupalsite.com/admin (without the trailing slash) and you'll see that it is allowed. Disaster! Fortunately, this is relatively easy to fix.
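You can reproduce the same behavior offline with the robots.txt parser in Python's standard library. This is only a sketch: urllib.robotparser uses simple prefix matching and is not Google's parser, so treat it as an approximation of what the testing utility reports.

```python
from urllib.robotparser import RobotFileParser

# A cut-down version of the default rule discussed above.
rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.yourDrupalsite.com"
with_slash = parser.can_fetch("Googlebot", site + "/admin/")
without_slash = parser.can_fetch("Googlebot", site + "/admin")

print("/admin/ may be crawled:", with_slash)     # False: the rule matches
print("/admin  may be crawled:", without_slash)  # True: the rule does not match
```

The rule /admin/ is a prefix of the path /admin/ but not of /admin, which is why the second URL slips through.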
Fixing the Drupal robots.txt file
Carry out the following steps in order to fix the Drupal robots.txt file:
1. Make a backup of the robots.txt file.
2. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor.
3. Find the Paths (clean URLs) section and the Paths (no clean URLs) section. Note that both sections appear whether you've turned on clean URLs or not. Drupal covers you either way. They look like this:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
4. Duplicate the two sections (simply copy and paste them) so that you have four sections: two # Paths (clean URLs) sections and two # Paths (no clean URLs) sections.
5. Add 'fixed!' to the comment of the new sections so that you can tell them apart.
6. Delete the trailing / after each Disallow line in the fixed! sections. You should end up with four sections that look like this:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
# Paths (clean URLs) – fixed!
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs) – fixed!
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login
7. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).
8. Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to do a refresh on your browser to see the changes.
Now your robots.txt file is working as you would expect it to.
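If you have many Disallow lines, the copy-and-trim steps above can be sketched as a small Python helper. The function name is hypothetical, and the sketch uses a plain hyphen before fixed! in the comment line to stay ASCII-only.

```python
def add_slashless_rules(section):
    """Return section followed by a 'fixed!' copy without trailing slashes."""
    fixed = []
    for line in section.strip().splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Mark the duplicated comment so the two copies can be told apart.
            fixed.append(stripped + " - fixed!")
        elif stripped.lower().startswith("disallow:"):
            path = stripped.split(":", 1)[1].strip()
            # Leave a bare "/" alone; otherwise drop one trailing slash.
            if len(path) > 1 and path.endswith("/"):
                path = path[:-1]
            fixed.append("Disallow: " + path)
        else:
            fixed.append(stripped)
    return section.strip() + "\n" + "\n".join(fixed) + "\n"


section = """\
# Paths (clean URLs)
Disallow: /admin/
Disallow: /node/add/
"""
print(add_slashless_rules(section))
```

Run it once on each of the two Paths sections and paste the output back into your robots.txt file in place of the originals.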
Additional changes to the robots.txt file
Using directives and pattern matching commands, the robots.txt file can exclude entire sections of the site from the crawlers, such as the admin pages, certain individual files like cron.php, and some directories like /scripts and /modules.
In many cases, though, you should tweak your robots.txt file for optimal SEO results. Here are several changes you can make to the file to meet your needs in certain situations:
• You are developing a new site and you don't want it to show up in any search engine until you're ready to launch it. Add Disallow: / just after the User-agent: * line.
• Say you're running a very slow server and you don't want the crawlers to slow your site down for other users. Adjust the Crawl-delay by changing it from 10 to 20.
• If you're on a super-fast server (and you should be, right?) you can tell the bots to bring it on! Change the Crawl-delay to 5 or even 1 second. Monitor your server closely for a few days to make sure it can handle the extra load.
• Say you're running a site which allows people to upload their own images but you don't necessarily want those images to show up in Google. Add these lines at the bottom of your robots.txt file:
User-agent: Googlebot-Image
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.png$
If all of the files were in the /files/users/images/ directory, you could do this:
User-agent: Googlebot-Image
Disallow: /files/users/images/
• Say you noticed in your server logs that there was a bad robot out there that was scraping all your content. You can try to prevent this by adding this to the bottom of your robots.txt file:
User-agent: Bad-Robot
Disallow: /
• If you have installed the XML Sitemap module, then you've got a great tool that you should send out to all of the search engines. However, it's tedious to go to each engine's site and upload your URL. Instead, you can add a couple of simple lines to the robots.txt file. For more information, scroll down to the section called Adding your XML Sitemap to the robots.txt file, later in this post.
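To see what wildcard rules like the image examples above would match, here is a minimal Python sketch of the pattern matching. It assumes the commonly documented behavior, where * matches any run of characters and a trailing $ anchors the end of the URL. The function names are hypothetical, and the sketch ignores Allow lines and rule precedence.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a wildcard path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and rejoin them with '.*' in place of each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + (r"\Z" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the start of path."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


image_rules = ["/*.jpg$", "/*.gif$", "/*.png$"]
print(is_blocked("/files/users/images/cat.jpg", image_rules))     # True
print(is_blocked("/files/users/images/cat.jpg?x=1", image_rules)) # False
print(is_blocked("/node/1", image_rules))                         # False
```

Note the second result: because of the $ anchor, an image URL with a query string is not matched by these rules, which is one reason the directory-based rule can be the simpler choice.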
Robots.txt is a request, not a command
Do not expect a directive to be strictly obeyed just because you put it in the robots.txt file. Rogue spiders and bots often ignore your requests. The major search engines are highly unlikely to ignore them, but it can, and does, happen. With this in mind, if you really want to hide sensitive documents from the rest of the world, put them behind a password-protected section of your site.
Adding your XML Sitemap to the robots.txt file
Another way that the robots.txt file helps you optimize your Drupal site for search engines is by allowing you to specify where your sitemaps are located. While you probably want to submit your sitemap directly to Google, Yahoo!, and MSN, it's a good idea to put a reference to it in the robots.txt file for all of those other search engines. You can do this by carrying out the following steps:
1. Open the robots.txt file for editing.
2. The sitemap directive is independent of the User-agent line, so it doesn't matter where you place it in your robots.txt file.
• To keep things neat, add this line first:
# Sitemaps
Add these lines for your XML sitemap:
Sitemap: http://www.yourDrupalsite.com/sitemap.xml
Sitemap: http://www.yourDrupalsite.com/?q=sitemap.xml
If you're using the URL list sitemap instead, add these lines:
Sitemap: http://www.yourDrupalsite.com/urllist.txt
Sitemap: http://www.yourDrupalsite.com/?q=urllist.txt
3. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).
4. Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
Please Note: If you have an XML sitemap, use it. If not, use the URL list sitemap. However, do not add both an XML sitemap and a URL list sitemap to the robots.txt file. It could confuse the search engines and possibly even cause duplicate content on your site. Also, do not add your visitor-facing sitemap to your robots.txt file.
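As a quick sanity check on the finished file, the following Python sketch lists the Sitemap lines a parser would pick up and flags the both-formats mistake described in the note above. It assumes Python 3.8 or later, where urllib.robotparser provides site_maps(); the function names and the extension-based check are hypothetical simplifications.

```python
from urllib.robotparser import RobotFileParser


def list_sitemaps(robots_text):
    """Return the Sitemap URLs declared in robots_text (an empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.site_maps() or []


def mixes_sitemap_formats(sitemap_urls):
    """Return True if both an XML sitemap and a URL list sitemap are declared."""
    has_xml = any(url.endswith(".xml") for url in sitemap_urls)
    has_txt = any(url.endswith(".txt") for url in sitemap_urls)
    return has_xml and has_txt


robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/

# Sitemaps
Sitemap: http://www.yourDrupalsite.com/sitemap.xml
"""

found = list_sitemaps(robots_txt)
print(found)
print("Both formats listed:", mixes_sitemap_formats(found))
```

Because the Sitemap directive does not belong to any User-agent group, the parser finds it wherever it sits in the file.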