March 30, 2011

Regarding robots.txt from an SEO Perspective

Robots.txt files are one of the ways an SEO company may attempt to prevent certain pages from being indexed. For example, if someone wanted to disallow the /pages/ directory, they would insert the following into their robots.txt file:
User-agent: *
Disallow: /pages/
Sitemap: http://www.example.com/sitemap.xml
While this is one of the correct uses of the file, there are some things you should be aware of regarding robots.txt from an SEO perspective.

Losing Link Juice

One of the most important ranking factors in SEO, if not the most important, is link popularity. Essentially, when you implement a Disallow through robots.txt on a section or page of your website, you hurt that portion's potential to transfer link authority to other areas of your site.
For example, say you disallow your login page; maybe you figure it does not target specific keywords for SEO, so there is no reason for it to be in the index. I am not saying I agree with this, but it could be your train of thought. The weight from external links pointing at that page will not be funneled through the text links on it, because Google does not crawl the text on a disallowed page. Here is a quote from Google Webmaster Central on the subject:
“While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.”
Here we see that Google will not crawl the content on the page but may still index the URL. We will talk about the URL and offsite content keeping the page in the index in a moment. But first, let's quickly wrap up this point on losing link juice. The main point is that when robots.txt is used to block a page on your site, external links pointing at the disallowed page cannot easily transfer their authority to other areas of your website, because Google never crawls the blocked page and therefore never sees its internal links.

Pages May Still Appear

Now back to our point on pages staying in the index although they have been blocked. According to the Google Code FAQ page on robots.txt, pages may still appear even though they are disallowed in a robots.txt file.
“Blocking Google from crawling a page is likely to decrease that page’s ranking or cause it to drop out altogether over time. It may also reduce the amount of detail provided to users in the text below the search result. This is because without the page’s content, the search engine has much less information to work with.”
Google goes on to further this idea, stating:
“However, robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.”

Using the noindex Directive

As Google states in the quote above, in some cases there is a better option for keeping pages out of the index than the robots.txt file: the noindex robots meta tag or the X-Robots-Tag HTTP header.
noindex Meta Tag
<meta name="robots" content="noindex">
X-Robots-Tag HTTP Header
X-Robots-Tag: noindex
Let's talk about the noindex meta tag first.
According to Google, “By default, Googlebot will index a page and follow links to it. So there’s no need to tag pages with content values of INDEX or FOLLOW.”
Many SEO companies will recommend the following meta tag:
<meta name="robots" content="noindex, follow">
While it won't hurt to implement this, it is not needed; Google will still follow links if the following meta tag is used instead.
<meta name="robots" content="noindex">
This piece of code tells the search engines two things. First, "noindex" says do not include this page in your index. Because it sits in the head of the very page the search engine is crawling, it makes it quite clear that the page should be left out of search results. Second, it tells the search engines that it is OK to follow the links on the page. So even though the page will not be included in the index, external link juice will still pass into the page and flow on to other pages through the followed links. If you have an SEO-friendly global navigation, for example, those pages will be afforded link weight corresponding to their interlinking relationship with the page.
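To illustrate placement, here is a minimal sketch of a page carrying the tag in its head (the page title and content here are hypothetical):
<html>
<head>
<title>Account Login</title>
<meta name="robots" content="noindex">
</head>
<body>
<!-- page content and navigation links go here -->
</body>
</html>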
According to Google, here is a full list of Robots Meta Tag content values.
  • NOINDEX – prevents the page from being included in the index.
  • NOFOLLOW – prevents Googlebot from following any links on the page. (Note that this is different from the link-level NOFOLLOW attribute, which prevents Googlebot from following an individual link.)
  • NOARCHIVE – prevents a cached copy of this page from being available in the search results.
  • NOSNIPPET – prevents a description from appearing below the page in the search results, as well as prevents caching of the page.
  • NOODP – blocks the Open Directory Project description of the page from being used in the description that appears below the page in the search results.
  • NONE – equivalent to “NOINDEX, NOFOLLOW”
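These values can also be combined in a single tag, separated by commas. For example, the following (hypothetical) tag blocks both the cached copy and the snippet:
<meta name="robots" content="noarchive, nosnippet">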
As we can see, these content values and their intended purpose are pretty clear. So with this many options, when would you consider the X-Robots-Tag?

Why the X-Robots-Tag Header is Important

According to a Google blog post on the subject, these are the uses for the X-Robots-Tag HTTP header:
  • X-Robots-Tag: noarchive, nosnippet – don’t display a cache link or snippet for this item in the Google search results
  • X-Robots-Tag: noindex – don’t include this document in the Google search results
  • X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT – Tell us that a document will be unavailable after 7th July 2007, 4:30pm GMT
Google mentions that you can combine multiple directives in the same document. Here is an example they provide.
Do not show a cached link for this document, and remove it from the index after 23rd July 2011, 3pm PST:
X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 23 Jul 2011 15:00:00 PST
Perhaps you have already recognized the value of this header after reading these key uses. The X-Robots-Tag gives you the ability to make a piece of content unavailable in search results after a certain date. In addition, it lets you apply noindex, noarchive, and nosnippet, just as the aforementioned robots meta tag does.
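Because the X-Robots-Tag is sent as an HTTP response header rather than placed in the page markup, it is set at the server level, which also makes it usable for non-HTML files such as PDFs. A minimal sketch, assuming an Apache server with mod_headers enabled (the PDF pattern is only an example):
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>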

Important Points

It can be easier to use robots.txt as opposed to inserting a noindex tag on every page. It is always important to keep your workload and time constraints in mind when making these types of website management decisions. However, also make sure to consider your long-term website goals.
If you are using robots.txt to block duplicate content, keep in mind there are other options, many of which can be positive from an SEO perspective: namely, a 301 redirect or a canonical tag. If you are dealing with duplicate content on a subdomain, you will want to address it on a case-by-case basis, based on your vision for the site.
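For reference, a canonical tag is a single line in the head of the duplicate page, and a 301 redirect can be handled at the server level. A quick sketch, assuming Apache and hypothetical URLs:
Canonical tag (in the head of the duplicate page)
<link rel="canonical" href="http://www.example.com/original-page/">
301 redirect (Apache, in .htaccess or the server config)
Redirect 301 /duplicate-page/ http://www.example.com/original-page/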
When you block affiliate links with robots.txt, you create a dead end for link juice. In most cases it is better to simply use a canonical tag in this situation, which allows you to take advantage of any external links that may be pointing at that section.

Written By: John E Lincoln
