Common Search Engine Protocol-SEO
1. Sitemaps
Think of a sitemap as a list of files that give hints to the search engines on how they can crawl your website. Sitemaps help search engines find and classify content on your site that they may not have found on their own. Sitemaps also come in a variety of formats and can highlight many different types of content, including video, images, news and mobile.
You can read the full details of the protocols at Sitemaps.org. In addition, you can build your own sitemaps at XML-Sitemaps.com. Sitemaps come in three varieties:
2. XML
Extensible Markup Language (Recommended Format)
This is the most widely accepted format for sitemaps. It is extremely easy for search engines to parse and can be produced by a plethora of sitemap generators. Additionally, it allows for the most granular control of page parameters.
The drawback is relatively large file sizes: since XML requires an opening tag and a closing tag around each element, files can get very large.
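As a rough illustration of the XML format, the sketch below builds a small sitemap with Python's standard library. The page URLs and dates are made up for the example, and build_sitemap is a hypothetical helper name; only the namespace URI and the urlset, url, loc and lastmod elements come from the Sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return an XML sitemap string for an iterable of (url, lastmod) pairs."""
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, used only for illustration.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", "2024-01-01"),
    ("http://www.example.com/about", "2024-01-15"),
])
print(sitemap_xml)
```

Every page costs an opening and a closing tag for url, loc and lastmod, which is where the file-size overhead comes from.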
3. Robots.txt
The robots.txt file, a product of the Robots Exclusion Protocol, is a file stored in a website's root directory (e.g., www.google.com/robots.txt). The robots.txt file gives instructions to automated web crawlers visiting your site, including search spiders.
By using robots.txt, webmasters can indicate to search engines which areas of a site they would like to disallow bots from crawling as well as indicate the locations of sitemap files and crawl-delay parameters.
You can read more details about this at the robots.txt Knowledge Center page.
The following commands are available:
Disallow
Prevents compliant robots from accessing specific pages or folders.
Sitemap
Indicates the location of a website’s sitemap or sitemaps.
Crawl-delay
Indicates how long (in seconds, as interpreted by the crawlers that honor it) a robot should wait between successive requests to a server.
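To see how a compliant robot might read these commands, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the /private/ path, the delay value of 10 and the "ExampleBot" user agent are all assumptions made up for the example; site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for www.example.com (paths and values are made up).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Disallow: a compliant robot skips anything under /private/.
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

# Crawl-delay and Sitemap values as declared in the file.
delay = parser.crawl_delay("ExampleBot")
sitemaps = parser.site_maps()

print(blocked, allowed, delay, sitemaps)
```

Nothing here is enforced by the server: the file only states preferences, and it is up to each robot to check them before requesting a page.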
Warning: Not all web robots follow robots.txt. People with bad intentions (e.g., e-mail address scrapers) build bots that don’t follow this protocol and in extreme cases can use it to identify the location of private information. For this reason, it is recommended that the location of administration sections and other private sections of publicly accessible websites not be included in the robots.txt. Instead, these pages can utilize the meta robots tag (discussed next) to keep the major search engines from indexing their high-risk content.
4. Meta Robots
The meta robots tag creates page-level instructions for search engine bots.
The meta robots tag should be included in the head section of the HTML document.
An Example of Meta Robots
<head>
<meta name="robots" content="NOINDEX, NOFOLLOW">
</head>
<body>Hello World</body>
In the example above, “NOINDEX, NOFOLLOW” tells robots not to include the given page in their indexes, and also not to follow any of the links on the page.
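The following sketch shows one way a crawler could read these page-level instructions, using Python's standard html.parser module. MetaRobotsParser and the sample page are hypothetical; real search engines apply their own parsing rules, so treat this as an illustration of the idea rather than a description of any engine's behavior.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # The name and the directive values are compared case-insensitively.
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# Hypothetical page used only for illustration.
PAGE = """<html>
<head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
</head>
<body>Hello World</body>
</html>"""

parser = MetaRobotsParser()
parser.feed(PAGE)

may_index = "noindex" not in parser.directives
may_follow = "nofollow" not in parser.directives
print(parser.directives, may_index, may_follow)
```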
5. Rel="Nofollow"
Remember how links act as votes? The rel=nofollow attribute allows you to link to a resource while removing your "vote" for search engine purposes. Literally, "nofollow" tells search engines not to follow the link, although some engines still follow such links to discover new pages. These links certainly pass less value (and in most cases none) than their followed counterparts, but they are useful in situations where you link to an untrusted source.
An Example of nofollow
<a href="http://example.com" rel="nofollow">Example Link</a>
In the example above, the value of the link would not be passed to example.com as the rel=nofollow attribute has been added.
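As a small sketch of how a link-analysis tool might treat this attribute, the code below separates followed links from nofollowed ones using Python's standard html.parser module. LinkCollector, the anchor texts and the second URL (www.example.org) are assumptions introduced for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Sorts <a href> targets by whether rel contains "nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


# Hypothetical markup used only for illustration.
SNIPPET = """
<a href="http://example.com" rel="nofollow">Untrusted source</a>
<a href="http://www.example.org/">Trusted source</a>
"""

collector = LinkCollector()
collector.feed(SNIPPET)
print(collector.followed, collector.nofollowed)
```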
6. Rel="canonical"
Often, two or more copies of the exact same content appear on your website under different URLs. For example, the following URLs can all refer to a single homepage:
- http://www.example.com/
- http://www.example.com/default.asp
- http://example.com/
- http://example.com/default.asp
- http://Example.com/Default.asp
To search engines, these appear as five separate pages. Because the content is identical on each page, this can cause the search engines to devalue the content and its potential rankings.
The canonical tag solves this problem by telling search robots which page is the singular "authoritative" version which should count in web results.
An Example of rel="canonical" for the URL http://example.com/default.asp
<head>
<link rel="canonical" href="http://www.example.com">
</head>
<body>Hello World</body>
In the example above, rel=canonical tells robots that this page is a copy of http://www.example.com and that they should treat the latter URL as the canonical version.
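To make the effect concrete, the sketch below reads the canonical link from a page and shows the five duplicate homepage URLs collapsing to a single address. CanonicalParser and canonical_url are hypothetical helpers, and falling back to the fetched URL when no canonical tag is present is an assumption of this example, not a documented rule of any search engine.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_tokens and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_url(html, fetched_url):
    """Return the declared canonical URL, or the fetched URL if none is declared."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical or fetched_url


# The same hypothetical page is assumed to be served at every duplicate URL.
PAGE = """<html>
<head>
<link rel="canonical" href="http://www.example.com">
</head>
<body>Hello World</body>
</html>"""

DUPLICATES = [
    "http://www.example.com/",
    "http://www.example.com/default.asp",
    "http://example.com/",
    "http://example.com/default.asp",
    "http://Example.com/Default.asp",
]

resolved = {canonical_url(PAGE, url) for url in DUPLICATES}
print(resolved)
```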