Difference between revisions of "Robots Are Our Friends"

From Code4Lib
[[Image:Friendly-Robut.png|right]]

For a variety of reasons, cultural heritage organizations often have [http://www.robotstxt.org/ robots.txt] files that restrict what web crawlers (aka robots) can see on a website. This is a bad thing: content that libraries, archives and museums put online becomes virtually invisible to search engines like Google, Bing and Yahoo, is less likely to be shared on social media sites like Facebook, Twitter, Flickr and Pinterest, and stands less of a chance of being used on educational sites like Wikipedia. The Robots Are Our Friends campaign aims to promote an understanding of the role that robots.txt plays in determining the footprint our cultural heritage collections have on the Web.
  
 
== Background ==  
 
  
 
== Typical Reasons for not allowing Robots ==
 
* While indexing a dynamic site, robots can put extra strain on the server, causing slow responses or, in some cases, pegging the CPU at 100%.
* Some content is intentionally shielded from search engines to help shape how a website's resources are presented in search results. For example, an organization may have put a lot of PDFs online and not want them to turn up in search results.
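
The second reason can be illustrated with a short Python sketch. It assumes a hypothetical robots.txt that blocks a <tt>/pdfs/</tt> directory, a made-up crawler name (<tt>ExampleBot</tt>) and example.org URLs; none of these come from a real site. It uses the standard library's <tt>urllib.robotparser</tt> to show which URLs a well-behaved crawler would skip.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shields one directory of PDFs from all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdfs/
"""


def allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The PDF sits under the blocked path; the item page does not.
    print(allowed(ROBOTS_TXT, "http://example.org/pdfs/finding-aid.pdf"))
    print(allowed(ROBOTS_TXT, "http://example.org/collections/item/42"))
```

Everything under the blocked path drops out of a compliant crawler's view, so rules like this are probably best kept as narrow as the goal allows.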
  
 
== Throttling ==
 

=== crawl-delay ===

Several major search engines support the [https://en.wikipedia.org/wiki/Robots.txt#Crawl-delay_directive Crawl-delay directive], which you can put in your robots.txt file. This directive tells web crawlers the minimum delay to wait between two successive requests.

<pre>
User-agent: *
Crawl-delay: 3
</pre>
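
To see the directive from the crawler's side, here is a minimal Python sketch that reads it back with the standard library's <tt>urllib.robotparser</tt>. The crawler name <tt>ExampleBot</tt> and the helper <tt>read_crawl_delay</tt> are hypothetical, and real crawlers may interpret the value differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
"""


def read_crawl_delay(robots_txt, agent="ExampleBot"):
    """Return the Crawl-delay that applies to `agent`, or None if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)  # available in Python 3.6 and later


if __name__ == "__main__":
    delay = read_crawl_delay(ROBOTS_TXT)
    print("Wait between requests:", delay)
```

A polite crawler would pause for at least this long between requests to the same host, for example with <tt>time.sleep(delay)</tt>.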

=== Google ===
# Log in to [https://www.google.com/webmasters/tools Google Webmaster Tools]
# Select your site
# On the left-hand side, select '''Configuration''' and then '''Settings'''
# Under '''Crawl rate''', select '''Limit Google's maximum crawl rate'''
# Use the slider to adjust the number of requests per second (and the corresponding number of seconds between requests)
# Click '''Save'''

Note: This may take a day or two to take effect, and '''it only lasts for 90 days''', after which Google goes back to selecting the crawl rate itself.

=== Bing ===
# Log in to [http://bing.com/webmaster Bing Webmaster Tools]
# Select your site
# On the left-hand side, select '''Configure My Site''' and then '''Crawl Control'''
# Use the 2D chart to change the crawl rate for specific times of day, or use the drop-down menu to move crawl activity to off-peak hours

Note: Directives in robots.txt (e.g. <tt>Crawl-delay</tt>) will override this setting.
  
 
== Sitemaps ==
 
"Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site." -- www.sitemaps.org

The full specification is available at [http://www.sitemaps.org/ sitemaps.org].

=== Creating a sitemap ===
[http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators List of sitemaps.org-compliant sitemap generators], including ones for Drupal and WordPress.

Some sitemap generators specific to library software exist:
* [https://github.com/bibliotechy/cdm-sitemaps CONTENTdm] (locally hosted instances only)
* [http://vufind.org/wiki/search_engine_optimization#sitemaps VuFind]
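
Where none of the existing generators fits, a basic sitemap is simple enough to produce with a short script. The following is a minimal sketch rather than a complete generator: the helper <tt>build_sitemap</tt> and the example.org URLs are hypothetical, and it writes only the <tt>loc</tt> and optional <tt>lastmod</tt> elements from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from an iterable of (url, lastmod) pairs.

    `lastmod` is a date string such as "2012-11-17", or None to omit it.
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        if lastmod:
            ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages; a real script would pull these from the site's database.
    print(build_sitemap([
        ("http://example.org/", "2012-11-17"),
        ("http://example.org/collections/item/42", None),
    ]))
```

The returned string can be saved as UTF-8 to a file such as <tt>sitemap.xml</tt> and served from the site.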

=== Telling search engines about your sitemap ===
Sitemaps can be listed in your '''robots.txt''' or submitted directly through Google Webmaster Tools or Bing.

==== robots.txt ====
Add a line like this to your robots.txt file:

<pre>
Sitemap: http://yourdomain.tld/path/to/sitemap.xml
</pre>
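
One way to confirm that the line is picked up is to parse the file the way a crawler would. This is a small sketch using <tt>urllib.robotparser</tt> from the Python standard library; the helper <tt>listed_sitemaps</tt> is hypothetical, the surrounding rules are only an example, and <tt>site_maps()</tt> requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: allow everything and advertise one sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://yourdomain.tld/path/to/sitemap.xml
"""


def listed_sitemaps(robots_txt):
    """Return the sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in listed_sitemaps(ROBOTS_TXT):
        print("Sitemap advertised:", url)
```

An empty list here would suggest the <tt>Sitemap:</tt> line is missing or misspelled.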

==== Google ====
# Log in to [https://www.google.com/webmasters/tools Google Webmaster Tools]
# Select your site
# On the left-hand side, select '''Optimization''' and then '''Sitemaps'''
# Click '''Add/Test Sitemap'''
# Enter the path to your sitemap on your server and click '''Submit Sitemap'''
  
 
== HTML5 Microdata ==
 
== Services ==

* [http://jronallo.github.com/blog/dpla-strawman-technical-proposal/ Collection Achievements and Profiles System and DPLA Crawler Services]
* [https://www.google.com/webmasters/tools Google Webmaster Tools]
* Python's [http://docs.python.org/2/library/robotparser.html robotparser]
  
 
== Form Letter ==
 

Latest revision as of 12:37, 17 November 2012
