
Robots Are Our Friends

1,804 bytes added, 12:37, 17 November 2012
Typical Reasons for not allowing Robots
[[Image:Friendly-Robut.png|right]] For a variety of reasons, cultural heritage organizations often have robots.txt documents that restrict what web crawlers (also known as robots) can see on a website. This is a problem because the content that libraries, archives and museums put online becomes virtually invisible to search engines like Google, Bing and Yahoo, is less likely to be shared on social media sites like Facebook, Twitter, Flickr and Pinterest, and stands less chance of being incorporated into datasets such as those used by educational sites like Wikipedia. The Robots Are Our Friends campaign aims to promote an understanding of the role that robots.txt plays in determining the footprint our cultural heritage collections have on the Web.
== Background ==
* While indexing a dynamic site, robots can put an extra strain on the server, causing a slow response, or in some cases, pegging the CPU at 100%.
* Some content is intentionally shielded from search engines to help shape how a website's resources are presented in search results. For example, an organization that has put a lot of PDFs online may not want those to turn up in search results.
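
The effect of a given robots.txt can be checked locally before it is published. The following is a minimal sketch using Python's standard <code>urllib.robotparser</code> module. The two robots.txt bodies, the <code>yourdomain.tld</code> URLs and the <code>is_allowed</code> helper are illustrative assumptions, and real crawlers may interpret rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts every crawler out of the whole site.
RESTRICTIVE = """\
User-agent: *
Disallow: /
"""

# Hypothetical alternative that only keeps crawlers out of a PDF directory.
SELECTIVE = """\
User-agent: *
Disallow: /pdfs/
"""


def is_allowed(robots_txt, url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    item = "http://yourdomain.tld/collections/item/42"
    pdf = "http://yourdomain.tld/pdfs/report.pdf"
    print("restrictive, item page:", is_allowed(RESTRICTIVE, item))
    print("selective, item page:  ", is_allowed(SELECTIVE, item))
    print("selective, PDF:        ", is_allowed(SELECTIVE, pdf))
```

The selective variant addresses the PDF concern above without hiding the rest of the collection from search engines.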
== Throttling ==
== Sitemaps ==
"Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site." --
The full specification is available at []
=== Creating a sitemap ===
Sitemap generators are available for many platforms, including Drupal and WordPress.
Sitemap generators also exist for specific library software:
* CONTENTdm (locally hosted instances only)
* VuFind
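
Where no ready-made generator fits, a small sitemap can be produced by hand. The sketch below uses Python's standard <code>xml.etree.ElementTree</code> module and the element names defined by the Sitemaps protocol (<code>loc</code>, <code>lastmod</code>, <code>changefreq</code>, <code>priority</code>). The <code>build_sitemap</code> function, the page list and the <code>yourdomain.tld</code> URLs are hypothetical, and a real site would pull its URLs from its own catalog or database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (version 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (bytes) for (loc, lastmod, changefreq, priority) tuples."""
    # Serialize the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc, lastmod, changefreq, priority in pages:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
        ET.SubElement(url, "{%s}changefreq" % SITEMAP_NS).text = changefreq
        ET.SubElement(url, "{%s}priority" % SITEMAP_NS).text = priority
    body = ET.tostring(urlset, encoding="utf-8")
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages; the values are placeholders.
    pages = [
        ("http://yourdomain.tld/", "2012-11-17", "weekly", "1.0"),
        ("http://yourdomain.tld/collections/item/42", "2012-10-01", "yearly", "0.5"),
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

Only <code>loc</code> is required by the protocol; the other three elements are optional hints, so a simpler generator could leave them out.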
=== Telling Search Engines about your sitemap ===
A sitemap can be referenced from your '''robots.txt''' file, or submitted directly to Google Webmaster Tools or Bing.
==== robots.txt ====
Add a line like this to your robots.txt file:
Sitemap: http://yourdomain.tld/path/to/sitemap.xml
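
To confirm that the line is picked up, the robots.txt file can be parsed the way a crawler might read it. This is a minimal sketch assuming Python 3.8 or later, where <code>RobotFileParser.site_maps()</code> is available. The sample robots.txt body and the <code>declared_sitemaps</code> helper are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything may be crawled, and a sitemap is declared.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://yourdomain.tld/path/to/sitemap.xml
"""


def declared_sitemaps(robots_txt):
    """Return the list of sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for sitemap_url in declared_sitemaps(ROBOTS_TXT):
        print(sitemap_url)
```

To check a live site, the same parser can instead be pointed at the published file with <code>set_url()</code> followed by <code>read()</code>.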
==== Google ====
# Log in to Google Webmaster Tools
# Select your site
# On the left-hand side, select '''Optimization''' and then '''Sitemaps'''
# Click '''Add/Test Sitemap'''
# Enter the path to your sitemap on your server and click '''Submit Sitemap'''
== HTML5 Microdata ==
