Robots Are Our Friends

For a variety of reasons, cultural heritage organizations often have robots.txt files that restrict what web crawlers (aka robots) can see on a website. This is a bad thing: the content that libraries, archives and museums put online becomes virtually invisible to search engines like Google, Bing and Yahoo, is less likely to be shared on social media sites like Facebook, Twitter and Pinterest, and stands less chance of being incorporated into datasets such as Wikipedia. The Robots Are Our Friends campaign aims to promote an understanding of the role that robots.txt plays in determining the footprint our cultural heritage collections have on the Web.
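To see what a restrictive robots.txt actually does, here is a minimal sketch using Python's standard urllib.robotparser module. The two robots.txt bodies, the example.org URLs and the can_crawl helper are all hypothetical, made up for illustration. They are not taken from any real institution's site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts every crawler out of the whole site.
RESTRICTIVE = """\
User-agent: *
Disallow: /
"""

# Hypothetical alternative that only keeps crawlers out of an admin area.
PERMISSIVE = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical URL of a digitized collection item.
ITEM_URL = "https://example.org/collections/item/42"


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print("restrictive:", can_crawl(RESTRICTIVE, ITEM_URL))
    print("permissive: ", can_crawl(PERMISSIVE, ITEM_URL))
```

Under the first file the collection item is off limits to every crawler. Under the second, only the /admin/ path is hidden and the collection stays visible.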

Background

https://twitter.com/rubinsztajn/status/265908774810824704

Typical Reasons for not allowing Robots

While indexing a dynamic site, robots can put an extra strain on the server, causing a slow response, or in some cases, pegging the CPU at 100%.

Throttling

Google

Log in to Google Webmaster Tools
Select your site
On the left-hand side, select Configuration and then Settings
Under Crawl rate select Limit Google's maximum crawl rate
Use the slider to adjust the number of requests per second (or the number of seconds between requests)
Hit Save

Note: This may take a day or two to go into effect, and it only lasts for 90 days, after which Google goes back to selecting the crawl rate itself.

Bing

Log in to Bing Webmaster Tools
Select your site
On the left-hand side, select Configure My Site and then Crawl Control
Use the 2D chart to change the crawl rate for specific times of day, or use the drop-down menu to move crawl activity to off-peak hours

Note: Directives in robots.txt, such as Crawl-delay, will override this setting.
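As a rough illustration of the Crawl-delay directive mentioned in the note, the sketch below parses a hypothetical robots.txt with Python's standard urllib.robotparser and reads back the delay requested for a given crawler. The file contents, the 10-second value and the crawl_delay_for helper are assumptions made up for the example. Pick a delay that suits your own server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask Bing's crawler to wait 10 seconds between
# requests, and leave the whole site open to every other crawler.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow:
"""


def crawl_delay_for(robots_txt: str, agent: str):
    """Return the Crawl-delay (in seconds) that applies to `agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print("bingbot delay:  ", crawl_delay_for(ROBOTS_TXT, "bingbot"))
    print("Googlebot delay:", crawl_delay_for(ROBOTS_TXT, "Googlebot"))
```

Throttling this way slows a crawler down without hiding any content from it, unlike a blanket Disallow rule.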

Code4Lib