Robots Are Our Friends
For a variety of reasons, cultural heritage organizations often have robots.txt documents that restrict what web crawlers (aka robots) can see on a website. This is a bad thing: it means that the content that libraries, archives and museums are putting online becomes virtually invisible to search engines like Google, Bing and Yahoo; is less likely to be shared on social media sites like Facebook, Twitter and Pinterest; and stands less of a chance of being incorporated into datasets such as Wikipedia. The Robots Are Our Friends campaign aims to help promote an understanding of the role that robots.txt plays in determining the footprint our cultural heritage collections have on the Web.
Background
Typical Reasons for Not Allowing Robots
- While indexing a dynamic site, robots can put extra strain on the server, causing slow responses or, in some cases, pegging the CPU at 100%.
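One way to relieve that strain without hiding an entire site is to disallow only the expensive dynamic endpoints, so the rest of the collection stays visible to search engines. A minimal robots.txt sketch along those lines (the /cgi-bin/ and /search paths are placeholders, not paths from any particular system):

  # Keep crawlers out of the expensive dynamic pages only,
  # leaving the rest of the collection crawlable.
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /search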
Throttling
1. Log in to Webmaster Tools (https://www.google.com/webmasters/tools)
2. Select your site
3. On the left-hand side, select Configuration and then Settings
4. Under Crawl rate, select Limit Google's maximum crawl rate
5. Use the slider to adjust the number of requests per second (or the number of seconds between requests)
6. Hit Save
Note: This may take a day or two to take effect, and it only lasts for 90 days, at which point Google goes back to selecting the crawl rate itself.
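The Webmaster Tools setting above only affects Googlebot, which does not honor the Crawl-delay directive in robots.txt. For crawlers that do honor Crawl-delay (Bingbot, for example), a similar slowdown can be sketched directly in robots.txt; the ten-second value here is just an illustration:

  # Ask crawlers that honor Crawl-delay to wait roughly
  # this many seconds between requests.
  User-agent: bingbot
  Crawl-delay: 10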
Sitemaps
HTML5 Microdata
Services
- Collection Achievements and Profiles System and DPLA Crawler Services
- Google Webmaster Tools
- Python's robotparser
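As a rough sketch of how a well-behaved crawler consults robots.txt, here is a short example using Python's robotparser (urllib.robotparser in Python 3); the example.org URLs and the MyFriendlyBot user agent are placeholders:

  # Minimal polite-crawler check with Python 3's urllib.robotparser.
  # example.org and MyFriendlyBot are placeholders for illustration.
  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("http://example.org/robots.txt")
  rp.read()  # fetch and parse the site's robots.txt

  url = "http://example.org/collections/item/42"
  if rp.can_fetch("MyFriendlyBot", url):
      print("OK to crawl:", url)
  else:
      print("robots.txt asks crawlers to skip:", url)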