Cultural heritage production is moving to the digital medium, and libraries' use of repository solutions such as Fedora Commons and DSpace is a solid response to this change. But how do we go from, for instance, a selection of 1990s computing technology to a collection of digital objects ready for ingest into your institution's local repository? Once you have ingested your digital objects, how are you going to provide access to these resources? The arrival of the Salman Rushdie Papers, which contain 10 years of Sir Salman Rushdie's digital life, gave Emory University Libraries the opportunity to explore these questions. I would like to talk about the approach the Emory University Libraries adopted, what we learned, and the coding challenges that remain.
== Indexing big data with Tika, Solr & map-reduce ==
The Web Archiving Service at the California Digital Library has
crawled a large amount of data, in every format found on the web: 30
TB, comprising about 600 million fetched URLs. In this talk we will
discuss how we parsed this data using Tika and map-reduce, and how we
indexed this data with Solr, tweaked the relevance ranking, and were
able to provide our users with a better search experience.
* Scott Fisher, California Digital Library, scott.fisher AT ucop BORK edu
* Erik Hetzner, California Digital Library, erik.hetzner AT ucop BORK edu
[[Category: Code4Lib2012]]