Changes

2011talks Submissions

3,061 bytes added, 16:49, 12 November 2010
The next phase of Penn State's institutional digital stewardship program will involve prototyping a suite of curation services to enable users to manage and enrich their digital content -- at the time of writing, we are just about to get started on this. The curation services will be implemented following the microservices philosophy, and they will be stitched together via OpenSRF. We will talk about why we chose the “road to SRFdom,” colliding the ILS world with the repository world; how we implemented the curation services and architecture; and how OpenSRF might be helpful to you. Code will be shown, beware.
 
== Enhancing the Performance and Extensibility of the XC’s MetadataServicesToolkit ==
* [http://tech.benanderson.us Ben Anderson], [http://www.extensiblecatalog.org/ eXtensible Catalog Organization], banderson@library.rochester.edu
 
Learn how we increased the performance of the [http://code.google.com/p/xcmetadataservicestoolkit/ XC Metadata Services Toolkit] (MST) by over 900%. The MST is an open-source Java application that uses SOLR and MySQL to harvest library metadata (MARC, DC) over OAI-PMH, clean it up, convert and frbrize it, and then make the new metadata (RDA flavor, XC Schema) available for harvesting. Our first release was too slow, and its performance degraded on large record batches; we needed the MST to process a library’s entire catalog in a reasonable amount of time on a common server. The MST was also intended to be extensible, since libraries will almost certainly want to customize this process in some way. Thus our second goal was to make it as easy as possible for a developer to write a service that can be plugged into the MST.
 
In the spring of 2010 we set out to accomplish our two goals. The first task was to establish how close the existing MST was to them. Concretely, the performance goal was to process 1M MARC records/hr with little to no degradation as the MST worked through several million records. The first service in our chain of services, the normalization service, served as our initial benchmark. It was processing records at a speed of 125k/hr, much slower than we had hoped. On top of that, the MST essentially crawled to a halt before it had processed 2M records. We were about an order of magnitude off on speed, and we needed to increase scalability substantially as well. Examining the steps involved in writing a new service for the MST also showed us that it was not easy to do. Internals of the MST were exposed to the service developer, who was expected to re-implement much of this internal code with no instructions on how to do so. Much work was needed to abstract the implementation of the MST away from the service developer.
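 
To put those figures side by side, here is a back-of-the-envelope calculation that uses only the numbers quoted above (125k records/hr measured, 1M records/hr targeted, a 2M-record batch). The class and method names are made up for illustration and are not part of the MST, and the constant-rate assumption is a simplification.

```java
// Back-of-the-envelope throughput check; names are illustrative, not MST code.
class ThroughputGap {
    // How many times faster the target rate is than the measured rate.
    static double requiredSpeedup(double measuredPerHour, double targetPerHour) {
        return targetPerHour / measuredPerHour;
    }

    // Hours needed to process a batch at a constant rate (ignores degradation).
    static double hoursFor(long records, double recordsPerHour) {
        return records / recordsPerHour;
    }

    public static void main(String[] args) {
        double measured = 125_000;   // normalization service rate quoted above
        double target = 1_000_000;   // stated goal
        System.out.println("speedup needed: " + requiredSpeedup(measured, target) + "x");
        System.out.println("hours for 2M records at measured rate: " + hoursFor(2_000_000L, measured));
        System.out.println("hours for 2M records at target rate: " + hoursFor(2_000_000L, target));
    }
}
```

The required speedup works out to 8x, which is what the text rounds to "about an order of magnitude". The constant-rate estimate is optimistic for the original MST, since its throughput fell as the batch grew.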
 
Working hard over the course of several months, we were able to accomplish both of our goals. The MST now processes records at a speed of 1.2M records/hr with no degradation on a set of 6M records on a less-than-optimal server (1.5 GHz CPU). In this talk, I will detail the strategies we used to accomplish this major speed enhancement (such as a shift from Apache SOLR to a hybrid SOLR/MySQL approach). With regard to our second goal, third-party developers can now download an MST development environment, write a few lines of code, and package their service for deployment into the MST. They need not concern themselves with the details of the internal MST implementation. In this talk, I will also walk through [http://code.google.com/p/xcmetadataservicestoolkit/wiki/HowToImplementService the steps] required to write a service for the MST.
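 
The linked wiki page is the authority on the real steps. Purely to illustrate the idea of hiding framework internals from the service author, here is a minimal sketch. Every name in it (RecordService, WhitespaceNormalizer, ServiceRunner) is hypothetical and not taken from the MST's actual API, and records are plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plug-in contract: the service author implements one method.
interface RecordService {
    // Turn one input record into zero or more output records.
    List<String> process(String record);
}

// Hypothetical example service: collapses runs of whitespace in a record.
class WhitespaceNormalizer implements RecordService {
    @Override
    public List<String> process(String record) {
        String cleaned = record.trim().replaceAll("\\s+", " ");
        List<String> out = new ArrayList<>();
        if (!cleaned.isEmpty()) {
            out.add(cleaned);   // drop records that are empty after cleanup
        }
        return out;
    }
}

// Hypothetical framework side: owns the iteration, so services never see it.
class ServiceRunner {
    static List<String> run(RecordService service, List<String> input) {
        List<String> output = new ArrayList<>();
        for (String record : input) {
            output.addAll(service.process(record));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("  245  $a Title  ", "   ");
        System.out.println(run(new WhitespaceNormalizer(), in));
    }
}
```

In this sketch the service author writes only the body of process; harvesting, storage, and batching would stay on the framework side, which is the kind of separation the paragraph above describes.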