UPDATE: The submission deadline has passed and voting on the talks has commenced at http://vote.code4lib.org/election/index/17
See the Call for Submissions for guidelines on appropriate talk topics and the criteria on which submissions are evaluated.
Please follow the formatting guidelines:
== Talk Title: ==
* Speaker's name, affiliation, and email address
* Second speaker's name, affiliation, email address, if second speaker
Abstract of no more than 500 words.
How "great" are the Great Books?
- Eric Lease Morgan, University of Notre Dame (emorgan at nd.edu)
In the 1960s a set of books called the Great Books of the Western World was published. It was supposed to represent the best of Western literature and enable the reader to further their liberal arts education. Sixty volumes in all, it included works by Plato, Aristotle, Shakespeare, Milton, Galileo, Kepler, Melville, Darwin, etc. These great books were selected based on the way they discussed a set of 102 "great ideas" such as art, astronomy, beauty, evil, evolution, mind, nature, poetry, revolution, science, will, wisdom, etc. How "great" are these books, and how "great" are the ideas expressed in them?
Given full text versions of these books it is almost trivial to use the "great ideas" as input and apply relevancy ranking algorithms against the texts thus creating a sort of score -- a "Great Ideas Coefficient". Term Frequency/Inverse Document Frequency (TFIDF) is a well-established algorithm for computing just this sort of thing:
relevancy = ( c / t ) * log( d / f ) where:
- c = number of times a given word appears in a document
- t = total number of words in a document
- d = total number of documents in a corpus
- f = total number of documents containing a given word
Thus, to calculate our Great Ideas Coefficient I sum the relevancy score for each "great idea" for each "great book". Plato's Republic might have a cumulative score of 525 while Aristotle's On The History Of Animals might have a cumulative score of 251. Books with a larger Coefficient could be considered greater. Given such a score a person could measure a book's "greatness". We could then compare the score to the scores of other books. Which book is the "greatest"? We could compare the score to other measurable things such as a book's length or date to see if there are correlations. Are "great books" longer or shorter than others? Do longer books contain more "great ideas"? Are there other books that were not included in the set that maybe should have been included?
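The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the tokenizer is deliberately naive, the natural logarithm is assumed because the formula does not name a base, the function names are invented for this sketch, and the three tiny "books" are made-up strings. The raw sums it produces are small fractions and are not meant to reproduce the example scores quoted above.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (a naive assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def relevancy(word, doc, corpus):
    """Compute ( c / t ) * log( d / f ) for one word in one document."""
    c = doc.count(word)                              # times the word appears in the document
    t = len(doc)                                     # total number of words in the document
    d = len(corpus)                                  # total number of documents in the corpus
    f = sum(1 for other in corpus if word in other)  # documents containing the word
    if t == 0 or f == 0:
        return 0.0                                   # empty document, or word absent from the corpus
    return (c / t) * math.log(d / f)                 # natural log is an assumption


def great_ideas_coefficient(doc, corpus, ideas):
    """Sum the relevancy score of every "great idea" for a single book."""
    return sum(relevancy(idea, doc, corpus) for idea in ideas)


# Toy corpus: three invented "books", not real texts.
corpus = [
    tokenize("Art and beauty; art and love."),
    tokenize("War, war, and more war, then love."),
    tokenize("The mind loves science; love the mind."),
]
ideas = ["art", "beauty", "mind", "love"]

for number, book in enumerate(corpus, start=1):
    print(number, round(great_ideas_coefficient(book, corpus, ideas), 4))
```

Note that a word appearing in every document (here "love") contributes nothing, because log( d / f ) is zero when d equals f.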
The first part of this talk describes the different steps involved in the text pre-processing to calculate an accurate TFIDF value for each item of the corpus. The results and statistical analysis are discussed in the second part. Finally I will outline the remaining work such as refining the analysis and extending the current quantitative process to a web implementation.
UNR BookFinder: Leveraging Google Books to Move Beyond Catalog Search
- Will Kurt, University of Nevada, Reno, (wkurt at unr.edu)
Google Books is a great tool, but it lacks an easy method allowing users to access the items they find through their library. The UNR BookFinder is a mashup of the Google Books and WorldCat APIs (and some ugly hacks) which allows users to search for items with the power of Google’s fulltext search while eliminating the need to search all of the library’s various resources to find an item. The UNR BookFinder automatically searches the catalog and consortial ILL for the item; if these fail, an ILLiad request form is automatically filled out. The end result is that the user can explore a universe of books and access them as fast as possible through the university library. A video of the alpha version can be found here.
Moving a large multi-tiered search architecture from dedicated hosts to the cloud
- Peter Ciuffetti, Senior Software Engineer, Credo Reference Ltd. (pete at credoreference.com)
So you want to move a large production search service from dedicated hosts to the cloud? The flexibility is enticing, the costs are attractive, the geek cred is undeniable. Our cloud adventure came with many undocumented surprises ranging from mysterious server behavior to sales engineers suggesting that 'maybe the cloud isn't for you'. We eventually made it all work and our production service is now on the cloud. This talk will cover what the cloud product FAQs don't say, what their tech support doesn't know (or won't say) and mistakes you can avoid by talking to the guys with the arrows in their backs.
VuFind Beyond MARC: Discovering Everything Else
- Demian Katz, Library Technology Development Specialist, Villanova University (demian dot katz at villanova dot edu)
The VuFind discovery layer has been providing a user-friendly interface to MARC records for several years now. However, library data consists of more than just MARC records, and VuFind has grown to accommodate just about anything you can throw at it. This presentation will examine the new workflows and tools that enable discovery of non-MARC resources and some of the non-traditional applications of VuFind that they make possible. Technologies covered will include OAI-PMH, XSLT, Aperture, Solr and, of course, VuFind itself.
Linked data apps for medical professionals
- Rurik Thomas Greenall, NTNU Library, (rurik dot greenall at ub dot ntnu dot no)
The promise of linked data for libraries has yet to be realized. As a demonstration of the power of RDF, HTTP-URIs and SPARQL, NTNU Library together with the Norwegian Electronic Health Library produced a linked data representation of MeSH and created a small translation app that can be used to help health professionals identify the right term and apply it in their database searches. This talk presents the simple ways in which the core technologies and concepts in linked data provide a solid, time-saving way of developing usable applications.
fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival Records using Open Source Forensic Software
- Mark A. Matienzo, Manuscripts and Archives, Yale University Library (mark at matienzo dot org)
Many of the complications of born-digital records involve preparing them for transfer into a storage or preservation environment. Digital evidence of any kind is easily susceptible to unintentional and intentional modification. This presentation will describe the use of open source forensic software in pre-ingest workflows for digital archives. Digital archivists and other digital curation practitioners can develop emergent processes to prepare records for ingest and transfer using a combination of relatively simple tools. The granularity and simplicity of these tools and procedures make it possible to integrate them smoothly into a digital curation environment built on micro-services.
Why (Code4) Libraries Exist
- Eric Hellman, President, Gluejar, Inc. (eric at hellman dot net)
Libraries have historically delivered value to society by facilitating the sharing of books. The library "brand" is built around the building and exploitation of their collections. These collections have been acquired and owned. As ebook readers become the preferred consumption platform for books, libraries are beginning to come to terms with the fact that they don't own their digital collections, and can't share books as they'd like to. Yet libraries continue to be valuable in many ways. In this transitional period, only one thing can save libraries from irrelevance and dissipation: Code.
The Story of TILE: Making Modular & Reusable Tools
- Doug Reside, MITH, University of Maryland (dougreside at gmail dot com)
The Text Image Linking Environment (TILE) is a collaborative project between the Maryland Institute for Technology in the Humanities (MITH), the Digital Library Program at Indiana University, and the School of Library and Information Science at Indiana University Bloomington. Since May 2009, the TILE project team has been developing through NEH Research & Development funding a web-based, modular, image markup tool for both semi-automated linking between encoded text and image of text, and image annotation. The software will be complete and ready for release in June 2011.
Many members of the TILE development team are also members of the Open Annotation Collaboration (OAC), and have therefore attempted to develop TILE’s annotation features to be OAC compliant. Like OAC, TILE assumes that the text and the images to be linked may exist at separate and completely unconnected servers. When a user starts the TILE tool for the first time, she is prompted to supply a URI to a TILE compliant JSON file.
TILE’s JSON is simple and thoroughly documented, and we provide several translators to map common existing metadata formats to the format. We have already created a PHP script that will generate TILE JSON from a TEI P5 document and are currently working to do the same for the METS files used in the Indiana University’s METS navigator tool.
Additionally, TILE provides a modular exporting tool that allows users to run the work they’ve done in TILE through an external translator and then download the result to the client computer. For example, a user may import a set of images and transcripts from a METS file at the Library of Congress, use TILE to link images and text, and then export the result as a TEI file. The TEI file may then be reimported to TILE at a later date to further edit or convert the file.
At Code4Lib, we will demonstrate the functionality of TILE and display a poster and provide handouts that describe the thinking behind TILE, how it is intended to be used, and details on how TILE is built and functions.
We Don’t Server Their Kind : Managing E-resources with Flat-File Databases
- Junior Tidal, Multimedia and Web Services Librarian, New York City College of Technology, CUNY (jtidal at citytech dot cuny dot edu)
Managing E-resources can be a daunting challenge. URLs, database names, and even vendors can change, go down, or simply cease to exist. My proposal involves the use of a PHP-based, flat-file database driven web tool for database management. The program was designed to fulfill two needs: ease of use for librarians who lack programming experience, and compliance with the security and technical restrictions imposed by the college’s IT department. My presentation will explore the development of this tool, challenges within its development, and future improvements. PHP code and the flat-file database will also be explained and provided to attendees. For a working demonstration feel free to visit the New York City College of Technology’s A-Z database page or the subject database page.
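To give a rough idea of what a flat-file approach to an A-Z database list can look like, here is a minimal sketch. It is not the speaker's tool, which is written in PHP; it is a Python illustration only, and the pipe-delimited layout, the column names, the function names, and the example.org entries are all assumptions made up for this sketch.

```python
import csv
import io

# Hypothetical flat file: one pipe-delimited line per database.
FLAT_FILE = """name|url|subject
Example Science Index|http://example.org/science|Science
Example Art Archive|http://example.org/art|Art
Example Biology Abstracts|http://example.org/biology|Science
"""


def load_databases(handle):
    """Read every row of the flat file and return the rows sorted A-Z by name."""
    rows = list(csv.DictReader(handle, delimiter="|"))
    return sorted(rows, key=lambda row: row["name"].lower())


def by_subject(rows, subject):
    """Return only the rows filed under the given subject (case-insensitive)."""
    return [row for row in rows if row["subject"].lower() == subject.lower()]


databases = load_databases(io.StringIO(FLAT_FILE))
for row in by_subject(databases, "science"):
    print(row["name"], row["url"])
```

Because the data lives in a single text file, a librarian could update a URL or a vendor name with a text editor, with no database server involved.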
Drupal 7 as a Rapid Application Development Tool
- Cary Gordon, President, The Cherry Hill Company & Board Member, The Drupal Association (cgordon at chillco dawt com)
Five years ago, I discovered that the Drupal CMS had a programming framework disguised as an API, and learned that I could use it to solve problems.
Drupal 7 builds on that to provide a powerful toolset for interfacing with, manipulating and presenting data. It empowers tool-builders by providing a minimal install option, along with a more powerful installation profile system that makes it easier for developers to package and distribute their applications.
Helping Open Source Succeed
Deciding if open source is an option for your institution, or what open source software matches your institution’s needs and capabilities, is a complex decision. LYRASIS is developing a new area of focus to assist libraries with decision tools and an open source software registry. We want to learn from the creators of open source software what questions institutions have when considering the adoption of open source software and what information you would like to see in a registry that compares various open source tools. A summary of topics discussed in this session will be openly published as part of LYRASIS’ program development plans and decision support resources.
The mission of the new and emerging LYRASIS Technology Services area is to serve members and the broader library community as a provider of expertise and capacity in open source based technology solutions. We think that viable roles for an organization supporting open source software are to: a) Increase understanding of open source technology within the library community, including value, benefits, risks, and costs; b) Assist in decision-making by providing resources to help libraries evaluate open source technologies, institutional readiness, and capacity for adoption; c) Support adoption and use of open source technologies and systems within libraries and consortia; d) Foster integration of open source software tools to expand the ability of existing programs to meet a range of library user needs; e) Develop and test new open source software programs, and contribute to the development of existing programs; f) Support long-term sustainability of viable, library-based open source software and systems. We recognize that these roles exist to some extent on a continuum, with latter services related to development and sustainability building on the knowledge and experience gained through deployment of existing open source systems. In turn, effective adoption and use depends on understanding open source systems and having resources to assist in decision-making and implementation.
With open source software in the “innovator” and “early adopter” stages in the library community, LYRASIS intends to focus its initial efforts on roles a through d in the above list: increased understanding, decision-support, and effective adoption and integration of existing library-focused open source systems. This session is focused on the decision-support services area of activity.
The impact of this session is expected to be far reaching, if initially subtle. With most of the session time devoted to discussion and interaction among peers on questions surrounding the adoption of open source software, participants will take away a deeper understanding of topics each institution should consider when looking at open source software. These findings, along with those of similar sessions around the country, will inform the creation and expansion of the free decision support tools being developed by LYRASIS.
Letting in the light: using Solr as an external search component
- Jay Luker, IT Specialist, ADS (jluker at cfa dot harvard dot edu)
- Benoit Thiell, software developer, ADS (bthiell at cfa dot harvard dot edu)
It’s well-established that Solr provides an excellent foundation for building a faceted search engine. But what if your application’s foundation has already been constructed? How do you add Solr as a federated, fulltext search component to an existing system that already provides a full set of well-crafted scoring and ranking mechanisms?
This talk will describe a work-in-progress project at the Smithsonian/NASA Astrophysics Data System to migrate its aging search platform to Invenio, an open-source institutional repository and digital library system originally developed at CERN, while at the same time incorporating Solr as an external component for both faceting and fulltext search.
In this presentation we'll start with a short introduction of Invenio and then move on to the good stuff: an in-depth exploration of our use of Solr. We'll explain the challenges that we faced, what we learned about some particular Solr internals, interesting paths we chose not to follow, and the solutions we finally developed, including the creation of custom Solr request handlers and query parser classes.
This presentation will be quite technical and will show a measure of horrible Java code. Benoit will probably run away during that part.
Working with DuraCloud: How to preserve your data in the cloud
- Bill Branan, DuraSpace, bbranan at duraspace dot org
- Andrew Woods, DuraSpace, awoods at duraspace dot org
Ever expanding digital collections have become the norm in academic libraries. As the size of collections grows, the need for simple-to-deploy yet powerful preservation strategies becomes increasingly important. The DuraCloud project, a cloud-hosted service for data management and preservation, is committed to bringing the availability and elasticity of the cloud to bear on the issue of digital preservation. This session will discuss the APIs and tools which can be used to communicate and integrate with the DuraCloud platform, providing an immediate connection to scalable storage available from multiple cloud storage providers, configurable services which can be run over your content out-of-the-box, and a development platform which can serve as the basis for ongoing data mining and analysis.
Visualizing Library Data
- Karen Coombs, OCLC, coombsk at oclc dot org
Visualizations can be powerful tools to give context to library users and to provide a clear picture for data-driven decision-making in libraries. Map mashups, tag clouds and timelines can be used to show information to users in new ways and help them locate materials to meet their needs. QR codes can help link users to materials that libraries have in their collections. Charts and graphs can be used to help analyze library collections (holdings) and compare them to other libraries. This session will show prototypes which combine tools like Google Chart API, Protovis and Simile Widgets with data from WorldCat, WorldCat Registry, Classify, Terminology Services, and Dewey.info to create vivid illustrations in library user interfaces and administration tools.
Kuali OLE: Architecture for Diverse and Linked Data
- Tim McGeary, Lehigh University, Kuali OLE Functional Council, tim dot mcgeary at lehigh dot edu
- Brad Skiles, Project Manager, Kuali OLE, Indiana University, bradskil at indiana dot edu
With programming scheduled to begin in January 2011 on the Kuali Open Library Environment (OLE), the Kuali OLE Functional Council is developing the requirements for an architecture for diverse data sets and linked data. With no frontrunner for one bibliographic data standard, and local requirements on what data will be accompanying or linked to the main record store, Kuali OLE needs to build a flexible environment for records management and access.
We will present the concepts of our planned architecture, a multi-repository framework, using a document repository, a semantic repository, and a relational repository, brokered on top of the enterprise service bus of Kuali Rice. As a community source project, this is an opportunity for the Kuali OLE partners to present our plans for discussion with the community, and we look forward to feedback, questions, and comments.
One Week | One Tool: ultra-rapid open source development among strangers
- Scott Hanrath, University of Kansas Libraries, shanrath at ku dot edu
- Jason Casden, North Carolina State University Libraries, jason_casden at ncsu dot edu
In summer 2010, the Center for History and New Media at George Mason University, supported by an NEH Summer Institute grant, gathered 12 'digital humanists' for an intense week of collaboration they dubbed 'One Week | One Tool: a digital humanities barn raising.' The group -- several of whom hang their professional hats in libraries and most of whom were previously unacquainted -- was asked to spend one week together brainstorming, specifying, building, publicizing, and releasing an open source software tool of use to the digital humanities community. The result was Anthologize, a free, open source plugin that transforms WordPress into a platform for publishing electronic texts in formats including PDF, ePub, and TEI; in other words, a "blog-to-book" tool. This presentation will focus on how One Week | One Tool addressed the challenges of collaborative open source development. From the perspectives of two library coders on the team, we will describe and provide lessons learned from the One Week development process including: how the group structured itself without predefined roles; how the one week time frame and makeup of the group -- which included scholars, grad students, librarians, museum professionals, instructional technologists, and more -- influenced planning and development decisions; the roles of user experience and outreach efforts; the life of Anthologize since the end of the week; and thoughts on what a one week, one 'library' tool could look like.
- Anne-Lena Westrum, Oslo Public Library (annelena at deichman dot no)
- Asgeir Rekkavik, Oslo Public Library (asgeirr at deichman dot no)
The Pode project at Oslo Public Library has experimented on the automated FRBRizing of catalogue records, as well as expressing bibliographic descriptions as linked data to enrich catalogue browsing with information from external sources.
When a library enduser searches the online catalogue for works by a particular author, he will typically get a long list that contains all the different translations and editions of all the books by that author, sorted by title or date of issue. The Pode project applied a method of automated FRBRizing, based on the information contained in MARC records, RDF representation and SPARQL queries, to demonstrate how an author's complete production can be presented as a lucid list of unique works that can easily be browsed by their different expressions and manifestations. Furthermore, by linking instances in the dataset to matching or corresponding instances in external sets, the presentation can be enriched with additional information about authors and works, as well as links to electronic full-text representations.
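The basic idea of collapsing a long result list into unique works can be sketched as a simple grouping step. This is a loose illustration and not the Pode project's method, which works on MARC records, RDF and SPARQL; here plain Python dictionaries stand in for catalogue records, the field names (`author`, `title`, `original_title`, `language`, `year`) and the function name are hypothetical, and the sample records are invented.

```python
from collections import defaultdict


def group_by_work(records):
    """Group flat records into work -> expression (language) -> manifestations."""
    works = defaultdict(lambda: defaultdict(list))
    for record in records:
        # Hypothetical work key: author plus original title, falling back
        # to the record's own title when no original title is recorded.
        title = record.get("original_title") or record["title"]
        work_key = (record["author"], title.lower())
        works[work_key][record["language"]].append(record)
    # Convert to plain dicts so that later lookups cannot create empty entries.
    return {key: dict(expressions) for key, expressions in works.items()}


# Invented sample records; the field names are assumptions, not MARC tags.
records = [
    {"author": "Author A", "title": "Verk En", "language": "nob", "year": 1990},
    {"author": "Author A", "title": "Work One", "original_title": "Verk En",
     "language": "eng", "year": 1995},
    {"author": "Author A", "title": "Verk En", "language": "nob", "year": 2005},
    {"author": "Author A", "title": "Verk To", "language": "nob", "year": 1998},
]

for (author, title), expressions in sorted(group_by_work(records).items()):
    print(author, "/", title)
    for language, editions in sorted(expressions.items()):
        print("  ", language, [edition["year"] for edition in editions])
```

Four catalogue records become two works, and the translation and the later edition are filed under the work they belong to rather than listed as separate hits. Real records rarely carry such a clean key, which is where the cataloguing consistency discussed in this talk matters.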
The talk will also present the work on making an RDF representation of the catalogue records for the whole collection of non-fiction documents at the Norwegian Multilingual library, linking subject headings and Dewey classes, and allowing endusers to browse the collection by the multilingual Dewey class labels published by OCLC at http://dewey.info.
The talk will focus on the challenges that technological advances such as these raise for cataloguers, who must deliver consistent and standardized catalogue records. Many cataloguers have a local and pragmatic focus on the library's own, already existing services. Attitudes like this might become a problem when emerging technologies find new applications for library catalogue data, as well as when the library wants to use data submitted by others to enrich its own services.
Touch and go: building a touch screen kiosk with software you already own
- Andreas K. Orphanides, North Carolina State University Libraries, akorphan at ncsu dot edu
Next-L Enju, NDL Search and library geeks in Japan
- Kosuke Tanabe, Keio University, tanabe at mwr dot mediacom dot keio dot ac dot jp
Next-L Enju is an open source integrated library system developed by Project Next-L, the library geek community in Japan launched in November 2006. It is built on open-source software (Ruby on Rails, PostgreSQL/MySQL and Solr) and supports modern ILS features (e.g. FRBR structure and RESTful WebAPI).
Enju has been implemented by some libraries, including the National Diet Library (NDL), the largest library in Japan. NDL has chosen Enju to provide a new search engine called "NDL Search" and has added some extra features (e.g. automatic FRBRization and providing bibliographic data in a Linked Data format). The development version is available at http://iss.ndl.go.jp/.
I'm one of the authors of Next-L Enju. I'd like to talk about the overview and structure of Next-L Enju, NDL Search and the activities of our project.
A community based approach to developing a Digital Exhibit at Notre Dame using the Hydra Framework
- Rick Johnson, University of Notre Dame, (rick dot johnson at nd dot edu)
- Dan Brubaker Horst, University of Notre Dame (dbrubak1 at nd dot edu)
It was clear to us early on that the scope of managing, preserving, and interacting with digital content is too much for any one institution to conquer by itself. We realized that we need help.
We were already fairly convinced that Fedora, Solr, and ActiveFedora were solid choices because of their strong development community and flexible, robust solutions. We were also exploring Blacklight for search and browse for the same reasons. The open questions were:
- What is the best way to put the pieces together?
- How do you tackle the heterogeneous content types and workflows without getting bogged down in each individual solution?
After connecting with folks from the Hydra project at Code4Lib it was immediately clear that we had many things in common:
- The same architectural choices: Fedora, Solr, ActiveFedora, Blacklight
- Similar design philosophies
- A need to work together
- Too many shared use cases to ignore
So, we jumped on board and have adopted the Hydra Framework for all of our Digital Repository efforts.
In our presentation we will cover:
- Why we chose to adopt the Hydra Framework instead of creating our own solution
- Why the community based approach is so appealing
- How we were welcomed into the Hydra development community
- Why we chose to create something beyond basic Blacklight search and facet browse
- How to create your own Digital Exhibit using Hydra including
- Metadata management
- Custom Browse and Search
The Hydra Project is actively seeking partnerships with other institutions to extend its efforts. Will your institution be next?
More about the Hydra Project
Mendeley's API and University Libraries: 3 examples to create value
- Ian Mulvany, Mendeley
Mendeley (http://www.mendeley.com) is a technology startup that is helping to revolutionize the way research is done. Used by more than 600,000 academics and industry researchers, Mendeley enables researchers to arrange collaborative projects, work and discuss in groups, as well as share data across its web platform. Launched in London in December 2008, Mendeley is already the world’s largest research collaboration platform. Through this platform, Mendeley anonymously pools users’ research paper collections, creating a crowd-sourced research database with a unique layer of social information - each research paper is connected with socio-demographic information about its audience.
Based on this platform and data, I will present three examples of how Mendeley is working to support university libraries and contribute to opening up academic research: 1) Mendeley’s integration as a workflow tool with institutional repositories with the aim of increasing IR deposit rates; 2) Application examples building on Mendeley’s API to showcase what is possible with the newly available type of usage data Mendeley is aggregating; 3) Preview of Mendeley’s library dashboard that will reveal content usage within an institution.
I would also hope that a subsequent discussion can address how you (the attendees) could envision Mendeley’s future in the library tech community.
ArticlesPlus: Summon API Client Implementation and Integration with Drupal 6
- Albert Bertram, University of Michigan (firstname.lastname@example.org)
On September 27, 2010, the University of Michigan launched ArticlesPlus, an application for web-scale article discovery using the Serials Solutions' Summon service as its search engine. Rather than providing a search box which sent our patrons to the interface provided by Serials Solutions, we used Summon's API to integrate the search as a feature of our library's website. In addition to Summon as the search engine, we used our Drupal instance for the interface engine.
I propose to talk about how we implemented the Summon API, the Drupal module we developed to access the Summon API, problems with implementing an interface ourselves, benefits of implementing the interface ourselves, and plans for future expansion or improved integration in our website.
Let's Get Small: A Microservices Approach to Library Websites
- Sean Hannan, Johns Hopkins University, shannan at jhu dot edu
Most, if not all, library websites are housed and maintained in singular, monolithic content management systems. This is fantastic if the library website is the one place your users go for library information. But it isn't. Users are going to Facebook, checking mobile applications, browsing portals as well as checking the library website. Wouldn't it be great if you could update the information on all of these sites from a single source? Why maintain the library hours in five different places?
In this talk, I will show how breaking the construction of the library website into as-needed, swappable microservices can free your content to live where it needs to, as well as free you from the maintenance headaches usually involved. What kind of microservices, you ask? Well, basic templating and styling is a given, but how about a microservice that gracefully degrades your layouts for older browsers? Or enforces highfalutin typographic rules? Or optimizes your site assets to improve load times? All wonderful little black boxes that allow you to focus on the website and its content, and not the details.
I promise at least one diagram. That will burn your eyes.
Sharing Between Data Repositories
- Kevin S. Clarke, NESCent/Dryad Data Repository, ksclarke at nescent dot org
Dryad (http://datadryad.org) is a generic subject repository that shares author submitted data with other scientific repositories. In a part "how we done it" and part "things to consider" talk, I'll discuss 1) why we chose BagIt and OAI-ORE as mechanisms for sharing our data, 2) how we've integrated with TreeBASE (http://www.treebase.org/ -- a subject repository of phylogenetic information), and 3) the possibility of this method of data sharing being adopted by other repositories within the larger DataONE community.
There will be cake.
Hey, Dilbert. Where’s my data?!
- Tommy Barker, University of Pennsylvania, tbarker at pobox dot upenn dot edu
Libraries are notorious for maintaining data in massively disparate systems such as databases, flat files, xml and web services. The data is rich and valuable to assessment, but extracting value from multiple systems is complex and time consuming. Yes, there are open source and commercial solutions available, but libraries have unique requirements that can be difficult to integrate into these products. Commercial options also tend to be overly complex or the cool features require an expensive enterprise edition.
With funding from the Institute of Museum and Library Services, UPenn is developing MetriDoc to address data integration headaches within the library, and support reporting requirements from management. MetriDoc’s mission is to provide an open source API / tool set where users can specify dataflows and use library based services to solve integration problems while MetriDoc worries about scalability and performance. MetriDoc accomplishes this with no complex xml configuration or scary SOA middleware, but instead uses a simple DSL where possible. Eventually the project will also include dashboards to assist with complex job management and data flow monitoring.
The first half of the presentation briefly discusses MetriDoc’s architecture while the remainder of the presentation will include code samples to illustrate problems it can solve. Information on how to contribute or download MetriDoc will be provided as well.
Open Data and the Biodiversity Heritage Library experience
- Trish Rose-Sandler, Missouri Botanical Gardens, trish dot rose dash sandler at mobot dot org
The Biodiversity Heritage Library (BHL) is an international consortium of the world’s leading natural history museum libraries, botanical libraries, and research institutions organized to digitize, serve, and preserve the legacy literature of biodiversity. From the beginning the BHL partners conceived of the BHL collection as being “open” – available to anyone regardless of geographic location or affiliation and linked into a global Biodiversity Commons. This talk will discuss the basic principles of open data and use BHL as one example of how those principles have played out in a real world context.
What does it mean for data to be “open” and what tools or services can enable this? Our metadata is purposely “open” so that others can harvest it and repurpose it in different contexts. We make it available through both OAI-PMH and APIs.
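As a rough illustration of what harvesting over OAI-PMH involves, the sketch below builds a `ListRecords` request URL and pulls identifiers and Dublin Core titles out of a response. The endpoint `https://example.org/oai` and the sample record are invented for the example and are not BHL's actual service or data; only the `verb` and `metadataPrefix` parameters and the two XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint, used only to show the shape of a request.
BASE_URL = "https://example.org/oai"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response, trimmed to the elements the sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Flora of an invented region</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


print(list_records_url(BASE_URL))
print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also follow `resumptionToken` elements to page through large result sets, which this sketch leaves out.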
If you “open” your data will they come? In some cases yes. BHL can give examples of scientists and science services that have taken our data and exploited it for other purposes (e.g. BioStor, Earthcape, EOL, ZipcodeZoo). Yet in a recent BHL survey we learned that, of our frequent users, 42% were not aware that we provided APIs and 31% did not understand what APIs were. Clearly, promotion of your open data is a key activity in making it truly useful.
What are some advantages to open data? Harvestable data allows data that was created for a specific purpose and audience (e.g. historic texts, nomenclatural services, encyclopedias) to interact with other data and serve new, previously unimagined, roles. For BHL, opening our data grew out of a desire to do three things: 1) make biodiversity data available to foster scientific research, 2) support the public use of these data, and 3) build a web of science.
The Road to SRFdom: OpenSRF as Curation Microservices Architecture
- Dan Coughlin, Penn State University (email@example.com)
- Mike Giarlo, Penn State University (firstname.lastname@example.org)
OpenSRF is the XMPP-based framework that underlies the Evergreen ILS, providing a service-oriented architecture with failover, load-balancing, and high availability. Curation microservices represent a new approach to digital curation in which typical repository functions such as storage, versioning, and fixity-checking are implemented as small, independent services. Put them together and what do you have?
The next phase of Penn State's institutional digital stewardship program will involve prototyping a suite of curation services to enable users to manage and enrich their digital content -- we were just about to get started on this at the time this proposal was written. The curation services will be implemented following the microservices philosophy, and they will be stitched together via OpenSRF. We will talk about why we chose the “road to SRFdom,” colliding the ILS world with the repository world, how we implemented the curation services & architecture, and how OpenSRF might be helpful to you. Code will be shown, beware.
The Constitution of Library: Intelligent Approaches to Composing Fine Grained Microservices
- Simon Spero, (cthulhu at unc dot edu)
- Doctoral Student, School of Information and Library Science, University of North Carolina at Chapel Hill.
- Senior Partner, Spero Investigations. "We Hope the Helpless".
Abstract: As the amount of content that's in (or that's supposed to be in) institutional and other large scale repositories continues to grow, the performance requirements for a ubiquitous digital curation fabric become much harder to meet. At the same time, the policy requirements for managing this information become increasingly complicated, and the additional staff available to support these requirements continues to be predominately unicorn-american.
With requirements becoming more complicated, preservation actions need to be provided at a very fine granularity; however, composing these services into useful workflows becomes more and more complicated, and making sure that those workflows are supporting desired policy goals virtually impossible.
This talk will describe proven technologies for intelligent planning that have been used for tasks ranging from deploying armies to flying spacecraft (and less relevantly, for composing web services). The talk will also briefly overview some of the techniques used to optimize dynamic programming languages and HPC message passing systems, and suggest how they can be used to reduce or eliminate the overhead of fine grained microservices to support the rates of ingest and access needed to survive in a born-curated world.
Enhancing the Performance and Extensibility of the XC’s MetadataServicesToolkit
Learn how we increased the performance of the XC Metadata Services Toolkit (MST) by over 900%. The MST is an open-source Java application that uses SOLR and MySQL to harvest (OAI-PMH) library metadata (MARC, DC), clean it up, convert and frbrize it, and then make new metadata (RDA flavor, XC Schema) available for harvesting. Our first release performed too slowly, with performance degrading on large record batches, and we needed to enable the MST to process a library’s entire catalog in a reasonable amount of time on a common server. The MST was also intended to be extensible. Libraries will almost certainly want to customize this process in some way. Thus our second goal was to make it as easy as possible for a developer to write a service which can be plugged into the MST.
In the spring of ‘10 we set out to accomplish our 2 goals. The first task was to establish how close the existing MST was to these goals. More concretely, our goal was to be able to process 1M MARC records/hr and have little to no degradation as the MST processed several million records. The first service in our chain of services, the normalization service, served as our initial metric. The normalization service was processing records at a speed of 125k/hr, much slower than we hoped for. On top of that, before processing 2M records, the MST essentially crawled to a halt. We were about an order of magnitude off and we needed to increase scalability in a substantial way as well. Also, examining the steps involved in writing a new service for the MST showed us that it was not easy to do so. Internals of the MST were exposed to the service developer and the developer was expected to re-implement much of this internal code with no instructions on how to do so. Much work needed to be done to abstract the implementation of the MST away from the service developer.
Working hard over the course of several months, we were able to accomplish both of our goals. The MST is now processing records at a speed of 1.2M records/hr with no degradation on a set of 6M records on a less than optimal server (1.5GHz cpu). In this talk, I will detail the specifics of the strategies we used to accomplish this major speed enhancement (such as a shift from Apache SOLR to a hybrid SOLR/MySQL approach). In regards to our second goal, third party developers can now download an MST development environment, write a few lines of code, and package their service for deployment into the MST. Third party developers need not concern themselves with the details of the internal MST implementation. In this talk, I will also walk through the steps required to write a service for the MST.
Free my DSpace Data! How to get your data out of DSpace 1.7 and restore your content after a disaster.
- Tim Donohue, DuraSpace, tdonohue at duraspace dot org
For a while, DSpace has provided many means to get content into the system (or create new content in the system), e.g. basic ingest packages, user interfaces, SWORD. However, getting your content out of DSpace, especially for backups or migrations has often been problematic. In the past, although individual Items could be exported in standard formats, entire Collections or Communities (and the relationships between them) could not be as easily exported.
DSpace 1.7.0 provides a new AIP Backup & Restore feature which allows DSpace to export all of its contents (Communities, Collections, Items, Groups, People, Permissions, and relationships between all objects) into a series of METS-based Archival Information Packages (AIPs). As these AIPs are just zip files, they can be backed up using your normal backup practices (e.g. to tape, hard-drive, or even to the cloud via a service like DuraCloud). As these AIPs also fully describe your DSpace contents, they can be used to restore your entire DSpace after a local server crash or larger disaster.
DSpace created AIPs use standard library metadata formats like MODS, PREMIS and METSRights (along with a few DSpace-specific ones where a "standard format" doesn't yet exist) to describe all the content housed in your DSpace installation. This comes in handy, should you ever decide to migrate some or all of your contents to another DSpace instance or another system altogether.
This talk will describe this new DSpace AIP Backup & Restore feature and provide hints/tips on how it can be used to back up, restore, or migrate data. Time permitting, I can also touch on the DSpace Roadmap and other ideas/plans to "free your DSpace data".
Using cloud-based services to leverage open source software
- Erik Mitchell, Wake Forest University, mitcheet at wfu dot edu
Open source software and cloud computing systems are perceived as enticing technologies for both IT staff and IT/Academic administrators. The implementation of open source software or adoption of cloud services is often met with resistance, however, because of lack of technical expertise in smaller organizations or lack of perceived benefit in larger organizations. Although these technologies are not necessarily related, when combined they offer easy deployment of services without significant organizational investment or local expertise. This ability allows organizations to leverage open source systems without the overhead typically associated with "free as in a free kitten."
While there are some large national projects looking at using cloud platforms to deliver new services there is an opportunity for a grassroots effort to develop and support pre-configured application servers that are simple to deploy and maintain. These 'disposable' servers would serve the needs of both small and large libraries by enabling them to adopt open source software without taking on the requirement of local infrastructure, configuration, or detailed support.
This presentation will cover the technical details and lessons learned from efforts to create this type of service on the Amazon EC2 platform and discuss the impact of this approach on open software adoption and its potential impact on IT support in libraries.
VIVO: Enabling National Networking of Scientists
- Brian Keese, Indiana University, bkeese at indiana dot edu
VIVO is an open-source semantic Web application that enables the discovery of research and scholarship across disciplines at an institution. Originally developed from 2003-2009 by Cornell University, in September 2009 the National Institutes of Health's National Center for Research Resources made a grant to the University of Florida, Cornell University, Indiana University Bloomington, and four implementation partners to use VIVO to create a national network for scientists. This network will allow researchers to discover potential collaborators with specific expertise, based on authoritative information on projects, grants, publications, affiliations, and research interests, essentially creating a social network for browsing, visualizing, and discovering scientists. This talk will give an overview of the technical underpinnings of VIVO, describe how it integrates with the larger semantic Web, sketch out the plans for enabling discovery across the national network of VIVO sites, and explore the role of libraries in implementing VIVO at all the partner sites. Additionally we will demonstrate some experiments in federated searching that have been undertaken by the VIVO network and the NIH-funded Clinical and Translational Science Awards (CTSA) consortium network of networks.
Mass Moves with Worldcat APIs
- Sam Kome, Claremont Colleges Library, sam.kome at cuc dot claremont dot edu
Claremont needed to perform a mass evaluation of item level records to facilitate large scale collection moves and de-accession. Our de-accession criteria, for example, include that 3 or more copies of any book must be available in the 50+ libraries in our Link+ network. We addressed our requirements with the help of the OCLC Worldcat Search and xID APIs and a couple of simple Python scripts. The process was ultimately a success. We will present our approach, code, and the lessons learned as we discovered limits inherent in the APIs and in our own coding (in)experience. Bonus sub-topic: the use of OCLC Work ID to identify and coalesce alternative ISBNs.
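The de-accession rule described above can be sketched as a small filter. This is only an illustration under stated assumptions, not the speaker's scripts: it makes no API calls and assumes holdings have already been fetched into plain dictionaries. The library symbols, ISBN keys, and work IDs are all invented, and the number of distinct holding libraries is used as a stand-in for the number of copies.

```python
def merge_holdings_by_work(holdings_by_isbn, work_id_by_isbn):
    """Coalesce holdings for alternative ISBNs that share a work ID."""
    merged = {}
    for isbn, symbols in holdings_by_isbn.items():
        # Fall back to the ISBN itself when no work ID is known.
        work_id = work_id_by_isbn.get(isbn, isbn)
        merged.setdefault(work_id, set()).update(symbols)
    return merged


def network_holder_count(symbols, network_symbols):
    """Count how many libraries in the network hold the title."""
    return len(set(symbols) & set(network_symbols))


def deaccession_candidates(holdings_by_work, network_symbols, minimum=3):
    """Return work IDs held by at least `minimum` network libraries."""
    return sorted(
        work_id
        for work_id, symbols in holdings_by_work.items()
        if network_holder_count(symbols, network_symbols) >= minimum
    )


# Invented data: four network libraries and three ISBNs for two works.
NETWORK = {"LIB1", "LIB2", "LIB3", "LIB4"}
HOLDINGS_BY_ISBN = {
    "isbn-a": ["LIB1", "LIB2"],
    "isbn-b": ["LIB3", "OTHER"],
    "isbn-c": ["LIB1", "OTHER"],
}
WORK_ID_BY_ISBN = {"isbn-a": "work-1", "isbn-b": "work-1", "isbn-c": "work-2"}

merged = merge_holdings_by_work(HOLDINGS_BY_ISBN, WORK_ID_BY_ISBN)
print(deaccession_candidates(merged, NETWORK))
```

Neither ISBN of the first work meets the threshold on its own, but the merged work does, which is the point of coalescing alternative ISBNs before applying the rule.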
A Simple Algorithm for User Query Classification & Resource Recommendation
- Josh Bishoff, University of Illinois, bishoff2 at illinois dot edu
One of the longstanding problems in library services is how we might automatically direct users to the most appropriate personnel, databases or facilities to meet their information need. Utilizing the faceted navigation features of various next-gen catalogs, we can efficiently & very accurately assign subject domains to user search queries.
For example: if a user searches “Gallium Arsenide” in the library discovery layer, we can first broadcast this query to a suitably large OPAC and receive the following subject distribution:
-Engineering & Technology: 45%
-Physical Sciences: 21%
…and so on.
By leveraging the cataloging efforts that have classified large collections, we can efficiently classify queries with a high rate of accuracy. By applying this approach to the library discovery layer, we can offer users tailored result sets from subject-specific A & I services. We can also recommend subject specialists & most appropriate campus libraries.
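A minimal sketch of the idea, assuming the catalog has already returned subject facet counts for the query: normalize the counts into a distribution and accept the top subject only when it clearly dominates. The function names, the `min_share` threshold, and the "Other" bucket are assumptions made for illustration; the two percentages come from the example above.

```python
def subject_distribution(facet_counts):
    """Turn raw facet counts into shares that sum to 1."""
    total = sum(facet_counts.values())
    if total == 0:
        return {}
    return {subject: count / total for subject, count in facet_counts.items()}


def classify_query(facet_counts, min_share=0.4):
    """Return the dominant subject, or None when no subject dominates.

    Returning None for a flat distribution is one way to limit the
    damage of a wrong automatic classification.
    """
    distribution = subject_distribution(facet_counts)
    if not distribution:
        return None
    subject, share = max(distribution.items(), key=lambda item: item[1])
    return subject if share >= min_share else None


# Facet counts shaped like the "Gallium Arsenide" example; the "Other"
# bucket is invented so that the counts add up to 100.
GALLIUM_ARSENIDE = {
    "Engineering & Technology": 45,
    "Physical Sciences": 21,
    "Other": 34,
}

print(classify_query(GALLIUM_ARSENIDE))
```

The chosen subject could then key into a lookup of subject-specific databases, specialists, or campus libraries.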
This presentation will discuss the technical challenges of implementing such a system and the trouble with mapping traditional subject classifications to non-book resources (databases, people, buildings, etc.). The dangers of incorrect automatic query classification will be discussed, along with strategies to combat this. A functional system will be demonstrated and code will be made available.
Beyond Sacrilege: A CouchApp Catalog
- Gabriel Farrell, Drexel University, email@example.com
At Code4LibCon2008 Dan Scott gave us a taste of yummy CouchDB, a document-oriented database with a RESTful JSON API (http://couchdb.org/). Since then, CouchDB has passed the 1.0 mark and landed on desktops, the cloud, and mobile devices. With the advent of CouchApps (web apps served directly from CouchDB) applications can be built that are as easy to install as the replicating of databases (which is super easy!). I'll discuss the advantages and challenges in designing a CouchApp to be used as a catalog, repository, or directory of resources. Some things are made fairly simple, such as site templating, the outputting of documents in different formats, and the attachment of binary objects to documents. Some things, like document versioning and the modeling of data, are a little trickier, but still straightforward. And some things, such as granular authentication and the integration of search, are tangled enough to produce some head-on-wall banging. But hey, take it easy. It’s time to relax.
Check out some code at http://github.com/gsf/catlg.
(Yet Another) Home-Grown Digital Library System, Built Upon Open Source XML Technologies and Metadata Standards
- David Lacy, Library Software Development Specialist, Villanova University (david dot lacy at villanova dot edu)
We have recently rearchitected our homegrown digital library utilizing an all-XML framework. The system is comprised of a data repository residing in a native XML database (eXist-DB), a metadata editor constructed using a Java-based XForms processor (Orbeon Forms), and a series of services for image manipulation, OCR processing and OAI-PMH serving. In this talk, I will detail our workflow process from scanning to online publishing, demonstrate the software's flexible configuration and features, and show how these steps allow rapid digital preservation and online access. Oh, and it's open source, so I'll show you where to get it as well.
Programming Latent Semantic Analysis for Large Digital Corpora
- Wally Hooper, Indiana University (firstname.lastname@example.org)
- Kirk Hess, Indiana University (email@example.com)
The Chymistry of Isaac Newton Project http://www.chymistry.org is publishing one hundred eighteen alchemical manuscripts written by Isaac Newton, thirty-two of which are now publicly available using TEI and Unicode encodings and served using the eXtensible Text Framework (XTF) engine. The National Science Foundation has funded a three-year project (2009–12, #0620868) to develop computational tools for the analysis of the alchemical language in Newton’s alchemical corpus. This project is applying computational tools from the fields of computational linguistics, information retrieval, and network sciences to mine and analyze Newton’s manuscripts.
One technique, Latent Semantic Analysis (LSA), has been used by the project to create a set of tools to discover the semantic structure and organization of the corpus of text, and has discovered shared passages, phrases, and technical vocabulary across the corpus. We thought many projects with TEI data might want to do LSA, but may not know how. We’ll discuss creating tools for LSA to analyze TEI-encoded text using XSL, Perl, PHP, and a mathematical/statistical software package (e.g. Matlab); having a supercomputer handy is helpful but not required! We'll walk through our method for chunking text, building a term-document matrix, executing singular value decomposition, and outputting that data as correlated document pairs and in GraphML format so it can be analyzed in a network analysis and visualization tool (e.g. Network Workbench).
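The pipeline named above (chunks, term-document matrix, singular value decomposition, correlated document pairs) can be sketched in a few dozen lines. This is a toy under stated assumptions, not the project's tooling: it uses pure Python rather than the XSL, Perl, PHP, and Matlab mentioned, tokenizes on whitespace, and gets the leading singular directions by power iteration on the document Gram matrix instead of calling a library SVD. The four sample chunks are invented and deliberately share no vocabulary across topics.

```python
import math
from itertools import combinations


def term_document_matrix(chunks):
    """Rows are terms, columns are chunks, cells are raw counts."""
    vocab = sorted({word for chunk in chunks for word in chunk.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0.0] * len(chunks) for _ in vocab]
    for j, chunk in enumerate(chunks):
        for word in chunk.lower().split():
            matrix[index[word]][j] += 1.0
    return vocab, matrix


def gram(matrix):
    """Document-by-document matrix A^T A."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


def top_eigenpairs(sym, k, iterations=200):
    """Leading k eigenpairs of a symmetric matrix via power iteration
    with deflation; the eigenvalues are the squared singular values."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]  # remove the found direction
    return pairs


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def correlated_pairs(chunks, k=2, threshold=0.9):
    """Pairs of chunks whose rank-k document vectors are nearly parallel."""
    _, matrix = term_document_matrix(chunks)
    eigenpairs = top_eigenpairs(gram(matrix), k)
    coords = [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in eigenpairs]
              for j in range(len(chunks))]
    pairs = []
    for i, j in combinations(range(len(chunks)), 2):
        similarity = cosine(coords[i], coords[j])
        if similarity >= threshold:
            pairs.append((i, j, similarity))
    return pairs


CHUNKS = [
    "mercury sulphur salt mercury",
    "mercury sulphur",
    "light prism colour",
    "light prism light colour light",
]

for i, j, similarity in correlated_pairs(CHUNKS):
    print(i, j, round(similarity, 3))
```

On a real corpus the matrix would usually be weighted and far larger, which is where a numerical package (and perhaps the supercomputer) comes in. The pair list is the kind of edge list that could then be written out as GraphML.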
Adventures In Implementing an Extended FRBR Model
- PaulBen McElwain, Digital Library Program, Indiana University (pbmcelwa at indiana dot edu)
The Variations/FRBR Project (http://vfrbr.info) has developed an implementation of an extended FRBR/FRAD conceptual model.
The model encompasses the entities defined in FRBR along with some further entities from FRAD, the attributes defined for those entities, and the relationships between the entities. One extension to the FRBR model is through the addition of some entity attributes needed for MARC attributes important to collections of recorded music. But the most interesting, and challenging, extension (from a data model perspective) is the addition of a structured set of properties for the attributes of the entities, and properties for the entity relationships. The place of publication/distribution, for instance, can include properties for type, jurisdiction, normalized value, and source vocabulary, all in addition to the string value of the place.
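To make the idea of attributes carrying their own properties concrete, here is a small illustrative data structure. It is not the project's XML Schema or Java class structure; the class and field names are assumptions chosen to mirror the properties listed above (type, jurisdiction, normalized value, source vocabulary, and the string value), and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StructuredAttribute:
    """An attribute value plus properties describing that value."""
    value: str                          # the string value as transcribed
    type: Optional[str] = None          # e.g. "publication" or "distribution"
    jurisdiction: Optional[str] = None
    normalized: Optional[str] = None    # normalized form of the value
    vocabulary: Optional[str] = None    # source vocabulary for the normalized form


@dataclass
class Manifestation:
    """A minimal FRBR-style entity holding structured attributes."""
    title: str
    places: List[StructuredAttribute] = field(default_factory=list)


# Invented example record.
recording = Manifestation(
    title="Example recording",
    places=[
        StructuredAttribute(
            value="N.Y.",
            type="publication",
            jurisdiction="United States",
            normalized="New York",
            vocabulary="example-vocabulary",
        )
    ],
)

print(recording.places[0].normalized)
```

The same shape could hang off relationships as well as attributes, which is the second half of the extension described above.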
The model was defined in XML Schema and then implemented in a Java class structure with a relational database for persistence. The implemented data service currently supports a user search application, data exports in multiple structured formats (FRBR XML, RDF/XML), and is also designed to support an interactive cataloging interface.
The presentation discusses the model designs developed, the technologies considered, and the implementations produced. This presentation should be of interest to other projects considering complex models of shared hierarchies implemented across XML Schema, Java, and relational data stores (via JPA).
Web Services and Library Systems
- Denis Galvin, Rice University, dgalvin at rice dot edu
- Mang Sun, Rice University, mang dot sun at rice dot edu
Web Services are ideal for ingesting and reformatting information. Whereas traditional Library systems like the ILS are rigid, Web Services are flexible. New interfaces can be created to display both catalog records and patron services. Item information can be integrated into third party systems like discovery layers. Web Services also represent an industry standard approach to systems integration. They offer the opportunity to bring together library operations with business systems.
Currently Rice University is experimenting with Web Services for its mobile OPAC. Two interfaces have been created, one for catalog searching and one for patron services. They are lightweight and ideal for mobile devices. In this session we will talk about how libraries can and might use Web Services now and in the future.
GIS on the Cheap
- Mike Graves, University of North Carolina at Chapel Hill, gravm at email dot unc dot edu
Using a few tools that you probably already have lying around your library, I'll show how OpenLayers can be used to create a dynamic interface for browsing and searching your digital collections geographically. With very little effort Solr can be made to serve up results directly into OpenLayers, creating all sorts of mapping possibilities. With a little more effort a Postgres database can handle complex polygon searches. I'll talk about how we're developing a lightweight GIS framework to provide a new user experience for interacting with a number of our collections in unique ways.
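One possible way to get search results onto a map, and not necessarily the approach taken in this talk, is to reshape the documents from a Solr JSON response into a GeoJSON FeatureCollection that a mapping library can load as a vector layer. The field names `lat` and `lon` and the sample documents are assumptions for illustration.

```python
import json


def solr_docs_to_geojson(docs, lat_field="lat", lon_field="lon"):
    """Convert a list of Solr result documents into GeoJSON points.

    Documents without both coordinate fields are skipped.
    """
    features = []
    for doc in docs:
        if lat_field not in doc or lon_field not in doc:
            continue
        properties = {key: value for key, value in doc.items()
                      if key not in (lat_field, lon_field)}
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered longitude, then latitude.
            "geometry": {
                "type": "Point",
                "coordinates": [float(doc[lon_field]), float(doc[lat_field])],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Invented documents, shaped like the "docs" list of a Solr response.
SAMPLE_DOCS = [
    {"id": "map-1", "title": "Example map", "lat": 35.9, "lon": -79.05},
    {"id": "no-coords", "title": "Record without coordinates"},
]

print(json.dumps(solr_docs_to_geojson(SAMPLE_DOCS), indent=2))
```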
Building an Open Source Staff-facing Tablet App for Library Assessment
- Jason Casden, NCSU Libraries, jason_casden at ncsu dot edu
- Joyce Chapman, NCSU Libraries, joyce_chapman at ncsu dot edu
Many libraries currently produce manual headcounts or activity counts for physical spaces in order to better understand patron needs as well as the use of spaces and services, but struggle with the difficulties of collecting data on more than one aspect of use, as well as organizing and analyzing the resulting data. The availability of tablet devices, such as the Apple iPad, has created an opportunity to simplify and encourage the collection of fine-grained data about the use of library physical spaces. The streamlined collection and centralized management of space usage data could also enable more sophisticated and rapid quantitative assessment methods that would significantly reduce the technical barriers for librarians to employ space usage data for assessment.
This talk will present a library assessment and software development perspective on the creation and utility of an open source tablet-based tool for collecting and analyzing data about the use of library physical spaces. Building on recent experience developing web-based and native-iPhone library apps, we will discuss complicating implementation-related issues such as platform dependence, intermittent network coverage (data caching), and centralized data synchronization with multiple collectors. HTML5 and co-evolving technologies (specifically, Web SQL client-side storage) can be utilized to balance the various advantages of web-based apps with the performance of native apps, but implementation choices can directly impact both the types of data that can be collected and the cost of adoption of an open source release. Finally, we will use an early prototype of this tool to demonstrate some new assessment possibilities.
APPLICATIONS AT THE HEART OF A NEW PUBLISHING ECOSYSTEM
- Rafael Sidi, VP Product Management, Elsevier (firstname.lastname@example.org)
During the last decade, computing developments in information discovery have had a significant impact on the research breakthroughs that enhance our society. In the course of thousands of interviews with researchers, developers and industry influencers, we uncovered trends that are shaping lean research globally – workflow efficiencies, funding pressures, government policies and global competition. We also looked at key trends defining the future of the web – openness and interoperability, personalization, and collaboration and trusted views, and saw an opportunity to create an ecosystem that empowers the scientific community to innovate, create and discover applications that leverage scientific literature to improve their search and discovery process.
This session explores this new ecosystem that enables developers, researchers and research institutions to develop applications that leverage public domain and licensed content. We will talk about a platform that enables collaboration with the scientific community (researchers and developers) on solutions that target specific researcher interests and workflows. We will explain how publishers can offer their content through APIs and how publishers and platform providers can present developers with application building tools. This ecosystem will create a channel where developers can collaborate with researchers in developing new applications. These same publishers and platform providers have an opportunity to serve as the host of the new scientific knowledge ecosystem that is evolving. This fresh approach in scientific publishing would set a new paradigm in the way research information is discovered, used, shared and re-used to accelerate science.
Enhancing the Mobile Experience: Mobile Library Services at Illinois
- Josh Bishoff, University of Illinois, bishoff2 at illinois dot edu
The University of Illinois Libraries launched a mobile interface in Spring 2010 that includes a custom mobile catalog layer built on top of VuFind. It allows patrons to request books for delivery, to browse the local and CARLI consortium catalogs, and access account information for renewal & checking hold status. This presentation will focus on new features designed to add value for the mobile user, such as adding Google map links to catalog records, offering current information for campus bus stops, and automatic device detection for users accessing the full-sized library gateway from their mobile device. I’ll discuss how developing for the mobile context, and talking to mobile users, has informed the further development & improvement of library web services overall.
Reuse of Archival Description for Digital Objects
- Jason Ronallo, NCSU Libraries, jason_ronallo at ncsu dot edu
In order to deal with the modern records explosion, archives have devised methods of processing and describing materials at a broad level, rather than at the item-level. This has culminated in what’s becoming a widely adopted approach to archival processing called "More Product, Less Process," where less fine-grained descriptive metadata is created. Except for highly valued materials which may still receive detailed archival description--think Thomas Jefferson's letters--this approach usually does not enable item-level discovery through an archival finding aid. However, it does make collections more readily available and helps repositories move through backlogs of unprocessed collections. Some in the profession have begun to advocate for a similar approach to the digitization of archival and manuscript materials. A growing trend in digitization is the large scale digitization of collections, where the creation of discovery-enabling detailed descriptive metadata for every object is traded for the rapid access to large swaths of collections.
Reuse of archival description for digital objects can help streamline that workflow as well as improve access. What is meant by reusing archival description for digital objects? What does it look like in practice? What new tools can be developed to support this approach to descriptive metadata?
This talk will be an exploration of the interplay of archival description and descriptive metadata for digital objects. The focus will be on the tools and challenges in automating this workflow. Examples will draw from the work at NCSU Libraries with the Special Collections Research Center and include coverage of currently used tools, including locally-developed open-source, as well as future directions for development. Topics covered will include:
- Necessary preconditions and conventions for this to work
- Reuse of archival description from EAD XML for digital objects with simple tools
- Generation of stub descriptive metadata records for digital objects
- The continual refresh of metadata in the access layer throughout its lifecycle
- Later enhancement of (select?) stub records
- Reuse of enhanced digital object description in finding aids
- Future directions?
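As a rough sketch of the second and third topics in the list above, the code below walks an EAD document and generates stub descriptive records for file-level components, reusing the collection and series titles as context. It is not the NCSU tooling: the EAD fragment is invented, uses un-namespaced EAD 2002 element names with unnumbered `c` components, and the keys of the stub record are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented finding aid fragment.
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c level="file">
          <did>
            <unittitle>Letters</unittitle>
            <unitdate>1901-1905</unitdate>
            <container type="box">1</container>
            <container type="folder">3</container>
          </did>
        </c>
      </c>
    </dsc>
  </archdesc>
</ead>"""


def stub_records(ead_xml, level="file"):
    """Build one stub record per component at the given level."""
    root = ET.fromstring(ead_xml)
    collection = root.findtext("archdesc/did/unittitle", default="")
    stubs = []

    def walk(component, ancestors):
        title = component.findtext("did/unittitle", default="")
        if component.get("level") == level:
            stubs.append({
                "title": title,
                "date": component.findtext("did/unitdate", default=""),
                # Inherited context from the collection and series.
                "is_part_of": " / ".join(ancestors),
                "containers": {c.get("type"): c.text
                               for c in component.findall("did/container")},
            })
        for child in component.findall("c"):
            walk(child, ancestors + [title])

    for top in root.findall("archdesc/dsc/c"):
        walk(top, [collection])
    return stubs


for stub in stub_records(SAMPLE_EAD):
    print(stub)
```

Because the stub is derived from the finding aid, it can be regenerated whenever the archival description changes, which is one reading of the "continual refresh" item in the list.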
These emerging practices present challenges for potential change to:
- Archival description and practice
- Encoded Archival Description
- Tools for archival description (e.g. Archon and Archivists’ Toolkit)
- Identifier schemes and resolvers
- Search and discovery interfaces for public access to collections
- Search engine optimization
Chicago Underground Library’s Community-Based Cataloging System
- Margaret Heller, Chicago Underground Library/Dominican University (email@example.com)
- Nell Taylor, Chicago Underground Library (firstname.lastname@example.org)
http://www.underground-library.org (until November 15, you will need to add /catalog to see the actual catalog)
We have developed a unique cataloging and discovery system using Drupal, which we eventually hope to provide as a standalone module that any organization can implement as both a technical and theoretical template to start an Underground Library in its own city. Chicago Underground Library (CUL) is a replicable model for community collections. It uses the lens of an archive to examine the creative, political, and intellectual interdependencies of a city, tracing how people have worked together, who influenced whom, where ideas first developed, and how they spread from one publication to another through individuals.
Cataloging is done by members of the community, and so the system is designed to be intuitive for non-librarians. Our indexing method captures every single contributor (authors, editors, typesetters, illustrators, etc.) and catalogers create exhaustive folksonomy lists of subjects so that users can see how publications are linked by threads of influence. Users are able to search all of the individuals and subjects, click on contributors’ names and find everything else they’ve worked on throughout their careers, look on a map at where each publication came from and see what’s been published in their neighborhood, and also provide their own historical notes and additions to any catalog entry. Many of the publications in our collection have incomplete data sets because the people who made them never expected them to wind up in a library. We will be proactively reaching out to people in the community to share their knowledge of different publications in the catalog. For instance, they will contribute stories about where a magazine might have been distributed, who we’re missing from the masthead, where the publisher’s office might have moved to, which publications hosted readings together, etc. Our catalogers will use these contextual comments to glean more metadata for the catalog entry, but will leave up all the comments and anecdotes as part of the record. In effect, we want to create a social network that builds a library catalog, and vice versa.
At Code4Lib, we will present our current system, discuss the challenges we face, and outline our future development plans.
Practical Relevancy Testing
- Naomi Dushay, Stanford University Libraries (ndushay at stanford dot edu)
Evaluating search result relevancy is difficult for any sizable amount of data, since human vetted ideal search results are essentially non-existent. This is true even for library collections, despite dedicated librarians and their familiarity with the collections.
So how can we evaluate if search engine configuration changes (e.g. boosting, field analysis, search analysis settings) are an improvement? How can we ensure the results for query A don’t degrade while we try to improve results for query B?
Why yes, Virginia, automatable tests are the answer.
This talk will show you how you can easily write these tests from your hidden goldmine of human vetted relevancy rankings.
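A minimal sketch of what such automatable tests might look like, assuming a `search` callable that returns an ordered list of record ids for a query. The helper names, the canned results, and the record ids are all invented; in practice `search` would wrap a call to the real search engine, and the expectations would come from librarians' judgments.

```python
def rank_of(doc_id, results):
    """1-based position of doc_id in the results, or None if absent."""
    for position, result_id in enumerate(results, start=1):
        if result_id == doc_id:
            return position
    return None


def assert_in_top(search, query, doc_id, n):
    """Fail unless doc_id appears within the first n results."""
    rank = rank_of(doc_id, search(query))
    assert rank is not None and rank <= n, (
        f"{doc_id!r} ranked {rank} for {query!r}; expected top {n}")


def assert_ranked_before(search, query, first_id, second_id):
    """Fail unless first_id is found and outranks second_id."""
    results = search(query)
    first = rank_of(first_id, results)
    second = rank_of(second_id, results)
    assert first is not None, f"{first_id!r} missing for {query!r}"
    assert second is None or first < second, (
        f"{first_id!r} (rank {first}) should precede "
        f"{second_id!r} (rank {second}) for {query!r}")


# Stand-in for a real search call, returning canned ids.
CANNED_RESULTS = {
    "buddhism": ["rec-7", "rec-2", "rec-9"],
    "string quartets": ["rec-4", "rec-1"],
}


def fake_search(query):
    return CANNED_RESULTS.get(query.lower(), [])


# Expectations for query A and query B live side by side, so a
# configuration change that helps one and hurts the other is caught.
assert_in_top(fake_search, "buddhism", "rec-2", 2)
assert_ranked_before(fake_search, "string quartets", "rec-4", "rec-1")
print("all relevancy checks passed")
```

Run against the live index after every configuration change, checks like these turn each vetted expectation into a regression test.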
LibX 2.0 and the LibX Libapp Builder
- Godmar Back, Virginia Tech, email@example.com
- Brian Nicholson, Virginia Tech, firstname.lastname@example.org
LibX is a platform for delivering library services that require a client-side presence, such as toolbars, context menus, and content scripts. These services integrate library-related resources (links, results, tutorials) into those web pages your users use when they don't go through the library's portals. While LibX 1.5 was mostly used as a toolbar to represent a library's OPAC, LibX 2.0's focus is on simplifying the creation and distribution of content scripts, which we call LibApps. The LibX Edition Builder allows librarians, even those who prefer not to program, to independently create and distribute LibX editions for their user communities.
In a similar vein, the LibX LibApp Builder allows librarians to independently create and manage LibApps for their user communities and share them with others. This talk will discuss the design and implementation of the LibApp Builder. We will also show how LibApps can be created to link the user's web experience to modern discovery systems such as Summon in a smart and non-obtrusive way.
Describing Digital Collections at the Free Library
- Daria Norris, Free Library of Philadelphia, norrisla at freelibrary dot org
The Free Library of Philadelphia has developed a Digital Collections content management system and search engine to describe the scholarly and historical items we are digitizing and making available on our web site. This application has evolved into a highly customizable way of setting up the metadata requirements of each individual collection while also conforming to the Dublin Core standard. The collections are diverse and include scans of medieval manuscripts, historical photographs of Philadelphia, Pennsylvania German fraktur, automobile reference photos and more. Development has also included the integration of authorities like the Getty Thesauri and the LOC's Thesaurus for Graphic Materials in a library that can also be used in other applications. I'll also discuss our future plans for the project.
Lessons from the Hydra Community: cultivating a large, distributed, agile, open source developer network
- Matt Zumwalt, MediaShelf & Hydra Project, matt.zumwalt at yourmediashelf dot com
- Bess Sadler, Stanford University, Hydra Project & Project Blacklight, bess at stanford dot edu
When we set out to create the Hydra framework in 2009, we knew that building a strong developer community would be as important as releasing quality code. By August 2010, when we released the Beta version of Hydrangea (the Hydra reference implementation), Ohloh already rated our committer team as "one of the largest open-source teams in the world" and placed it "in the top 2% of all project teams on Ohloh." [see ohloh.com] In the three months following that release, the number of active committers jumped even higher and the number of subsidiary projects quadrupled. This early success is the product of a concerted, collaborative effort that has incorporated input from many participants and advisors.
Over these first 18 months of work on Hydra, we have cobbled together a formidable list of principles and best practices for developers and for our whole community. Many of these best practices easily translate to any development effort. They are especially applicable to distributed open source teams using agile development methodologies.
Building and sustaining a community is an ongoing learning process. We have already learned a great deal -- most Hydra participants agree that working on this project has made us better at our jobs. We would like to share what we have learned thus far and get feedback about where to go from here.
Opinionated Metadata (OM): Bringing a bit of sanity to the world of XML Metadata
- Matt Zumwalt, MediaShelf & Hydra Project, matt.zumwalt at yourmediashelf dot com
Opinionated Metadata (OM) grew from discussions at Code4Lib 2010. It's now an integral component in the Hydra Framework. Unlike most XML solutions, which start from schemas and build outwards, OM allows you to start from the natural vocabulary that emerges in user stories. Based on the terms that show up in those user stories, you can use OM to create a Terminology that maps each term to nodes in schema-driven XML. This Terminology then serves as a Domain Specific Language (DSL) for your code to rely on. Using that Terminology, you can:
- Generate absolute and relative XPath queries for each term
- Generate complex XPath queries for nested terms (e.g. query a MODS document for the "first name" of the second "person" entry OR query for all of the "person" entries whose "role" is "creator")
- Validate XML documents against a schema (if one is associated with the Terminology)
- Query an XML document for all values corresponding to a given term
- Update the values in an XML document corresponding to a given term
- Insert new nodes corresponding to a given term into an XML document
- Generate Solr field names appropriate for indexing a term
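To make the idea of a Terminology concrete, the following is a language-neutral sketch written in Python with the standard library's `xml.etree.ElementTree`. It is not OM's Ruby API. The `TERMINOLOGY` dictionary, the helper names (`xpath_for`, `values`, `update`, `first_name_of_person`, `people_with_role`), and the simplified, namespace-free sample record are all hypothetical, chosen only to show how term names can stand in for XPath queries.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample record (illustrative, not real MODS).
SAMPLE = """
<record>
  <name><first>Ada</first><last>Lovelace</last><role>creator</role></name>
  <name><first>Charles</first><last>Babbage</last><role>contributor</role></name>
</record>
"""

# Hypothetical terminology: user-story vocabulary mapped to XPath queries
# relative to the record root.
TERMINOLOGY = {
    "person": "./name",
    "first_name": "./name/first",
    "role": "./name/role",
}


def xpath_for(term):
    """Return the XPath query registered for a term."""
    return TERMINOLOGY[term]


def values(root, term):
    """Return the text of every node matching a term."""
    return [node.text for node in root.findall(xpath_for(term))]


def update(root, term, index, new_value):
    """Replace the text of the index-th (0-based) node matching a term."""
    root.findall(xpath_for(term))[index].text = new_value


def first_name_of_person(root, position):
    """Nested query: first name of the position-th (1-based) person."""
    path = f"{xpath_for('person')}[{position}]/first"
    return [node.text for node in root.findall(path)]


def people_with_role(root, role):
    """Nested query: all person entries whose role child equals role."""
    path = f"{xpath_for('person')}[role='{role}']"
    return root.findall(path)
```

For example, after `root = ET.fromstring(SAMPLE)`, calling `values(root, "first_name")` collects every first name, and `first_name_of_person(root, 2)` mirrors the "first name of the second person" query from the list above. Calling code refers only to term names, so a change in the underlying XML structure requires editing the terminology alone.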
OM borrows some characteristics from the XUpdate Language and is in part inspired by XForms. It is also strongly influenced by the agile, user-driven development methodologies of tools like Ruby on Rails. It puts the strengths of these technologies at your disposal in flexible, maintainable ways.
Internally, OM works as an extension to Nokogiri (a complete Ruby wrapper for the libxml2 and libxslt libraries). It gives you access to the full power of those underlying libraries, including a complete XPath implementation, while transparently handling the idiosyncrasies of those libraries and the XPath language for you.
While OM is just a library, it can be used in a web application to create, retrieve, update and delete XML documents. Within Hydra, we have implemented a full stack that uses OM to read XML documents, populate an HTML form, accept updates via a REST API, and update the XML accordingly.