Notes from Open Source Discovery Portal Camp
On 6 November 2008, there was a meeting at the Palinet offices in Philadelphia to discuss the future of open source discovery portals. The VuFind and Blacklight projects had already started cooperating by sharing indexing code in the form of SolrMarc, and as a community we wanted to explore whether there were other ways we could be cooperating, and what our development priorities should be. We discussed the following topics and some people identified themselves as particularly interested in following up on specific topics and doing further work in a given area. Bess took notes, which are pasted here, but please feel free to expand upon these with your own memories of the conversation.
Andrew started by giving us a brief introduction to Jangle, a standard approach to building a toolset for interacting with the ILS. It was kickstarted by the DLF ILS API set. The idea is to create a standard way of interacting with the ILS. Jangle is the first implementation of this, and is planned as the "reference implementation." It's an open source standard approach, and will give us a lot of flexibility.
Gabe points out that the DLF standard and the Jangle standard aren't the same thing exactly, but people seem to agree it's still a good start at standardization. Andrew asks, how do we contribute the VuFind drivers to Jangle? Is there an NCIP driver for Jangle? Xtensible Catalog is using NCIP, for example. One problem with this, though, is that many vendors don't implement NCIP.
Ross Singer is the main developer for Jangle, and he wrote an article about it for the latest issue of the Code4Lib Journal. Everyone's homework is to go read that article, available here. Many institutions are hacking on their own ILS; it would be more efficient if we all shared this code through something like Jangle, which could then be used by VuFind, Blacklight, Helios, or any other project that can talk to Jangle.
Eric Morgan: Jangle is a step in the right direction. DLF came up with a list of API features they want, and then Ross came along and said, here's a simple RESTful implementation of a lot of that API, based on the Atom Publishing Protocol. We need a number of agreed-upon URL shapes that do things like report the status of a book or return authority information for a person. To what degree do we want to use something like Jangle in VuFind? There aren't a lot of choices right now, and this seems like a good project to explore further.
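To make the idea of "agreed-upon URL shapes" concrete, here is a minimal sketch in Python. The base URL and the path layout are hypothetical, invented for illustration; they are not taken from Jangle or from the DLF ILS API recommendation.

```python
from urllib.parse import quote, urljoin

# Hypothetical service root; a real deployment would define its own.
BASE = "https://ils.example.org/api/"


def item_status_url(item_id: str) -> str:
    """Build a URL that asks for the circulation status of one item."""
    # quote(..., safe="") percent-encodes every reserved character,
    # so an identifier can never break out of its path segment.
    return urljoin(BASE, "items/" + quote(item_id, safe="") + "/status")


def person_authority_url(name: str) -> str:
    """Build a URL that asks for authority information about a person."""
    return urljoin(BASE, "authorities/person/" + quote(name, safe=""))


if __name__ == "__main__":
    print(item_status_url("b1234"))
    print(person_authority_url("Twain, Mark"))
```

The point is only that any client (VuFind, Blacklight, or something else) could construct these URLs without knowing which ILS sits behind them.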
What about XC? There's a lot of frustration around this project, because they say they are open source but haven't actually made any of their source available. How do we get them to participate with the larger community? There's a growing community of developers around these issues, and XC should be involved. Eric says someone should have explicitly invited them.
(interested in further development: Bess, Andrew, Gabe)
Non-catalog content / digital repositories
Could we adapt SolrMarc to also include SolrOAI? Yes: Bob, Naomi and Andrew all have ideas about how this could work. Sounds like this is the kernel of our kernel. Solr already has a lot of functionality to allow for this. Do we want a couple of plugins, one for Solr and one for OAI? Or do we want an app that handles both?
Lots of little data silos aren't going to work; we need everything in a local catalog. But that doesn't mean we should all try to be Google. We still need well-defined collection development policies.
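As a rough illustration of what a "SolrOAI" step might do, the sketch below turns one OAI-PMH record carrying unqualified Dublin Core into a flat dictionary of the kind that could be posted to Solr. It is written in Python for brevity and is not SolrMarc code; the output field names (`id`, `title`, `creator`, ...) and the sample record are assumptions. Only the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core 1.1.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:example.org:123</identifier></header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Field notes</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:creator>Roe, Richard</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""


def oai_record_to_doc(record_xml: str) -> dict:
    """Flatten one OAI-PMH <record> into a dict of index fields."""
    root = ET.fromstring(record_xml)
    doc = {}
    ident = root.find(f"{OAI}header/{OAI}identifier")
    if ident is not None and ident.text:
        doc["id"] = ident.text.strip()
    # Every Dublin Core element becomes a multi-valued field
    # named after the element (title, creator, subject, ...).
    for el in root.iter():
        if el.tag.startswith(DC) and el.text and el.text.strip():
            doc.setdefault(el.tag[len(DC):], []).append(el.text.strip())
    return doc


if __name__ == "__main__":
    print(oai_record_to_doc(SAMPLE))
```

Whether this lives in one app alongside the MARC indexer or in a separate plugin is exactly the open question above; the mapping step itself is small either way.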
What about social data? SoPAC is neat, and has an independent layer for saving social data.
We also talked about Blacklight and the ways it brings in various data sources and handles behavior for different kinds of objects, e.g., MusicBrainz data for music items.
(interested in further development: Bob, Dennis, Peter, Naomi, Bess)
Q: How well is SolrMarc handling bad data these days?
Bob: I've been adding more permissive reading and error correction to marc4j. It also reports errors as it finds them, to make it easier to find bad records. There was a request to write to log files instead of standard out. How should we handle records with bad leaders? Naomi has some MARC test data. We need more test-driven development.
Naomi is offering code for parsing OCLC numbers and LC numbers; she'll be working with Bob next week to get that into SolrMarc.
Chris from Villanova is going to do some graphic design work for SolrMarc. Yay!
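One possible shape for the "bad leader" check and the log-file request, sketched in Python rather than in marc4j's Java. This is not marc4j's behavior; the function names and the set of checks are assumptions. It relies only on the MARC 21 leader layout: 24 characters, a numeric record length in positions 00-04, a numeric base address in positions 12-16, and the entry map "4500" in positions 20-23.

```python
import logging


def leader_problems(leader: str) -> list:
    """Return a list of human-readable problems found in a MARC leader."""
    if len(leader) != 24:
        # Nothing else can be checked reliably if the length is wrong.
        return [f"leader is {len(leader)} characters, expected 24"]
    problems = []
    if not leader[0:5].isdigit():
        problems.append("record length (positions 00-04) is not numeric")
    if not leader[12:17].isdigit():
        problems.append("base address of data (positions 12-16) is not numeric")
    if leader[20:24] != "4500":
        problems.append("entry map (positions 20-23) is not '4500'")
    return problems


def make_file_logger(path: str) -> logging.Logger:
    """Create a logger that writes to a file instead of standard out."""
    logger = logging.getLogger("marc_check")  # hypothetical logger name
    logger.setLevel(logging.WARNING)
    logger.addHandler(logging.FileHandler(path))
    return logger


def report(record_id: str, leader: str, logger: logging.Logger) -> bool:
    """Log each leader problem; return True if the leader looks usable."""
    problems = leader_problems(leader)
    for problem in problems:
        logger.warning("record %s: %s", record_id, problem)
    return not problems
```

A permissive reader could call `report` and then decide per record whether to repair, skip, or index anyway, which keeps the policy decision separate from the detection. Checks like these are also easy to cover with test data.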
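For readers unfamiliar with the problem, the sketch below shows the kind of normalization OCLC numbers usually need before they can be matched across records. It is an illustration in Python and is not Naomi's code; the function name is hypothetical, and it handles only the commonly seen "(OCoLC)" qualifier and "ocm" / "ocn" / "on" prefixes. LC number normalization has its own rules and is not covered here.

```python
import re

# Optional "(OCoLC)" qualifier, optional ocm/ocn/on prefix,
# leading zeros dropped, then the digits we want to keep.
_OCLC = re.compile(
    r"^\s*(?:\(OCoLC\))?\s*(?:ocm|ocn|on)?\s*0*(\d+)\s*$",
    re.IGNORECASE,
)


def normalize_oclc(raw: str):
    """Return the bare OCLC number as a string, or None if unparseable."""
    match = _OCLC.match(raw)
    return match.group(1) if match else None


if __name__ == "__main__":
    for value in ["(OCoLC)ocm00012345", "ocn123456789", "not a number"]:
        print(repr(value), "->", normalize_oclc(value))
```

Returning None for unparseable input, rather than raising, fits the permissive-reading approach discussed above: the caller can log the value and move on.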
(Interested in further development: Bob, Naomi, Chris, Bess)
Can we get the LC authority control data, index it locally, and take advantage of that in our searching? Actually getting the authority index data is the problem. It's government monitored data, so why can't we get access to it? We can get snapshots, but there's no method for harvesting it. We need some way to get weekly / monthly updates of authority data. EdSu might have set something up, but it isn't an official service.
Eric says go ahead and implement something, and don't worry about the update method right now. Can we get authority data? Does Open Library have any authority data? Bess will look into this.
"Fred Data" <-- subject authorities
Consensus seems to be that we need a proof of concept first, see how well that scales, and then after that start lobbying LC / OCLC / Palinet / other vendors.
(Interested in further development: Ya'aqov, Daniel, Mark, Bess)
Deduping / FRBR
We need to look at Trish Williams' work with hierarchical Solr records for implementing FRBR. If we start working with this, maybe we can advocate for getting it into the Solr trunk.
Do we really need FRBR or will "other editions" do the trick? xISBN and xISSN from OCLC do a pretty good job, and can be implemented in just a few lines of code. No one understands FRBR anyway.
One user group that could really use FRBR is musicians, to tie together recordings and scores. The Variations project at Indiana is working with this. OCLC also made public their algorithm for FRBRizing work. Some libraries have a problem with digitized items: the digital version has its own catalog entry, so your VuFind system ends up indexing both the digital surrogate and the catalog record about the digital surrogate. You probably want to combine these or suppress
Open Library is also very interested in de-duping research.
MARC format for holdings data (MFHD)
The xISSN service might be helpful for this, too. Bibliographic records for serials should refer to each other.
How to represent this data for users? There's a summary holdings field, a one-line display, and then there's a detailed holding display. There can be multiple screens of lines with this. Summary holdings are pretty easy, detail holdings are hard. Are they necessary?
Maybe we can handle this the way we're doing "composition era" in Blacklight? If we know the range, we can index every value that falls within that range.
You can get an extract of all serial holdings from your OpenURL database (SFX), and harvest your journal holdings through that. Texas A&M is doing this with SFX. This seems like an efficient way of getting detailed holdings. Indexing this might be helpful if you don't have MARC records for all of your electronic holdings, and it also might help for knowing when you have full text online and when you don't.
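The range idea can be sketched as follows, in Python. This is an illustration only, not Blacklight's "composition era" code; the input format ("1995-1998, 2001") is an assumption, and real holdings statements (open-ended runs, volumes, issues) are messier than this handles.

```python
def expand_year_range(start: int, end: int) -> list:
    """Return every year from start to end, inclusive."""
    if end < start:
        raise ValueError(f"range ends before it starts: {start}-{end}")
    return list(range(start, end + 1))


def holdings_years(statement: str) -> list:
    """Expand a simple holdings statement into individual years.

    Accepts comma-separated single years and closed ranges.
    Open-ended ranges such as "1995-" are not handled.
    """
    years = []
    for part in statement.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            years.extend(expand_year_range(int(low), int(high)))
        else:
            years.append(int(part))
    return years


if __name__ == "__main__":
    print(holdings_years("1995-1998, 2001"))
```

Indexing the expanded list as a multi-valued field means a search or facet on any single year matches the record, at the cost of a larger index for long runs.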
(interested in further development: Ya'aqov, Mark)
Federated Search / article content
Can we partner with LibraryFind? Or should we implement an engine like pazpar2, the federated search engine from Index Data?
(Interested in further development: one guy, whose name I didn't catch. Please self identify!)
back-end arch / OSS methods
We never really discussed this, but Peter and Peter said they were interested in following up on this topic.
kernelizing the projects
- solrmarc
- solroai
- authority data & merging
How do we organize? How do we reach out to libraries and formalize a commitment?
John, Dennis, Joe, Andrew, Mark, Daniel