== Notes from Open Source Discovery Portal Camp ==
On 6 November 2008, there was a meeting at the Palinet offices in Philadelphia to discuss the future of open source discovery portals. The [[VuFind]] and [[Blacklight]] projects had already started cooperating by sharing indexing code in the form of [[Solrmarc]], and as a community we wanted to explore whether there were other ways we could be cooperating, and what our development priorities should be. We discussed the following topics and some people identified themselves as particularly interested in following up on specific topics and doing further work in a given area. Bess took notes, which are pasted here, but please feel free to expand upon these with your own memories of the conversation.  
  
 
=== Jangle ===
 
  
Andrew started by giving us a brief introduction to [[Jangle]], a standard approach to building a toolset for interacting with the ILS. It was kickstarted by the DLF ILS API set. The idea is to create a standard way of interacting with the ILS. Jangle is the first implementation of this, and is planned as the "reference implementation." It's an open source standard approach, and will give us a lot of flexibility.  
  
Gabe points out that the DLF standard and the Jangle standard aren't the same thing exactly, but people seem to agree it's still a good start at standardization. Andrew asks, how do we contribute the VuFind drivers to Jangle? Is there an [[NCIP]] driver for Jangle? [http://www.extensiblecatalog.org/ Xtensible Catalog] is using NCIP, for example. One problem with this, though, is that many vendors don't implement NCIP.  
  
Ross Singer is the main developer for Jangle, and he wrote an article about it for the latest Code4Lib Journal. Everyone's homework is to go read that article, available [http://journal.code4lib.org/articles/109 here]. Many institutions are hacking their own ILS; it would be more efficient if we all shared this code through something like Jangle, which could then be used by VuFind, Blacklight, [http://code.google.com/p/fac-back-opac/ Helios], or any other project that could talk to Jangle.  
  
Eric Morgan: Jangle is a step in the right direction. DLF came up with a list of API features they want, and then Ross came along and said here's a simple RESTful implementation of a lot of that API, based on the [[ATOM]] publishing protocol. We need a number of agreed-upon URL shapes that do things like report the status of a book or return authority information for a person. To what degree do we want to use something like Jangle in VuFind? There aren't a lot of choices right now, and this seems like a good project to explore further.  
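Since Jangle responses are Atom feeds, consuming them is mostly feed parsing. A minimal sketch of that step, using a canned feed (the URL shape and field contents here are illustrative assumptions, not the actual Jangle specification):

```python
# Parse a (canned) Atom feed of the kind a Jangle connector might return.
# The feed content and the /connector/items URL shape mentioned below are
# hypothetical illustrations, not the actual Jangle specification.
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# In practice this XML would come from an HTTP GET against something like
# http://example.org/connector/items (assumed URL shape).
SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>items</title>
  <entry>
    <id>http://example.org/connector/items/1234</id>
    <title>The Art of Computer Programming</title>
    <updated>2008-11-06T00:00:00Z</updated>
  </entry>
</feed>"""

def parse_item_feed(xml_text):
    """Return a list of (id, title) tuples from an Atom feed."""
    root = ET.fromstring(xml_text)
    return [(entry.findtext(ATOM_NS + "id"),
             entry.findtext(ATOM_NS + "title"))
            for entry in root.findall(ATOM_NS + "entry")]

print(parse_item_feed(SAMPLE_FEED))
```

Any discovery layer that can do this much XML handling could talk to a Jangle-style endpoint, which is the appeal of standardizing on Atom.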
  
What about [[XC]]? There's a lot of frustration around this project, because they say they are open source but haven't actually made any of their source available. How do we get them to participate with the larger community? There's a growing community of developers around these issues, and XC should be involved. Eric says someone should have explicitly invited them.  
  
(interested in further development: Bess, Andrew, Gabe)
  
 
=== Non-catalog content / digital repositories ===  
 
  
Could we adapt [[SolrMARC]] to also include [[SolrOAI]]? Yes; Bob, Naomi, and Andrew all have ideas about how this could work. Sounds like this is the kernel of our kernel. [[Solr]] already has a lot of functionality to allow for this. Do we want a couple of plugins, one for Solr and one for [[OAI]]? Or do we want an app that handles both?  
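To make the "SolrOAI" idea concrete, here is a minimal sketch of the harvest-side mapping: pull Dublin Core fields out of an OAI-PMH ListRecords response and flatten each record into a Solr-style document. The sample record and the output field names are illustrative assumptions:

```python
# A minimal sketch of the "SolrOAI" idea: map Dublin Core records from an
# OAI-PMH ListRecords response into flat dicts shaped like Solr documents.
# The sample record and field names are illustrative assumptions.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:42</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Digital Object</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def oai_to_solr_docs(xml_text):
    """Turn each OAI record into a dict ready to POST to Solr."""
    root = ET.fromstring(xml_text)
    docs = []
    for rec in root.iter(OAI + "record"):
        doc = {"id": rec.findtext(OAI + "header/" + OAI + "identifier")}
        for el in rec.iter():
            if el.tag.startswith(DC) and el.text:
                doc.setdefault(el.tag[len(DC):], []).append(el.text)
        docs.append(doc)
    return docs

print(oai_to_solr_docs(SAMPLE))
```

The same skeleton works whether this lives inside SolrMARC as a plugin or as a separate harvesting app; only the record-to-document mapping differs per repository.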
 
  
 
Lots of little data silos aren't going to work; we need everything in a local catalog. But that doesn't mean we should all try to be Google. We still need well-defined collection development policies.  
 
  
What about social data? [[SoPAC]] is neat, and has an independent layer for saving social data.  
  
We also talked about [[Blacklight]] and the ways it brings in various data sources and handles behavior for different kinds of objects, e.g., [http://musicbrainz.org/ MusicBrainz] data for music items.  
  
(interested in further development: Bob, Dennis, Peter, Naomi, Bess)
  
 
=== SolrMarc ===  
 
  
Q: How well is SolrMarc handling bad data these days?
  
Bob: I've been adding more permissive reading and error correction to [[marc4j]]. It also reports errors as it finds them, to make it easier to find bad records. There was a request for writing to log files instead of standard out. How should we handle records with bad leaders? Naomi has some MARC test data. We need more test-driven development.  
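As a rough illustration of the checks being discussed (a Python sketch of the idea, not marc4j's actual code), here is a leader validator that logs problems to a file instead of standard out and repairs what it can rather than dying:

```python
# Illustrates the discussion above in miniature: validate a MARC leader,
# log problems to a file instead of standard out, and fall back to a
# repaired value. This is a sketch of the idea, not marc4j's behavior.
import logging

logging.basicConfig(filename="marc_errors.log", level=logging.WARNING)
log = logging.getLogger("marc")

def check_leader(leader, recno):
    """Return a usable 24-character leader, logging any repairs."""
    if len(leader) != 24:
        log.warning("record %d: leader length %d, padding/truncating",
                    recno, len(leader))
        leader = leader.ljust(24)[:24]
    if not leader[:5].isdigit():
        log.warning("record %d: non-numeric record length %r",
                    recno, leader[:5])
        leader = "00000" + leader[5:]
    return leader

fixed = check_leader("abcde22", 1)  # a badly mangled leader
```

Routing repairs through a logger (rather than print) is what makes it practical to batch-load a million records and review the damage afterward.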
 
  
 
Naomi is offering code for parsing OCLC numbers and LC numbers; she'll be working with Bob next week to get that into SolrMarc.  
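For flavor, here is the kind of normalization such parsing code typically does on the OCLC-number side (an illustrative sketch, not the actual code being contributed): collapse the many surface forms found in 035$a to one canonical value so records can be matched across systems.

```python
# A sketch of typical OCLC-number normalization (illustration only, not
# the actual code being contributed): reduce the many surface forms of an
# OCLC number in a MARC 035$a to a single canonical string of digits.
import re

def normalize_oclc(raw):
    """'(OCoLC)ocm00012345' -> '12345'; None if not an OCLC number."""
    m = re.match(r"\(OCoLC\)\s*(?:ocm|ocn|on)?0*(\d+)", raw.strip())
    return m.group(1) if m else None

print(normalize_oclc("(OCoLC)ocm00012345"))  # -> 12345
print(normalize_oclc("(DLC)123"))            # -> None
```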
 
 
Chris from Villanova is going to do some graphic design work for SolrMarc. Yay!  
 
  
(Interested in further development: Bob, Naomi, Chris, Bess)
  
=== Authority control ===
Can we get the LC authority control data, index it locally, and take advantage of that in our searching? Actually getting the authority data is the problem. It's government data, so why can't we get access to it? We can get snapshots, but there's no method for harvesting it. We need some way to get weekly or monthly updates of authority data. EdSu might have set something up, but it isn't an official service.  
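As a tiny illustration of the payoff, locally indexed authority data lets you expand a variant ("see from") heading to the authorized form at search time. The in-memory mapping below stands in for a real authority index built from LC authority records; all data in it is made up:

```python
# One simple payoff of locally indexed authority data: expand a user's
# query from a variant ("see from") heading to the authorized heading.
# This tiny in-memory "index" stands in for a real authority index built
# from LC authority records; the entries are made up for illustration.
SEE_FROM = {
    "clemens, samuel": "Twain, Mark, 1835-1910",
    "twain, mark": "Twain, Mark, 1835-1910",
}

def authorized_heading(query):
    """Map a query to its authorized heading, if we have one."""
    return SEE_FROM.get(query.strip().lower())

print(authorized_heading("Clemens, Samuel"))  # -> Twain, Mark, 1835-1910
```

In a real system the mapping would live in Solr alongside the bib index, so the discovery layer could expand or suggest headings in one extra query.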
  
 
Eric says go ahead and implement something, and don't worry about the update method right now. Can we get authority data? Does Open Library have any authority data? Bess will look into this.  
 
 
"Fred Data" <-- subject authorities
 
  
Consensus seems to be that we need a proof of concept first, see how well that scales, and then after that start lobbying LC / OCLC / Palinet / other vendors.  
  
(Interested in further development: Ya'aqov, Daniel, Mark, Bess)
  
 
=== Deduping / FRBR ===
 
We need to look at Trish Williams' work with hierarchical Solr records for implementing FRBR. If we start working with this, maybe we can advocate for getting it into the Solr trunk.  

Do we really need FRBR, or will "other editions" do the trick? OCLC's xISBN and xISSN services do a pretty good job and can be implemented in just a few lines of code. No one understands FRBR anyway.  

One user group that could really use FRBR is musicians, to tie together recordings and scores. The Variations project at Indiana is working with this. OCLC also made public their algorithm for FRBRizing works. Some libraries have a problem stemming from having digitized versions of an item: the digital version has its own catalog entry, so in your VuFind system you index both the digital surrogate and the catalog record about the digital surrogate. You probably want to combine these or suppress one of them.  
 
Open Library is also very interested in de-duping research.  
 
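The "other editions" approach really is only a few lines. A sketch of the client side follows; the request URL and JSON response shape for OCLC's xISBN service are recalled from memory and should be verified against the service documentation, and the canned response below is purely illustrative:

```python
# Sketch of the "other editions" approach: ask OCLC's xISBN service for
# ISBNs of related editions and group records client-side. The URL shape
# and response shape are assumptions to check against the xISBN docs;
# the canned response stands in for a live HTTP call.
import json

def xisbn_url(isbn):
    # Assumed request shape for the xISBN web service.
    return ("http://xisbn.worldcat.org/webservices/xid/isbn/%s"
            "?method=getEditions&format=json" % isbn)

CANNED_RESPONSE = json.dumps(
    {"stat": "ok",
     "list": [{"isbn": ["0596002815"]}, {"isbn": ["1565928938"]}]})

def edition_isbns(response_text):
    """Pull the flat list of edition ISBNs out of a getEditions response."""
    data = json.loads(response_text)
    if data.get("stat") != "ok":
        return []
    return [i for entry in data.get("list", [])
            for i in entry.get("isbn", [])]

print(edition_isbns(CANNED_RESPONSE))
```

With that list in hand, a discovery layer can collapse all matching records into one "other editions" cluster without any FRBR modeling.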
  
=== Serials holdings ===  
  
MARC Format for Holdings Data (MFHD)
 
xISSN service might be helpful for this, too
 
 
Bibliographic records for serials should refer to each other.  
 
  
How to represent this data for users? There's a summary holdings field, a one-line display, and then there's a detailed holding display. There can be multiple screens of lines with this. Summary holdings are pretty easy, detail holdings are hard. Are they necessary?
  
 
Maybe we can handle this the way we're doing "composition era" in [[Blacklight]]: if we know the range, we can assign a value for every possible value in that range.  
 
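The range trick above can be sketched directly: expand a holdings summary like "1990-1994" into every year it covers, so a plain term query on a single year matches the run. The holdings format here is deliberately simplified; real MFHD summary statements are far messier:

```python
# The "assign values for all possible values of the range" trick, applied
# to a simplified serial holdings statement: expand "1990-1994" into
# every year so a plain term query ("1992") hits the run at index time.
def expand_years(summary):
    """'1990-1994' -> [1990, ..., 1994]; '2001' -> [2001]."""
    start, _, end = summary.partition("-")
    first = int(start)
    last = int(end) if end else first
    return list(range(first, last + 1))

print(expand_years("1990-1994"))  # -> [1990, 1991, 1992, 1993, 1994]
```

The index gets bigger, but range membership becomes an ordinary term match, which Solr handles trivially.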
  
You can get an extract of all serial holdings from your OpenURL database (e.g., SFX) and harvest your journal holdings through that. Texas A&M is doing this with SFX. This seems like an efficient way of getting detailed holdings. Indexing this might be helpful if you don't have MARC records for all of your electronic holdings, and it also might help for knowing when you have full text online and when you don't.  
  
(interested in further development: Ya'aqov, Mark)
=== Federated Search / article content ===  
  
 
Can we partner with LibraryFind? Or should we implement an engine like pazpar2?  
 
IndexData has something called pazpar2, which is a federated search engine.
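As a toy illustration of the record-merging step a federated search engine like pazpar2 performs, the sketch below clusters hits from several targets on a normalized title key. The data and the normalization rule are made up for illustration; real engines use much richer match keys:

```python
# A toy version of the record-merging step a federated search engine
# such as pazpar2 performs: hits from several targets are clustered on a
# normalized title key. Data and normalization rule are illustrative.
import re
from collections import OrderedDict

def merge_key(title):
    """Crude normalization: lowercase, strip punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def merge_results(hits):
    """hits: list of (target, title) pairs -> clusters keyed by merge_key."""
    clusters = OrderedDict()
    for target, title in hits:
        clusters.setdefault(merge_key(title), []).append(target)
    return clusters

hits = [("catalog", "Moby Dick"), ("repository", "Moby-Dick!"),
        ("catalog", "Walden")]
print(merge_results(hits))
```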
 +
 
 +
(Interested in further development: one guy, whose name I didn't catch. Please self identify!)
  
 
=== back-end arch / OSS methods ===  
 
We never really discussed this, but Peter and Peter said they were interested in following up on this topic.  

=== kernelizing the projects ===  

- solrmarc
- solroai
- authority data & merging
  
 
=== How do we organize - How do we reach out to libraries and formalize a commitment? ===
 
John, Dennis, Joe, Andrew, Mark, Daniel
[[Category: Meeting agendas]]
