MARC Problems

Revision as of 19:22, 28 September 2009 by Jrochkind (Talk | contribs) (Things Difficult or Impossible to Do With Our Data)

Things Difficult or Impossible to Do With Our Data

Whenever anyone says "I don't see why there's a problem parsing our AACR2/MARC data, all the data is there", I want a list of things I've tried to do/get with our data and not been able to.

So now I'm going to make a list as I go. Some of these can be blamed on MARC, some on AACR2, some on ISBD, some on ILS software used for maintaining MARC, some on cataloger tradition, or cataloger mistake. But they're all things that just about anyone working with a large quantity of real-world MARC data is going to have trouble doing.

Figure out what an 856 is

Is it a link to full text? A link to a table of contents? Something else? I want my software to know, so it can easily tell the user that full text is available and give them the link. You can sort of, kind of estimate it, but only estimate: http://roytennant.com/proto/856/analysis.html
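To make "sort of, kind of estimate" concrete, here is a minimal sketch of the kind of guessing involved. It assumes the 856 has already been parsed into a second-indicator character and a dict of subfield lists; that representation, the function name `guess_856_kind`, and the phrase lists are all made up for illustration and are not taken from the linked analysis. It looks for tell-tale phrases in $3, $y and $z first, and only then falls back on the second indicator (0 = resource, 1 = version of resource, 2 = related resource), which real records often set unhelpfully.

```python
import re

# Phrases in $3 / $y / $z that usually mean the link is NOT full text.
# These lists are illustrative guesses, not an authoritative vocabulary.
NOT_FULLTEXT = {
    "toc": re.compile(r"table of contents|\bcontents\b", re.I),
    "description": re.compile(
        r"publisher description|summary|abstract|review|biograph", re.I
    ),
}


def guess_856_kind(ind2, subfields):
    """Guess what an 856 link points to.

    ind2      -- second indicator as a one-character string
    subfields -- dict mapping subfield code to a list of values
                 (a hypothetical, simplified record representation)
    """
    # Collect the free-text subfields where catalogers describe the link.
    notes = " ".join(
        value for code in ("3", "y", "z") for value in subfields.get(code, [])
    )
    for kind, pattern in NOT_FULLTEXT.items():
        if pattern.search(notes):
            return kind
    # No tell-tale phrase: fall back on the indicator, which is only a hint.
    if ind2 in ("0", "1"):
        return "probably_fulltext"
    if ind2 == "2":
        return "related"
    return "unknown"


print(guess_856_kind("1", {"3": ["Table of contents"],
                           "u": ["http://example.org/toc"]}))
print(guess_856_kind("0", {"u": ["http://example.org/book"]}))
```

Even when this returns "probably_fulltext", it is still a guess: nothing in the field says whether the user can actually read the text at the other end of the link.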

Format field 505 contents

505 notes are really hard to read all mashed together. I'd like to list them one entry per line. But it's very difficult to tell where one entry begins and another ends. Sure, you can split on "--" (and you need to split on '--' even in so-called 'formatted' contents notes), but that's not foolproof. Sometimes a '.' or a ';' splits an entry -- but sometimes it doesn't, and is internal to an entry. There is no good algorithm.
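For illustration, the naive "split on '--'" approach looks something like the sketch below. The function name `split_505_naive` and the sample strings are invented for the example. It works on tidy notes and silently gets the untidy ones wrong, which is the whole problem.

```python
import re


def split_505_naive(contents):
    """Split a 505 contents string into entries on '--'.

    This is the naive approach only: it cannot tell a ';' or '.' that
    separates two entries from one that is internal to a single entry.
    """
    parts = re.split(r"\s*--\s*", contents)
    # Trim surrounding spaces and periods, and drop empty pieces.
    return [p.strip(" .") for p in parts if p.strip(" .")]


tidy = "The cold equations / Tom Godwin -- Nightfall / Isaac Asimov."
print(split_505_naive(tidy))

# Here ';' separates two entries within part 1, but the splitter
# has no way to know that and returns them as a single entry.
messy = "pt. 1. Intro ; Methods -- pt. 2. Results"
print(split_505_naive(messy))
```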

Figure out if my library holds a particular volume and issue of a serial

This is a clear user need: when I know a user is interested in a particular volume/issue, I'd like to be able to tell them whether we hold it. I guess MFHD is _theoretically_ capable of expressing this, but hardly anyone's ILS is actually going to produce anything that can be machine-interpreted. And I'm not even sure MFHD can express it. If you think the ISBD-like standard for using punctuation and such to express 'runs' counts, forget about it: that doesn't really result in unambiguous machine-parseable statements even when the people entering it don't make mistakes, which they do.
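As a rough sketch of why punctuation-based run statements don't get you there, here is a parser for only the most well-behaved case, something like 'v.1-5, v.7'. The statement format, the function name `volumes_held`, and the sample strings are assumptions made up for the example; real holdings statements vary far more than this. Anything it doesn't recognize makes it give up and return None rather than guess.

```python
import re

# One segment such as 'v.3', 'v.1-5' or 'v.1-v.5' (an assumed, simplified form).
SEGMENT = re.compile(r"\s*v\.\s*(\d+)(?:\s*-\s*(?:v\.\s*)?(\d+))?\s*")


def volumes_held(statement):
    """Return the set of volume numbers in a simple textual run statement,
    or None if any comma-separated segment is not understood."""
    held = set()
    for segment in statement.split(","):
        match = SEGMENT.fullmatch(segment)
        if match is None:
            return None  # refuse to guess
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        held.update(range(start, end + 1))
    return held


print(volumes_held("v.1-5, v.7"))
print(volumes_held("v.1-5 (1990-1995), some issues missing"))
print(volumes_held("v.12-"))
```

The first call works; the other two return None, because a date range, a free-text caveat, or an open-ended run is already more than this simple pattern can interpret. And this only covers volumes, not issues.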

Series title: what transcribed series name goes with which controlled series name?

Without being able to figure out which transcribed name (490) corresponds to which controlled series name (8xx), it's pretty much impossible to create a display that makes any sense, doesn't list the same series twice (once with the transcribed name and once with the controlled name), and supports collocation properly. See more at: http://bibwild.wordpress.com/2009/09/24/a-reasonable-display-for-series-data-in-marc
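Since the record itself carries no link between a 490 and an 8xx, about the best software can do is guess from string similarity. The sketch below is one hypothetical way to do that with the standard library's `difflib`; the function names, the 0.8 threshold, and the sample titles are all assumptions for illustration, and this is not the approach described in the linked post. It pairs titles that look nearly identical after normalization and leaves everything else unpaired.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(title.split())


def pair_series(transcribed, controlled, threshold=0.8):
    """Guess which transcribed (490) title goes with which controlled (8xx)
    title. Returns (pairs, leftover_controlled); a pair's second element is
    None when no controlled title is similar enough."""
    pairs = []
    leftover = list(controlled)
    for t in transcribed:
        best, best_score = None, 0.0
        for c in leftover:
            score = SequenceMatcher(None, normalize(t), normalize(c)).ratio()
            if score > best_score:
                best, best_score = c, score
        if best is not None and best_score >= threshold:
            pairs.append((t, best))
            leftover.remove(best)  # each controlled title is used at most once
        else:
            pairs.append((t, None))
    return pairs, leftover


pairs, leftover = pair_series(
    ["Lecture notes in computer science", "Penguin classics"],
    ["Lecture notes in computer science.", "Oxford world's classics"],
)
print(pairs)
print(leftover)
```

This only helps when the transcribed and controlled forms happen to resemble each other. When the controlled form is a different title, an abbreviation, or a name/title heading, the similarity score says nothing useful, and you are back to listing the series twice.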

Contents : transcribed vs. controlled

In a problem similar to the one with series titles, contents ('analytics') may exist in a transcribed form in a 505, and in a controlled form in a 7xx. But there's no way to know which 505 entry corresponds to which 7xx entry. (In fact, it's hard to even know where one 'entry' begins and ends in a 505! See above.)

It would be nice to list the contents only once, not twice, with each entry showing both the transcribed form and the controlled form (linked to a search). But this cannot be done.

False economies

Obviously, a single cataloger can't be expected to spend hours recording every possible bit of metadata that could apply to a work, nor can most libraries provide the level of funding to afford that. However, it ought to be possible to distribute the work and incrementally improve our metadata. Bibliographic utilities could be a nexus for that sort of work, yet valuable information still gets left out of many records.

Tables of contents and composite works

Much science fiction, to give an example, exists as short stories that end up in libraries as anthologies. Without even a 505, it's impossible for a patron to see if the library has 'The Cold Equations' by Tom Godwin.