MARC Problems

From Code4Lib

Things Difficult or Impossible to Do With Our Data

Whenever anyone says "I don't see why there's a problem parsing our AACR2/MARC data, all the data is there", I want a list of things I've tried to do/get with our data and not been able to.

So now I'm going to make a list as I go. Some of these can be blamed on MARC, some on AACR2, some on ISBD, some on the ILS software used for maintaining MARC, and some on cataloger tradition or cataloger mistakes. But they're all things that just about anyone working with a large quantity of real-world MARC data is going to have trouble doing.

Figure out what an 856 is

Is it a link to full text? A link to a table of contents? Something else? I want my software to know, so it can easily tell the user that full text is available and give them the link. You can sort of estimate it; see http://roytennant.com/proto/856/analysis.html
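As a rough illustration of what "estimating" looks like, here is a minimal sketch. It assumes you have already pulled out the 856 second indicator and the text of any $3/$z notes. The function name, the hint phrases, and the return labels are all made up for this example, and the rules are guesses rather than anything the MARC standard promises.

```python
# Hypothetical heuristic: guess what an 856 link points to.
# The hint phrases below are assumptions based on commonly seen note text,
# not values defined by MARC.
OTHER_HINTS = ("publisher description", "sample text", "finding aid", "cover image")

def guess_856_kind(ind2, notes):
    """ind2: second indicator as a 1-char string; notes: list of $3/$z strings."""
    text = " ".join(notes).lower()
    if "table of contents" in text:
        return "toc"
    if any(hint in text for hint in OTHER_HINTS):
        return "other"
    # Second indicator 0 ("resource") or 1 ("version of resource") suggests,
    # but does not guarantee, that the link is the thing itself.
    if ind2 in ("0", "1"):
        return "fulltext?"
    return "unknown"

print(guess_856_kind("1", []))
print(guess_856_kind("2", ["Table of contents only"]))
print(guess_856_kind(" ", []))
```

Even in this toy version, the best answer available for full text is a "fulltext?" with a question mark on it, which is the whole problem.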

Format field 505 contents

505 notes are really hard to read all mashed together. I'd like to list them one entry per line. But it's very difficult to tell where one entry begins and another ends. Sure, you can split on "--" (and you need to split on '--' even in so-called 'formatted' contents notes), but that's not foolproof. Sometimes a '.' or a ';' splits an entry, but sometimes it doesn't and is internal to an entry. There is no good algorithm.
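Here is a minimal sketch of the naive "--" split, assuming the 505 text has already been joined into one string. The function name and both sample notes are invented for illustration.

```python
import re

def split_505(note):
    """Split a contents note on '--', trimming whitespace around each piece."""
    parts = re.split(r"\s*--\s*", note)
    return [p.strip() for p in parts if p.strip()]

# An invented note where the split happens to work.
easy = "The cold equations / Tom Godwin -- Nightfall / Isaac Asimov."
print(split_505(easy))

# An invented note where ';' may or may not separate two titles.
# The split cannot tell, so it returns two pieces either way.
hard = "pt. 1. Beginnings ; Middles -- pt. 2. Ends."
print(split_505(hard))
```

The second note comes back as two pieces whether "Beginnings ; Middles" is one title or two, and nothing in the data says which reading is right.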

Figure out if my library holds a particular volume and issue of a serial

This is a clear user need: when I know a user is interested in a particular volume/issue, I'd like to be able to tell them whether we hold it. I guess MFHD is _theoretically_ capable of expressing this, but hardly anyone's ILS is actually going to produce anything that can be machine-interpreted. And I'm not even sure MFHD can express it. If you think the ISBD-like standard for using punctuation to express 'runs' counts, forget about it: that doesn't result in unambiguous machine-parseable statements even when the people entering it don't make mistakes, which they do.

Series title: what transcribed series name goes with which controlled series name?

Without being able to figure out what transcribed name (490) corresponds with which controlled series name (8xx), it's pretty impossible to figure out how to create a display which makes any sense, and doesn't list the same series twice (once with transcribed name and once with controlled name), and supports collocation properly. See more at: http://bibwild.wordpress.com/2009/09/24/a-reasonable-display-for-series-data-in-marc

Contents : transcribed vs. controlled; analytic in the first place?

As with series titles, contents ('analytics') may exist in a transcribed form in a 505, and in a controlled form in a 7xx. But there's no way to know which 505 entry corresponds to which 7xx entry. (In fact, it's hard to even know where one 'entry' begins and ends in a 505! See above.)

It would be nice to list the contents only once, not twice, showing each entry with both its transcribed form and its controlled form (linked to a search). But this cannot be done.

A related problem is that there's no good way to be sure if a 7xx IS an analytic in the first place. If the second indicator is 2, it is. If the second indicator is blank, there's no way to tell. And oddly, it's not even possible for the second indicator to say "definitely is not an analytic".

Display of 7xx in general

In addition to the above problem with 7xx analytics in particular, 7xx's pose an even more general problem. They might be a uniform title for the work cataloged (I can't figure out in what circumstances they'd be this, since normally the uniform title is in a 130 or 240, but apparently sometimes it's in a 730. Serials?). They might be a 'citation' (controlled heading) to a related work. They might be an analytic. They might be something else entirely. Is there any way to know what it is? Got me. Without knowing what it is, it's hard to display it appropriately for the user.

How can I know if a 700 is an analytic, or just a related work? Can I even call a person listed in a 700 a "contributor" if it's got a $t? It might not be a contributor at all, but just the author of a related work.

How do I know if a 730 is a uniform title for this work, or a related work?
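To make the 7xx ambiguity concrete, here is a minimal sketch of everything software can actually decide from the second indicator and the presence of a $t. The function name and the three labels are hypothetical; the only MARC facts it relies on are that second indicator 2 means an analytical entry and blank means no information provided.

```python
def classify_7xx(tag, ind2, subfield_codes):
    """tag: e.g. '700'; ind2: second indicator; subfield_codes: codes present, e.g. 'adt'."""
    if ind2 == "2":
        return "analytic"
    # 730/740 are title fields; a 700/710/711 names a work only if it has $t.
    names_a_work = tag in ("730", "740") or "t" in subfield_codes
    if names_a_work:
        # Could be an analytic, a related work, or a uniform title: no way to tell.
        return "work-unknown-role"
    # No title: possibly a contributor, but even that is a guess.
    return "name-only"

print(classify_7xx("700", "2", "adt"))
print(classify_7xx("700", " ", "adt"))
print(classify_7xx("730", " ", "a"))
print(classify_7xx("700", " ", "ad"))
```

Note that there is no branch that returns "definitely not an analytic", because the data gives no way to reach that conclusion.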

Figuring out 'form' of work in general

At first you see that maybe you can do this with an 006. Look, it's got a first byte that tells you if it's a 'book', 'musical recording', 'non-musical recording', etc. But then you realize that most records don't have 006's. What are the rules for what record will have an 006 and what won't? I have no idea where to find them.

Where in the MARC record does it tell you this sort of thing? Seems to be spread out across a buncha different fields, with what fields will be present in a given record kind of hard to predict.

Telling the user if something is a 'book' or a 'musical recording' is a pretty useful thing to do, no? I believe it's possible from MARC, but it's sure not simple, and I'm not sure because I haven't figured it out yet.

Okay, wait, I didn't want the 006 at all, I really wanted the 008. Okay... The 008 tells me the bytes mean something different depending on whether the record is a "Book", "Music", "Computer File", etc. Where do I tell which it is, to see what the bytes mean? Still working on it.
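For what it's worth, the MARC 21 bibliographic format documentation ties the 008 layout to two bytes of the Leader: position 06 (type of record) and, for language material, position 07 (bibliographic level). A minimal sketch of that lookup follows, assuming the leader is available as a 24-character string. The function name is made up, and real records may carry codes this table doesn't expect.

```python
# Sketch of the Leader/06 + Leader/07 lookup described in MARC 21.
# Returns the name of the 008 configuration, or None if the codes don't match.
def material_config(leader):
    rec_type, bib_level = leader[6], leader[7]
    if rec_type == "a" and bib_level in ("b", "i", "s"):
        return "Continuing resources"
    if rec_type in ("a", "t") and bib_level in ("a", "c", "d", "m"):
        return "Books"
    if rec_type in ("c", "d", "i", "j"):
        return "Music"  # covers notated music AND sound recordings
    if rec_type in ("e", "f"):
        return "Maps"
    if rec_type in ("g", "k", "o", "r"):
        return "Visual materials"
    if rec_type == "m":
        return "Computer files"
    if rec_type == "p":
        return "Mixed materials"
    return None

print(material_config("00000nam a2200000 a 4500"))
print(material_config("00000njm a2200000 a 4500"))
print(material_config("00000nas a2200000 a 4500"))
```

Even then, the "Music" configuration lumps scores together with recordings, so telling the user "this is a musical recording" means looking at Leader/06 itself (j for musical sound recording, i for nonmusical) rather than at the configuration name.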

MARC and Cooperative Cataloging: False Economies

Obviously, a single cataloger can't be expected to spend hours recording every possible bit of metadata that could apply to a work, nor can most libraries provide the level of funding to afford that. However, it ought to be possible to distribute the work and incrementally improve our metadata. Bibliographic utilities could be a nexus for that sort of work, yet valuable information often gets left out of many records.

Tables of contents and composite works

Much science fiction, to give an example, exists as short stories that end up in libraries in anthologies. Without even a 505, it's impossible for a patron to see if the library has 'The Cold Equations' by Tom Godwin.