Parsing Library Data
The legacy data that libraries must deal with is often challenging to parse algorithmically. MARC is just the first layer: once you peel that back, you find an elaborate mishmash of elements, each with its own idiosyncrasies. This page is a place for the Code4lib community to track and share information, problems, methodologies, code, pseudo-code, etc. about the nuts-and-bolts parsing of legacy library data.
Identifiers
Library of Congress Control Number
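As a starting point, here is a minimal Python sketch of LCCN normalization, assuming the input is the raw string as it tends to appear in records (normally MARC 21 field 010) and following the Library of Congress normalization rules as commonly described: drop whitespace, drop a slash and everything after it, and zero-pad the serial number after a hyphen to six digits. The function name `normalize_lccn` is illustrative, and the sketch does not check that the result is a well-formed LCCN.

```python
def normalize_lccn(raw):
    """Return a normalized LCCN string, or None if the hyphenated part is malformed."""
    # Remove all whitespace, including padding blanks inside the field.
    s = "".join(raw.split())
    # Drop a forward slash and everything to its right (revision/suffix data).
    s = s.split("/", 1)[0]
    if "-" in s:
        left, right = s.split("-", 1)
        # The serial part must be one to six ASCII digits.
        if not (right.isascii() and right.isdigit()) or len(right) > 6:
            return None
        # Left-fill the serial part with zeros to six digits.
        s = left + right.zfill(6)
    return s


if __name__ == "__main__":
    for sample in ["n78-890351", "85-2", "75-425165//r75", " 79139101 /AC/r932"]:
        print(repr(sample), "->", normalize_lccn(sample))
```

Normalizing before comparison lets the hyphenated and unhyphenated forms of the same number match.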
OCLC Control Number
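OCLC numbers tend to show up with different prefixes depending on where they were copied from, for example "(OCoLC)" in an 035 $a, or "ocm", "ocn" and "on" in an 001. The sketch below assumes those are the only prefixes of interest and reduces each form to the bare number without leading zeros. The name `normalize_oclc` is made up for this example.

```python
import re

# Optional "(OCoLC)" prefix, optional ocm/ocn/on prefix, optional leading
# zeros, then the number itself. Anything else in the string is a mismatch.
OCLC_RE = re.compile(
    r"^\s*(?:\(OCoLC\))?\s*(?:ocm|ocn|on)?\s*0*([1-9][0-9]*)\s*$",
    re.IGNORECASE,
)


def normalize_oclc(raw):
    """Return the bare OCLC number as a string of digits, or None if unrecognized."""
    m = OCLC_RE.match(raw)
    return m.group(1) if m else None


if __name__ == "__main__":
    for sample in ["(OCoLC)ocm00012345", "ocn123456789", "on1234567890"]:
        print(sample, "->", normalize_oclc(sample))
```

Returning None for unrecognized input, rather than guessing, keeps identifiers from other systems from being mistaken for OCLC numbers.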
ISBN
MARC 21 Field(s): 020
- Problems with parsing in MARC
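One common problem is that 020 $a frequently holds more than the number itself, e.g. a qualifier such as "(pbk.)" or trailing punctuation. A rough Python sketch of one way to cope: take the leading run of digits and hyphens, then accept it only if the ISBN-10 or ISBN-13 check digit validates. The function names are illustrative, and the sketch assumes the ISBN comes first in the subfield.

```python
import re


def is_valid_isbn10(s):
    """Check an unhyphenated ISBN-10 (weights 10..1, sum divisible by 11)."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """Check an unhyphenated ISBN-13 (alternating weights 1 and 3, sum divisible by 10)."""
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(s))
    return total % 10 == 0


def extract_isbn(subfield_a):
    """Return the validated ISBN at the start of an 020 $a value, or None."""
    m = re.match(r"\s*([0-9][0-9-]*[0-9Xx])", subfield_a)
    if not m:
        return None
    candidate = m.group(1).replace("-", "").upper()
    if is_valid_isbn10(candidate) or is_valid_isbn13(candidate):
        return candidate
    return None


if __name__ == "__main__":
    print(extract_isbn("0306406152 (pbk.) :"))
    print(extract_isbn("978-0-306-40615-7 (hardcover : alk. paper)"))
```

Validating the check digit filters out prices, volume counts and other stray numbers that sometimes end up in the field.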
ISSN
MARC 21 Field(s): 022
- Problems with parsing in MARC
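ISSNs in 022 can be handled much the same way. The sketch below, with the hypothetical name `normalize_issn`, looks for eight characters in the NNNN-NNNC pattern (hyphen optional), verifies the check digit (weights 8 down to 2, with "X" standing for 10) and returns the hyphenated form. It assumes at most one ISSN per string.

```python
import re

# Four digits, optional hyphen, three digits, then a digit or X check character.
ISSN_RE = re.compile(r"\b(\d{4})-?(\d{3}[\dXx])\b")


def normalize_issn(raw):
    """Return the first valid ISSN in raw as 'NNNN-NNNC', or None."""
    m = ISSN_RE.search(raw)
    if not m:
        return None
    digits = (m.group(1) + m.group(2)).upper()
    # Weights 8..2 apply to the first seven digits; the check character has weight 1.
    total = sum(w * int(ch) for w, ch in zip(range(8, 1, -1), digits[:7]))
    total += 10 if digits[7] == "X" else int(digits[7])
    if total % 11 != 0:
        return None
    return digits[:4] + "-" + digits[4:]


if __name__ == "__main__":
    for sample in ["0378-5955", "ISSN 2434-561x (online)", "03785955"]:
        print(sample, "->", normalize_issn(sample))
```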
Dewey Decimal Call Number
MARC 21 Field(s): 082
Library of Congress Call Number
MARC 21 Field(s): 050
Personal Names
MARC 21 Field(s): (Name Headings) 100 600 700 800; Also X00 - Personal Names-General Information (Uncontrolled Names) 245$c 505$r 511 720
- Parsing name parts
- Identifying the name of a person that played a particular role
Corporate Names
MARC 21 Field(s): 110 610 710 810; Also X10 - Corporate Names-General Information
- Distinguishing corporate names from personal names
Titles
MARC 21 Field(s): (Transcribed Titles) 245 505$t (Alternate Titles) 210 222 242 246 247 (Uniform Titles) 130 240 243 630 730 830; Also X30 - Uniform Titles-General Information
- Normalization
- Matching techniques
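A simple baseline for title normalization, offered as a sketch rather than a recommended method: skip the nonfiling characters (the count carried in the second indicator of 245), strip diacritics, fold case, replace punctuation with spaces and collapse whitespace. The function name `normalize_title` and its `nonfiling` parameter are illustrative, and the rules are an assumption about what is useful for matching, not an established standard.

```python
import re
import unicodedata


def normalize_title(title, nonfiling=0):
    """Return a crude match key for a title string."""
    # Skip leading articles using the nonfiling character count, if supplied.
    s = title[nonfiling:]
    # Decompose accented characters, then drop the combining marks.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Fold case, turn punctuation into spaces, collapse runs of whitespace.
    s = s.casefold()
    s = re.sub(r"[^\w\s]", " ", s)
    return re.sub(r"\s+", " ", s).strip()


if __name__ == "__main__":
    print(normalize_title("The Tempest /", nonfiling=4))
    print(normalize_title("Les Misérables."))
```

Keys built this way can be compared for exact equality or fed into a fuzzier comparison. How aggressive the normalization should be depends on the matching task.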
Subject Headings
URLs
MARC 21 Field(s): 856