Getting Started with Zebra

From Code4Lib
Revision as of 15:47, 25 February 2009 by Edward Vielmetti (Talk | contribs)

This page outlines how to index (and search) MARC records using Zebra. Tweaking the indexing process beyond these basics is trickier than I know how to do.

1. Install yaz, zebra, and all of their friends. I have found that the "standard" make process works pretty well. Let yaz and zebra decide where to put their various configuration files; specifying the locations yourself is not worth the effort.

2. Save your MARC records someplace on your file system. By "binary" MARC records, I suppose you mean "real" MARC records: records in communications format, the type of record fed to traditional integrated library systems. This is as opposed to some flavor of XML or the "tagged format" often used for display.
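
If you are not sure whether a file holds communications-format records, a quick sanity check is possible. The following is a minimal Python sketch, assuming each record starts with a five-digit total length and ends with the byte 0x1D, as ISO 2709 describes. The helper fake_record is hypothetical and builds toy data for the demonstration, not valid MARC.

```python
import io

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def count_marc_records(stream):
    """Count records in a binary stream using the 5-digit length that opens each record."""
    count = 0
    while True:
        length_field = stream.read(5)
        if not length_field:
            return count  # clean end of input
        if len(length_field) < 5 or not length_field.isdigit():
            raise ValueError(f"record {count + 1}: bad length field {length_field!r}")
        length = int(length_field)
        if length <= 5:
            raise ValueError(f"record {count + 1}: implausible length {length}")
        body = stream.read(length - 5)
        if len(body) != length - 5 or not body.endswith(RECORD_TERMINATOR):
            raise ValueError(f"record {count + 1}: truncated or unterminated")
        count += 1


def fake_record(payload):
    """Hypothetical helper: 5-digit total length + payload + terminator (toy data only)."""
    body = payload + RECORD_TERMINATOR
    return f"{len(body) + 5:05d}".encode("ascii") + body


sample = fake_record(b"first") + fake_record(b"second record")
print(count_marc_records(io.BytesIO(sample)))  # prints 2
```

Against a real file you would pass open(path, "rb") instead of the BytesIO object. An XML or tagged-text file fails on the very first length field.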

3. Create a zebra.cfg file, and have it look something like this:

 # global paths
 profilePath: .:./etc:/usr/local/share/idzebra-2.0/tab
 modulePath: /usr/local/lib/idzebra-2.0/modules
 
 # turn ranking on
 rank: rank-1
 
 # define a database of marc records called opac
 opac.database: opac
 opac.recordtype: grs.marcxml.marc21
 attset: bib1.att
 attset: explain.att
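
To see what such a file says at a glance, here is a minimal Python sketch that reads the "name: value" lines into a dictionary and splits profilePath on colons. It is only an illustration of the file's layout as shown above; it is not how Zebra itself parses the file, and parse_zebra_cfg is a name made up for this example.

```python
def parse_zebra_cfg(text):
    """Collect 'name: value' lines, skipping blanks and # comments.

    A name may repeat (attset does), so each name maps to a list of values.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(":")
        settings.setdefault(name.strip(), []).append(value.strip())
    return settings


cfg = """
# global paths
profilePath: .:./etc:/usr/local/share/idzebra-2.0/tab
modulePath: /usr/local/lib/idzebra-2.0/modules
rank: rank-1
opac.database: opac
opac.recordtype: grs.marcxml.marc21
attset: bib1.att
attset: explain.att
"""

settings = parse_zebra_cfg(cfg)
profile_dirs = settings["profilePath"][0].split(":")
print(profile_dirs)        # ['.', './etc', '/usr/local/share/idzebra-2.0/tab']
print(settings["attset"])  # ['bib1.att', 'explain.att']
```

The list of profilePath directories is worth checking by hand, since files such as pqf.properties are expected to live in one of them.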

4. Index your MARC records with the following command. You should see lots of great stuff sent to STDOUT.

 zebraidx -g opac update <path to MARC records>

You have now created your index. Once you get this far with indexing, you will want to tweak various .abs files (I think) to enhance the indexing process. This particular thing is not my forte. It seems like black magic to most of us. This is not a Zebra-specific problem; this is a problem with Z39.50.

Next, you need to implement the client/server end of things:

5. Start your server. This will be a Z39.50 server -- a "kewl" library-centric protocol that existed before the Internet got hot:

 zebrasrv localhost:9999 &

6. Use yaz-client to search your index:

 $ yaz-client
 Z> open localhost:9999/opac
 Z> find origami
 Z> show 1
 Z> quit

Using yaz-client almost requires a knowledge of Z39.50. Appendix A contains a Perl script that lets you search your server in a somewhat more user-friendly way. To use it you will need to install a few Perl modules (MARC::Record and ZOOM) and then edit the constant called DATABASE to point at your own server.

Even though Z39.50 is/was "kewl" it is still pretty icky. SRU is better -- definitely a step in the right direction, and Zebra supports SRU out of the box. [1]

7. Create an SRU configuration file named sru.cfg (the name used in step 10), looking something like this:

 <yazgfs>
   <server>
     <config>zebra.cfg</config>
     <cql2rpn>pqf.properties</cql2rpn>
   </server>
 </yazgfs>
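
A typo in this file is easy to make, so it can help to confirm that it is well-formed XML and to list the two files it points at. The following is a minimal Python sketch under that assumption; referenced_files is a name invented for this example, and the check says nothing about whether Zebra will accept the file.

```python
import xml.etree.ElementTree as ET

SRU_CFG = """<yazgfs>
  <server>
    <config>zebra.cfg</config>
    <cql2rpn>pqf.properties</cql2rpn>
  </server>
</yazgfs>"""


def referenced_files(xml_text):
    """Return the file names named inside <server>, keyed by element name."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    server = root.find("server")
    return {child.tag: child.text.strip() for child in server}


print(referenced_files(SRU_CFG))
# {'config': 'zebra.cfg', 'cql2rpn': 'pqf.properties'}
```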

8. Acquire a "better" pqf.properties file. PQF is the prefix query format used to query Z39.50 databases. It is ugly. It was designed in a non-Internet world. Instead of knowing that 1=4 means search the title field, you want to simply search the title. Appendix B contains a "better" pqf.properties file, and it is "better" because it lets you use Dublin Core names such as dc.title in place of things like 1=4. Save it in a directory called etc in the same directory as your zebra.cfg file. (Notice how the zebra.cfg file, above, denotes etc as being in zebra's path.)
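
To make the mapping concrete, here is a minimal Python sketch of the index lookup only. It reads "index." lines in the style of Appendix B and turns a Dublin Core index name plus a term into a PQF string. The function names are made up for this example, and the real converter in YAZ does much more (relations, structure, position, truncation and the "always" attributes), so treat this as an illustration rather than a description of what Zebra does internally.

```python
def load_index_map(properties_text):
    """Collect 'index.<set>.<name> = <attributes>' lines, keyed by '<set>.<name>'."""
    index_map = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("index."):
            key, _, attrs = line.partition("=")  # split at the first '=' only
            index_map[key.strip()[len("index."):]] = attrs.strip().split()
    return index_map


def to_pqf(index_map, index, term):
    """Prefix a single-word term with one @attr per attribute of the chosen index."""
    attrs = " ".join(f"@attr {a}" for a in index_map[index])
    return f"{attrs} {term}"


# Three lines copied from the pqf.properties file in Appendix B.
sample = """
index.dc.title   = 1=title 2=102
index.dc.creator = 1=1003 2=102
index.bath.isbn  = 1=7
"""

index_map = load_index_map(sample)
print(to_pqf(index_map, "dc.title", "origami"))  # @attr 1=title @attr 2=102 origami
print(to_pqf(index_map, "bath.isbn", "1234567890"))  # @attr 1=7 1234567890
```

The first output line is the kind of query you would otherwise have to type into yaz-client yourself.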

9. Kill your presently running Z39.50 server.

10. Start up an SRU server:

 zebrasrv -f sru.cfg localhost:9999 &

11. Use your HTTP client to search the SRU server. Queries will look like this:

 http://localhost:9999/opac?operation=searchRetrieve&version=1.1&query=origami&maximumRecords=5
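
Queries containing spaces, quotes or equals signs need to be URL-encoded. Here is a minimal Python sketch that builds such URLs with the standard library, assuming the same host, port and database as above; sru_url is a name invented for this example.

```python
from urllib.parse import urlencode


def sru_url(base, query, maximum_records=5, version="1.1"):
    """Build a searchRetrieve URL, URL-encoding the CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": query,
        "maximumRecords": maximum_records,
    }
    return f"{base}?{urlencode(params)}"


print(sru_url("http://localhost:9999/opac", "origami"))
# http://localhost:9999/opac?operation=searchRetrieve&version=1.1&query=origami&maximumRecords=5

print(sru_url("http://localhost:9999/opac", 'dc.title = "paper folding"'))
# http://localhost:9999/opac?operation=searchRetrieve&version=1.1&query=dc.title+%3D+%22paper+folding%22&maximumRecords=5
```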

The result should be a stream of XML ready for XSLT processing.
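
If you would rather inspect the response in a script than run it through XSLT, here is a minimal Python sketch. It assumes the SRU 1.1 response namespace http://www.loc.gov/zing/srw/ and works on a hand-written, heavily trimmed sample response (no record data), not on output captured from a real server.

```python
import xml.etree.ElementTree as ET

SRW = "{http://www.loc.gov/zing/srw/}"  # SRU 1.1 response namespace

# Hand-written, trimmed sample; a real response also carries recordData etc.
SAMPLE_RESPONSE = """<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:version>1.1</zs:version>
  <zs:numberOfRecords>2</zs:numberOfRecords>
  <zs:records>
    <zs:record><zs:recordPosition>1</zs:recordPosition></zs:record>
    <zs:record><zs:recordPosition>2</zs:recordPosition></zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""


def summarize(xml_text):
    """Return (total hit count, positions of the records actually returned)."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext(f"{SRW}numberOfRecords"))
    positions = [int(r.findtext(f"{SRW}recordPosition")) for r in root.iter(f"{SRW}record")]
    return total, positions


print(summarize(SAMPLE_RESPONSE))  # (2, [1, 2])
```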

All of the above is almost exactly what I did to create an index of MARC records harvested from the Library of Congress and the University of Michigan's OAI data repository (MBooks). [2] Take a look at the HTML source. Notice how the client in this regard is only one HTML file containing a form, one CSS file for style, and one XSL file for XML to HTML transformation.

[1] SRU - http://www.loc.gov/standards/sru/

[2] Example SRU interface - http://infomotions.com/ii/

Appendix A: opac.pl

 #!/usr/bin/perl
 
 # opac.pl - a simple z39.50 client
 
 # Eric Lease Morgan <emorgan@nd.edu>
 # 2007-06-05 - based on previous work with ZOOM Perl
 
 # require
 use MARC::Record;
 use strict;
 use ZOOM;
 
 # define
 use constant DATABASE => 'wilson.infomotions.com:9999/ii';   # test server
 
 # get the query
 my $query = shift;
 
 # sanity check
 if ( ! $query ) {
 
   print "Usage: $0 query\n";
   exit;
 
 }
 
 # create a connection and search
 my $connection = new ZOOM::Connection( DATABASE, 0, count => 1, preferredRecordSyntax => "usmarc" );
 my $results = $connection->search_pqf( $query );
 
 # loop through at most the first 50 hits; stop early if there are fewer
 my $last = $results->size() < 50 ? $results->size() : 50;
 for ( my $i = 0; $i < $last; $i++ ) {
 
   # get the record
   my $record = $results->record( $i )->raw;
   my $marc = MARC::Record->new_from_usmarc( $record );
 
   # extract some data
   my $author = $marc->author;
   my $title  = $marc->title_proper;
   my $date   = $marc->publication_date;
 
   # display
   print "   author: $author\n";
   print "    title: $title\n";
   print "     date: $date\n";
   print "\n";
 
 }

Appendix B: pqf.properties

 # $Id: pqf.properties,v 1.13 2006/09/20 10:12:29 mike Exp $
 #
 # Properties file to drive org.z3950.zing.cql.CQLNode's toPQF()
 # back-end and the YAZ CQL-to-PQF converter.  This specifies the
 # interpretation of various CQL indexes, relations, etc. in terms
 # of Type-1 query attributes.
 #
 # This configuration file generates queries using BIB-1 attributes.
 # See http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html
 # for the Maintenance Agency's work-in-progress mapping of Dublin Core
 # indexes to Attribute Architecture (util, XD and BIB-2)
 # attributes.
 
 # Identifiers for prefixes used in this file. (index.*)
 set.cql  = info:srw/cql-context-set/1/cql-v1.1
 set.rec  = info:srw/cql-context-set/2/rec-1.1
 set.dc   = info:srw/cql-context-set/1/dc-v1.1
 set.bath = http://zing.z3950.org/cql/bath/2.0/
 
 # The default set when an index doesn't specify one: Dublin Core
 set = info:srw/cql-context-set/1/dc-v1.1
 
 # The default index when none is specified by the query
 index.cql.serverChoice      = 1=any 2=102
 index.cql.allRecords        = 1=_ALLRECORDS 2=103
 index.rec.id                = 1=12
 index.dc.title              = 1=title 2=102
 index.dc.subject            = 1=subject 2=102
 index.dc.creator            = 1=1003 2=102
 index.dc.author             = 1=author 2=102
 index.dc.editor             = 1=1020
 index.dc.publisher          = 1=publisher
 index.dc.description        = 1=62
 index.dc.date               = 1=30
 index.dc.resourceType       = 1=1031
 index.dc.format             = 1=1034
 index.dc.resourceIdentifier = 1=key
 index.dc.source             = 1=1019
 index.dc.language           = 1=54
 index.dc.relation           = 1=?
 index.dc.coverage           = 1=?
 index.dc.rights             = 1=?
 
 # Relation attributes are selected according to the CQL relation by
 # looking up the "relation.<relation>" property:
 #
 relation.<     = 2=1
 relation.le    = 2=2
 relation.eq    = 2=3
 relation.exact = 2=3
 relation.ge    = 2=4
 relation.>     = 2=5
 relation.<>    = 2=6
 
 # These two are what Zebra uses -- may not work on other servers
 relation.all = 4=6
 relation.any = 4=105
 
 # BIB-1 doesn't have a server choice relation, so we just make the
 # choice here, and use equality (which is clearly correct).
 relation.scr = 2=3
 
 # Relation modifiers.
 relationModifier.relevant = 2=102
 relationModifier.fuzzy    = 5=103
 relationModifier.stem     = 2=101
 relationModifier.phonetic = 2=100
 
 # Non-standard extensions to provoke Zebra's inline sorting
 relationModifier.sort			= 7=1
 relationModifier.sort-desc		= 7=2
 relationModifier.numeric		= 4=109
 
 # Position attributes may be specified for anchored terms (those
 # beginning with "^", which is stripped) and unanchored (those not
 # beginning with "^").  This may change when we get a BIB-1 truncation
 # attribute that says "do what CQL does".
 position.first        = 3=1 6=1
 position.any          = 3=3 6=1
 position.last         = 3=4 6=1
 position.firstAndLast = 3=3 6=3
 
 # Structure attributes may be specified for individual relations; a
 # default structure attribute may be specified by the pseudo-relation
 # "*", to be used whenever a relation not listed here occurs.
 #
 structure.exact = 4=108
 structure.all   = 4=2
 structure.any   = 4=2
 structure.*     = 4=1
 
 # Truncation attributes used to implement CQL wildcard patterns.  The
 # simpler forms, left, right- and both-truncation will be used for the
 # simplest patterns, so that we produce PQF queries that conform more
 # closely to the Bath Profile.  However, when a more complex pattern
 # such as "foo*bar" is used, we fall back on Z39.58-style masking.
 truncation.right  = 5=1
 truncation.left   = 5=2
 truncation.both   = 5=3
 truncation.none   = 5=100
 truncation.regexp = 5=102
 truncation.z3958  = 5=104
 
 # Finally, any additional attributes that should always be included
 # with each term can be specified in the "always" property.
 always = 6=1
 
 # Bath Profile support, added Thu Dec 18 13:06:20 GMT 2003
 # See the Bath Profile for SRW at
 #	http://zing.z3950.org/cql/bath.html
 # including the Bath Context Set defined within that document.
 #
 # In this file, we only map index-names to BIB-1 use attributes, doing
 # so in accordance with the specifications of the Z39.50 Bath Profile,
 # and leaving the relations, wildcards, etc. to fend for themselves.
 
 index.bath.keyTitle              = 1=33
 index.bath.possessingInstitution = 1=1044
 index.bath.name                  = 1=1002
 index.bath.personalName          = 1=1
 index.bath.corporateName         = 1=2
 index.bath.conferenceName        = 1=3
 index.bath.uniformTitle          = 1=6
 index.bath.isbn                  = 1=7
 index.bath.issn                  = 1=8
 index.bath.geographicName        = 1=58
 index.bath.notes                 = 1=63
 index.bath.topicalSubject        = 1=1079
 index.bath.genreForm             = 1=1075
 
 ## From: marc <marc@indexdata.dk>
 ## Date: December 20, 2006 9:55:24 AM EST
 ## To: Zebra Information Server <zebralist@lists.indexdata.dk>
 ## Subject: Re: [Zebralist] pqf.properties
 ## Reply-To: Zebra Information Server <zebralist@lists.indexdata.dk>
 ## 
 ## Eric Lease Morgan wrote:
 ## > On Dec 19, 2006, at 4:45 PM, marc wrote:
 ## >>> How do I edit pqf.properties so I can get zebra to search my  
 ## >>> indexes via SRU?
 ## >>> I suppose I get this because etc/pqf.properties does not know  
 ## >>> about my field names.
 ## >>
 ## >> Right.
 ## >>
 ## >> The CQL-to-PQF conversion configuration has always been a bit of a  
 ## >> hassle, and I'd really like this to improve.
 ## >>
 ## >> The problem is, of course, that one needs to type the same index  
 ## >> names over-and-over again in different parts of the zebra configs.
 ## > Maybe I could work the other way around.
 ## > For example, how might I re-write my alvis indexing XSLT file so  
 ## > they conform to the pqf.properties file that comes with the Zebra  
 ## > distribution? Specifically, how might I change the value of the  
 ## > name attribute below so I could use CQL and search by title:
 ## >   <xsl:template match="rdf:RDF/rdf:Description/dc:Title">
 ## >     <z:index name="title" type="w"><xsl:value-of select="." /></z:index>
 ## >   </xsl:template>
 ## 
 ## The crucial part is that the CQLtoPQF config file needs to hit an  
 ## existing index.
 ## 
 ## so the lines
 ## 
 ## index.dc.title                          = 1=4
 ## and
 ##   <z:index name="title" type="w"><xsl:value-of select="." /></z:index>
 ## 
 ## need to match
 ## 
 ## If string indexes are used, easiest is to correct the standard config  
 ## file to
 ## 
 ## index.dc.title                          = 1=title
 ## 
 ## and so forth. The numeric value '4' refers to the specific bib-1  
 ## numeric attribute set, and is more confusion than help, so I suggest  
 ## you stick to your own defined string index names.
 ## 
 ## In addition, you have to take into account if the indexes are of type  
 ## 'p', 'w', '0' or otherwise specified.
 ## 
 ## so for  example:
 ##   <z:index name="thisandthat" type="p">...</z:index>
 ## 
 ## would be
 ## index.dc.something = 1=thisandthat 6=3
 ## 
 ## and
 ##   <z:index name="thisandthat" type="0">...</z:index>
 ## would be
 ## index.dc.something = 1=thisandthat 4=3
 ## 
 ## The reason why this is so complex is that people often want only to  
 ## provide a subset of functionality to CQL queries, and therefore those  
 ## are independent config files.
 ## 
 ## A better view of the way PQF queries are mapped to zebra indexes is  
 ## here:
 ## 
 ## http://www.indexdata.com/zebra/doc/querymodel-zebra.tkl#querymodel-pqf-apt-mapping
 ## 
 ## section:
 ## Mapping of PQF APT structure and completeness to register type
 ## 
 ## and you need to run a PQF query to test that at least this part  
 ## works, before attempting a CQL-to-PQF query conversion.
 ## 
 ## 
 ## So, the way to build a working config is:
 ## 
 ## 1) define your indexing rules in the indexation stylesheet
 ## i.e. define an index with
 ## 
 ##   <z:index name="thisandthat" type="0">...</z:index>
 ## 
 ## 
 ## 2) test that you worked out the correct PQF queries to access them  
 ## using the above mentioned documentation section.
 ## 
 ## i.e. test
 ## Z> querytype prefix
 ## Z> scan @attr 1=thisandthat @attr 4=3 aterm
 ## 
 ## 3) shovel that query into the right hand side of the index  
 ## definitions of the CQL to PQF converter
 ## 
 ## write
 ## index.dc.something = 1=thisandthat 4=3
 ## 
 ## 
 ## 4) test that the CQL is performed correctly
 ## 
 ## test (yes, yaz-client can send CQL queries)
 ## Z> querytype cql
 ## Z> scan dc.title=aterm
 ## 
 ## -- 
 ## Marc Cromme
 ## M.Sc and Ph.D in Mathematical Modelling and Computation
 ## Senior Developer, Project Manager

--Eric Lease Morgan 10:24, 18 June 2008 (PDT)