HAMR: Human/Authority Metadata Reconciliation

Revision as of 21:18, 7 February 2011 by Joshuago78 (Talk | contribs) (Code)

Sean Chen, Tim Donohue, Joshua Gomez, Ranti Junus, Ryan Scherle

HAMR is a tool that helps a curator determine whether the fields of a metadata record are correct. It takes a metadata record, locates any identifiers (e.g., DOI, PMID), and retrieves a copy of the record from an authoritative source (e.g., CrossRef, PubMed). It then displays a human-readable page comparing the fields of the initial record with those of the authoritative record. Each field is color-coded by how well it matches, so the curator can quickly spot discrepancies.
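The first step, locating identifiers in a record, could be sketched as follows. This is a minimal illustration, not the project's code: the record layout (a dict of Dublin Core element names to lists of strings), the function name `find_identifiers`, and both regular expressions are assumptions. In particular, the PMID pattern only recognizes values explicitly labelled "PMID".

```python
import re

# Assumed patterns (not from the HAMR code):
# DOIs start with "10." plus a registrant code, a slash, and a suffix.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")
# PMIDs are only picked up when labelled, e.g. "PMID: 12345678".
PMID_RE = re.compile(r"\bpmid:?\s*(\d+)\b", re.IGNORECASE)


def find_identifiers(record):
    """Return (scheme, value) pairs found in any field of the record.

    `record` is assumed to map element names to lists of string values.
    """
    found = []
    for values in record.values():
        for value in values:
            for doi in DOI_RE.findall(value):
                # Drop punctuation that commonly trails a DOI in running text
                found.append(("doi", doi.rstrip(".,;")))
            for pmid in PMID_RE.findall(value):
                found.append(("pmid", pmid))
    return found


record = {
    "title": ["Great American Article"],
    "identifier": ["doi:10.1234/abc.5678", "PMID: 12345678"],
}
identifiers = find_identifiers(record)
```

The resulting list can then be used to decide which authority to query.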


Narrowing the focus for today:

  • Dublin Core (maybe qualified)
  • framework that allows multiple authority sources
  • NOT focusing on author names (ORCID is already working on this); we treat them as plain strings and do basic string matching
  • 1 to 1 matching. Even if you want to eventually match with multiple authorities, you'd only do one at a time
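One way the "multiple authority sources, one at a time" idea could look is a small plugin registry keyed by identifier scheme. Everything here is hypothetical: the `AuthoritySource` interface, `REGISTRY`, `register`, `pick_source`, and the `StubCrossRef` class (which returns canned data rather than calling any real service) are illustrative names, not part of the HAMR code.

```python
from abc import ABC, abstractmethod


class AuthoritySource(ABC):
    """Hypothetical plugin interface for one authority."""

    name = ""    # label for the authority, e.g. "CrossRef"
    scheme = ""  # identifier scheme it can look up, e.g. "doi"

    @abstractmethod
    def fetch(self, identifier):
        """Return the authority record as simple Dublin Core:
        a dict mapping element names to lists of strings."""


REGISTRY = {}  # scheme -> AuthoritySource instance


def register(source):
    REGISTRY[source.scheme] = source
    return source


def pick_source(identifiers):
    """Pick exactly one authority (1 to 1 matching): the first identifier
    whose scheme has a registered source. Returns (source, value) or None."""
    for scheme, value in identifiers:
        if scheme in REGISTRY:
            return REGISTRY[scheme], value
    return None


class StubCrossRef(AuthoritySource):
    """Stand-in that returns canned data instead of querying a service."""

    name = "CrossRef"
    scheme = "doi"

    def fetch(self, identifier):
        return {"title": ["Great American Article, The"],
                "identifier": [identifier]}


register(StubCrossRef())
source, value = pick_source([("pmid", "12345678"), ("doi", "10.1234/abc")])
authority_record = source.fetch(value)
```

Adding another authority would then mean writing one more subclass and registering it, without touching the comparison code.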

Possible authority sources:

Thoughts / Questions:

  • Is there a way to do most/all of this via JavaScript/AJAX/jQuery? Could it be a simple JavaScript framework you could "drop" into any metadata editing interface?
    • Unfortunately, this doesn't appear workable. To query external authorities directly from the browser, each of them would need to support JSONP or a similar cross-domain mechanism, and they don't.

Code

Draft Matching Algorithm

function compareRecords(localDubCore, authDubCore)
    results = {}
    for each element-type:
        loc = array of local values
        auth = array of authority values
        // arrays are actually lists of dictionaries
        // a1
        // 0    value="Benson, Arnold", match="", strength=""
        // 1    value="Terrence, D.", match="a2[3]", strength="100%"
        results[element-type] = compareElements(loc, auth)
    return results


function compareElements(loc, auth)
    output = []
    for each locValue in loc
        for each authValue in auth
            // strength is a similarity score: 100 means the strings are identical
            strength = string similarity between locValue and authValue
            if strength == 100
                // pop each element and add their values to output array
                // output array is also list of dictionaries
                // 0    loc="Hector", auth="Hector", strength="100"
                // 1    loc="Albert", auth="Alberto", strength="90"
    return output
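A runnable approximation of the draft above is sketched below. It is not the algorithm the group settled on: it substitutes Python's `difflib.SequenceMatcher` ratio for the unspecified string-distance measure, pairs values greedily from strongest to weakest, and uses a `threshold` of 80 that is purely an assumption (the draft only shows the exact-match case).

```python
from difflib import SequenceMatcher


def strength(a, b):
    """Similarity of two strings as a whole-number percentage (100 = identical)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def compare_elements(loc, auth, threshold=80):
    """Pair local and authority values for one element type.

    Returns (matches, unmatched_local, unmatched_authority).
    """
    # Score every local/authority pair, strongest first (ties broken by position)
    scored = sorted(
        ((strength(l, a), i, j)
         for i, l in enumerate(loc) for j, a in enumerate(auth)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_loc, used_auth, matches = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # every remaining pair is weaker still
        if i in used_loc or j in used_auth:
            continue  # each value may be paired at most once
        used_loc.add(i)
        used_auth.add(j)
        matches.append({"loc": loc[i], "auth": auth[j], "strength": score})
    unmatched_loc = [v for i, v in enumerate(loc) if i not in used_loc]
    unmatched_auth = [v for j, v in enumerate(auth) if j not in used_auth]
    return matches, unmatched_loc, unmatched_auth


matches, only_local, only_authority = compare_elements(
    ["Hector", "Albert"], ["Alberto", "Hector"]
)
```

Values left in the two unmatched lists correspond to the unpaired values that the output needs to report separately.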

== Output Spec ==

* We will use a simple XML output consisting of paired (and possibly unpaired) values.
* The root element will contain an attribute signifying the source of the authority metadata.
* The <match> element will be used to pair values, with a strength attribute to signify the string distance.
* Within each match element will be exactly 2 metadata elements with attributes signifying the source of each value: either the local input or the remote authority data.
* A <nonmatch> element will be used for unpaired values.
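The spec above could be produced with Python's standard `xml.etree.ElementTree` module, as in this sketch. The function name `build_output` and the tuple layouts it accepts are assumptions made for illustration; only the element and attribute names come from the spec.

```python
import xml.etree.ElementTree as ET


def build_output(authority, matches, nonmatches):
    """Serialize comparison results following the output spec.

    matches:    list of (element, input_value, authority_value, strength_percent)
    nonmatches: list of (element, src, value), where src is "input" or "authority"
    """
    root = ET.Element("hamr", authority=authority)
    for element, local, remote, strength in matches:
        match = ET.SubElement(root, "match", strength=f"{strength}%")
        # Exactly two metadata elements per match, one from each source
        ET.SubElement(match, element, src="input").text = local
        ET.SubElement(match, element, src="authority").text = remote
    for element, src, value in nonmatches:
        nonmatch = ET.SubElement(root, "nonmatch")
        ET.SubElement(nonmatch, element, src=src).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_output(
    "PubMed",
    [("creator", "Trojan, Tommy", "Trojan, Tommy", 100)],
    [("subject", "input", "Medical Stuff"), ("type", "authority", "text")],
)
```

`ET.tostring` returns the document on a single line; a display layer would be free to pretty-print it.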

=== Sample Output ===
<pre>
<hamr authority="PubMed">
    <match strength="100%">
        <creator src="input">Trojan, Tommy</creator>
        <creator src="authority">Trojan, Tommy</creator>
    </match>
    <match strength="90%">
        <title src="input">Great American Article</title>
        <title src="authority">Great American Article, The</title>
    </match>
    <nonmatch>
        <subject src="input">Medical Stuff</subject>
    </nonmatch>
    <nonmatch>
        <type src="authority">text</type>
    </nonmatch>
</hamr>
</pre>

Need to do

  1. Implement metadata retrieval from the authority (done for CrossRef in Ryan's code)
  2. Design structure of plugins
    1. Crosswalk from authority format to simple Dublin Core
  3. Design matching algorithm
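For the crosswalk item in the list above, one possible shape is a plain lookup table from authority field names to simple Dublin Core elements. The field names on the left of `CROSSWALK` are invented for illustration and do not reflect CrossRef's or PubMed's actual formats; `to_simple_dc` is likewise a hypothetical helper.

```python
# Hypothetical authority field names -> simple Dublin Core element names
CROSSWALK = {
    "article_title": "title",
    "author": "creator",
    "journal": "source",
    "year": "date",
    "doi": "identifier",
}


def to_simple_dc(authority_record, crosswalk=CROSSWALK):
    """Convert an authority record (dict) into simple Dublin Core:
    a dict mapping element names to lists of string values."""
    dc = {}
    for field, value in authority_record.items():
        element = crosswalk.get(field)
        if element is None:
            continue  # field has no Dublin Core equivalent in this mapping
        values = value if isinstance(value, list) else [value]
        dc.setdefault(element, []).extend(str(v) for v in values)
    return dc


dc_record = to_simple_dc({
    "article_title": "Great American Article, The",
    "author": ["Trojan, Tommy"],
    "year": 2011,
    "volume": "3",
})
```

Keeping the mapping as data means each authority plugin would only need to supply its own table.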