Difference between revisions of "HAMR: Human/Authority Metadata Reconciliation"

From Code4Lib

Latest revision as of 20:29, 9 March 2012

HAMR: Human/Authority Metadata Reconciliation

Initial design/prototype by: Sean Chen, Tim Donohue, Joshua Gomez, Ranti Junus, Ryan Scherle

HAMR is a tool that helps a curator determine whether the fields of a metadata record are correct. It takes a metadata record, locates any identifiers in it (e.g., DOI, PMID), and retrieves a copy of the record from an authoritative source (e.g., CrossRef, PubMed). It then displays a human-readable page that compares the fields of the initial record with those of the authoritative record. Each field is color-coded according to how well it matches, so the curator can quickly identify discrepancies.
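
As a rough illustration of the "locates any identifiers" step, here is a minimal Python sketch. The record shape (a dict of field name to list of strings), the function name `find_identifiers`, and both regular expressions are assumptions made for this example; they are simple heuristics, not a complete DOI or PMID grammar.

```python
import re

# Assumed heuristics: a DOI looks like "10.<registrant>/<suffix>", and a PMID
# is a run of digits explicitly labelled "PMID". Neither pattern is exhaustive.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")
PMID_PATTERN = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)


def find_identifiers(record):
    """Scan every value of a record (dict of field -> list of strings)."""
    found = {"doi": [], "pmid": []}
    for values in record.values():
        for value in values:
            # Trim trailing punctuation that often follows a DOI in running text.
            found["doi"].extend(m.rstrip(".,;") for m in DOI_PATTERN.findall(value))
            found["pmid"].extend(PMID_PATTERN.findall(value))
    return found
```

Scanning every field, rather than only an identifier field, is a design choice for the sketch: local records do not always put the DOI where it belongs.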

UI Prototype (uses static data)

http://dl.dropbox.com/u/9074989/code4lib/unverified.html

Basic design

Narrowing the focus for an initial usable version:

  • Dublin Core (maybe qualified)
  • A framework that allows multiple authority sources
  • NOT focusing on author names (ORCID is already working on this); we treat them as plain strings and do basic string matching
  • 1-to-1 matching. Even if you eventually want to match against multiple authorities, you'd only do one at a time
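
One possible shape for a framework that allows multiple authority sources while still matching against one at a time is sketched below. Everything here (the `AuthoritySource` interface, the `StaticSource` stand-in, and `pick_source`) is hypothetical and only meant to make the design constraint concrete.

```python
from abc import ABC, abstractmethod


class AuthoritySource(ABC):
    """Hypothetical plugin interface; one subclass per authority."""

    name = "unnamed"
    identifier_type = ""  # e.g. "doi" or "pmid"

    @abstractmethod
    def fetch(self, identifier):
        """Return the authority's raw record for the identifier."""

    @abstractmethod
    def to_dublin_core(self, raw):
        """Crosswalk the raw record to a dict of DC element -> list of values."""


class StaticSource(AuthoritySource):
    """Stand-in authority backed by an in-memory dict (no network access)."""

    name = "Static"
    identifier_type = "doi"

    def __init__(self, records):
        self._records = records

    def fetch(self, identifier):
        return self._records[identifier]

    def to_dublin_core(self, raw):
        return {"title": [raw["title"]], "creator": list(raw["authors"])}


def pick_source(sources, identifiers):
    """1-to-1 matching: return the first (source, identifier) pair that applies."""
    for source in sources:
        ids = identifiers.get(source.identifier_type, [])
        if ids:
            return source, ids[0]
    return None, None
```

A real PubMed or CrossRef plugin would replace `StaticSource`, keeping the same two methods so the comparison code never needs to know which authority it is talking to.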

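Possible authority sources:

  • PubMed
    • Sample PubMed query (in Java): DSpace PubMedPrefillStep.java (https://wiki.duraspace.org/display/DSPACE/PubMedPrefill-PubmedPrefillStep.java), from Populate Metadata from PubMed (https://wiki.duraspace.org/display/DSPACE/PopulateMetadataFromPubMed)
      • See 'retrievePubmedXML()' in the above Java code for the actual call to PubMed
      • Mapping happens here: see pmid-to-dim.xsl (https://wiki.duraspace.org/display/DSPACE/PubMedPrefill-pmid+dim.xsl) for a sample XSLT crosswalk that translates PubMed format to qualified Dublin Core (the internal DSpace metadata format)
    • More examples of querying PubMed: http://www.my-whiteboard.com/how-to-automate-pubmed-search-using-perl-php-or-java/
    • Useful tool for finding PubMed IDs: http://www.ncbi.nlm.nih.gov/entrez/getids.cgi
  • CrossRef
    • Simply send the DOI to CrossRef and get JSON/XML back
      • http://api.labs.crossref.org/10.1111/j.1558-5646.2009.00626.x.json
      • http://api.labs.crossref.org/10.2307/1935157.xml
      • Java code that includes a lookup: http://code.google.com/p/dryad/source/browse/trunk/dryad/dspace/modules/doi/dspace-doi-webapp/src/main/java/org/dspace/doi/DOIServlet.java
    • Metadata Search (http://labs.crossref.org/site/crossref_metadata_search.html): send a text query, receive a list of matching records
    • OpenURL search (http://labs.crossref.org/site/quick_and_dirty_api_guide.html)
  • Google Scholar: does it have an API?
  • Mendeley (http://www.mendeley.com): Mendeley API (http://dev.mendeley.com/)
  • VIVO (http://vivoweb.org/)
  • BibApp (http://bibapp.org/)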
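
For the CrossRef authority, a lookup amounts to sending the DOI and getting JSON or XML back. The sketch below only builds the request URL, assuming the labs URL pattern `http://api.labs.crossref.org/<DOI>.json` (or `.xml`); that endpoint was a labs service and may no longer be available, and the function name `crossref_lookup_url` is made up for this example.

```python
from urllib.parse import quote

# Assumed base URL, following the labs pattern <base><DOI>.<format>.
LABS_BASE = "http://api.labs.crossref.org/"


def crossref_lookup_url(doi, fmt="json"):
    """Build the lookup URL for a DOI; fmt is "json" or "xml"."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    # Keep "/" unescaped so the DOI prefix/suffix separator survives.
    return LABS_BASE + quote(doi, safe="/") + "." + fmt
```

Fetching and parsing the response is left out on purpose, since the response format of that service is not described here.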

Thoughts / Questions:

  • Is there a way to do most/all of this via JavaScript/AJAX/jQuery? Could it be a simple JavaScript framework you could "drop" into any metadata editing interface?
    • Unfortunately, it seems this wouldn't work. To query external authorities from the browser, they would all need to support JSONP or similar (and they don't).

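Code

  • Quick reference for Git: http://gitref.org/
  • Ryan's really stupid scratch implementation: https://github.com/ryscher/hamr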

Draft Matching Algorithm

function compareRecords(localDubCore, authDubCore)
    recordMatches = []
    for each element-type:
        loc = array of local values
        auth = array of authority values
        // arrays are actually lists of dictionaries, e.g. for array a1:
        // 0    value="Benson, Arnold", match="", strength=""
        // 1    value="Terrence, D.", match="a2[3]", strength="100%"
        elementMatches = compareElements(loc, auth)
        recordMatches.add(elementMatches)
    return recordMatches


function compareElements(loc, auth)
    output = []
    // nested loops run through the values and assign the strongest match to each element
    for each element in loc
        for each element in auth
            // strength is a similarity score derived from the string distance (100 = identical)
            strength = string similarity between the two elements
            if strength = 100
                // if the match is perfect, go ahead and pop both elements and add their values to the output array
                // output array is also a list of dictionaries
                // 0    loc="Hector", auth="Hector", strength="100"
                // 1    loc="Albert", auth="Alberto", strength="90"
                pop both elements and add their values to output array
                continue with the next element in loc
            if strength > auth element's current strength value
                overwrite auth element's strength and match values
            if strength > loc element's current strength value
                overwrite loc element's strength and match values
    // this second set of non-nested loops pulls out the strongest remaining matches
    for each item in auth
        // x = some arbitrary threshold for a decent enough match
        if element strength > x AND matching element is still in the loc list
            pop both elements and add their values to output array
    for each item in loc
        if element strength > x AND matching element is still in the auth list
            pop both elements and add their values to output array
    // now do cleanup: whatever is left has no decent match
    for each remaining element in loc
        pop element and add to output array without match // x   loc="Heyward", auth="", strength=""
    for each remaining element in auth
        pop element and add to output array without match // x   loc="", auth="Perry", strength=""
    return output
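
The draft above can be approximated in runnable Python. This is a sketch under stated assumptions, not the project's implementation: it swaps the draft's pop-based loops for greedy best-first pairing (score every pair, then pair the strongest first), uses `difflib.SequenceMatcher` from the standard library as a stand-in for the unspecified string-distance measure, and picks 80 as an arbitrary value for the threshold x.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Percent similarity (0-100) between two strings, case-insensitive."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def compare_elements(loc, auth, threshold=80):
    """Pair local and authority values, strongest matches first."""
    scored = sorted(
        ((similarity(lv, av), i, j)
         for i, lv in enumerate(loc)
         for j, av in enumerate(auth)),
        key=lambda t: -t[0],
    )
    used_loc, used_auth, output = set(), set(), []
    for strength, i, j in scored:
        if strength < threshold:
            break  # everything after this is weaker still
        if i in used_loc or j in used_auth:
            continue  # one side is already paired with a stronger match
        used_loc.add(i)
        used_auth.add(j)
        output.append({"loc": loc[i], "auth": auth[j], "strength": strength})
    # Cleanup: whatever is left has no decent match.
    output += [{"loc": v, "auth": "", "strength": None}
               for i, v in enumerate(loc) if i not in used_loc]
    output += [{"loc": "", "auth": v, "strength": None}
               for j, v in enumerate(auth) if j not in used_auth]
    return output


def compare_records(local_dc, auth_dc, threshold=80):
    """Compare two Dublin Core records given as {element: [values]}."""
    elements = sorted(set(local_dc) | set(auth_dc))
    return {e: compare_elements(local_dc.get(e, []), auth_dc.get(e, []), threshold)
            for e in elements}
```

With local values "Hector", "Albert", "Heyward" and authority values "Alberto", "Perry", "Hector", this pairs Hector with Hector and Albert with Alberto, and leaves Heyward and Perry unmatched, mirroring the examples in the draft's comments.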

Output Spec

  • We will use a simple XML output consisting of paired (and possibly unpaired) values.
  • The root element will carry an attribute naming the source of the authority metadata.
  • The <match> element will be used to pair values, with a strength attribute giving the match strength as a percentage derived from the string distance (100% = identical).
  • Each <match> element will contain exactly 2 metadata elements, each with an attribute signifying the source of its value: either the local input or the remote authority data.
  • A <nonmatch> element will be used for unpaired values.

Sample Output

<hamr authority="PubMed">
    <match strength="100%">
        <creator src="input">Trojan, Tommy</creator>
        <creator src="authority">Trojan, Tommy</creator>
    </match>
    <match strength="90%">
        <title src="input">Great American Article</title>
        <title src="authority">Great American Article, The</title>
    </match>
    <nonmatch>
        <subject src="input">Medical Stuff</subject>
    </nonmatch>
    <nonmatch>
        <type src="authority">text</type>
    </nonmatch>
</hamr>
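
Output in the shape of the sample above could be generated with Python's standard `xml.etree.ElementTree`. The input structure (a dict of element name to a list of `loc`/`auth`/`strength` dictionaries) and the function name `to_hamr_xml` are assumptions for this sketch; only the element and attribute names come from the spec.

```python
import xml.etree.ElementTree as ET


def to_hamr_xml(authority, record_matches):
    """Serialize {dc_element: [{"loc", "auth", "strength"}, ...]} as HAMR XML."""
    root = ET.Element("hamr", authority=authority)
    for element, pairs in record_matches.items():
        for pair in pairs:
            if pair["loc"] and pair["auth"]:
                parent = ET.SubElement(root, "match", strength=f"{pair['strength']}%")
            else:
                parent = ET.SubElement(root, "nonmatch")
            # Emit the local value first, then the authority value, when present.
            for src, key in (("input", "loc"), ("authority", "auth")):
                if pair[key]:
                    child = ET.SubElement(parent, element, src=src)
                    child.text = pair[key]
    return ET.tostring(root, encoding="unicode")
```

The result is a single unindented line; pretty-printing is left out to keep the sketch short.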

Need to do

  1. Implement metadata retrieval from authority (done for CrossRef in Ryan's code)
  2. Design structure of plugins
    1. Crosswalk from authority format to simple Dublin Core
  3. Design matching algorithm
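
For the crosswalk item, one simple option is a table-driven mapping from authority field names to simple Dublin Core elements. The field names on the left of `CROSSWALK` are hypothetical (no authority's actual format is assumed), as is the `crosswalk` function itself.

```python
# Hypothetical authority field names on the left; simple Dublin Core on the right.
CROSSWALK = {
    "article_title": "title",
    "authors": "creator",
    "journal": "source",
    "year": "date",
}


def crosswalk(raw, mapping=CROSSWALK):
    """Translate a raw authority record into {dc_element: [values]}."""
    dc = {}
    for field, value in raw.items():
        element = mapping.get(field)
        if element is None:
            continue  # unmapped fields are dropped in this sketch
        values = value if isinstance(value, list) else [value]
        dc.setdefault(element, []).extend(str(v) for v in values)
    return dc
```

Keeping the mapping as data means each authority plugin could ship its own table without new code.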