HAMR: Human/Authority Metadata Reconciliation

5,899 bytes added, 20:29, 9 March 2012
[[HAMR: Human/Authority Metadata Reconciliation]]
Initial design/prototype by: Sean Chen, Tim Donohue, Joshua Gomez, Ranti Junus, Ryan Scherle 
A tool for a curator to determine whether the various fields of a metadata record are correct. Takes a metadata record, locates any identifiers (e.g., DOI, PMID). Retrieves a copy of the metadata record from an authoritative source (e.g., CrossRef, PubMed). Displays a human-readable page that compares fields in the initial record with fields in the authoritative record. Each field is color-coded based on how well it matches, so the curator can quickly identify discrepancies.
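The identifier-location step could be sketched roughly as below. This is only an illustration, not part of the prototype: the function name <code>find_identifiers</code>, the regular expressions, and the assumption that PMIDs appear with a "pmid" prefix are all hypothetical, and the PMID in the usage note is made up.

```python
import re

# Assumption: DOIs look like "10.<registrant>/<suffix>"; this pattern is a
# common heuristic, not an official grammar.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE)
# Assumption: PMIDs are only recognised when labelled, e.g. "PMID: 123".
PMID_RE = re.compile(r"\bpmid:?\s*(\d+)\b", re.IGNORECASE)


def find_identifiers(values):
    """Scan identifier-like field values and return (scheme, id) pairs."""
    found = []
    for value in values:
        doi = DOI_RE.search(value)
        if doi:
            # Strip trailing punctuation that often follows a DOI in prose.
            found.append(("doi", doi.group(1).rstrip(".,;")))
        pmid = PMID_RE.search(value)
        if pmid:
            found.append(("pmid", pmid.group(1)))
    return found
```

For example, <code>find_identifiers(["doi:10.1111/j.1558-5646.2009.00626.x", "PMID: 12345678"])</code> would yield one DOI pair and one PMID pair, which could then be handed to the matching authority source.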
== UI Prototype (uses static data) ==
http://dl.dropbox.com/u/9074989/code4lib/unverified.html
== Basic design ==
Narrowing the focus for an initial usable version:
* Dublin Core (maybe qualified)
* framework that allows multiple authority sources
* NOT focusing on author names ([http://www.orcid.org/ ORCID] is already working on this), except for the fact that they are strings, and we'll do basic string matching
* 1 to 1 matching. Even if you want to eventually match with multiple authorities, you'd only do one at a time
Possible authority sources:
* PubMed
** Sample PubMed query (in Java): [https://wiki.duraspace.org/display/DSPACE/PubMedPrefill-PubmedPrefillStep.java DSpace PubMedPrefillStep.java] (from [https://wiki.duraspace.org/display/DSPACE/PopulateMetadataFromPubMed Populate Metadata from PubMed])
*** See 'retrievePubmedXML()' in the above Java code for the actual call to PubMed
*** Mapping happens here: see [https://wiki.duraspace.org/display/DSPACE/PubMedPrefill-pmid+dim.xsl pmid-to-dim.xsl] for a sample XSLT crosswalk that translates PubMed format to qualified Dublin Core (the internal DSpace metadata format)
** More examples of querying PubMed: http://www.my-whiteboard.com/how-to-automate-pubmed-search-using-perl-php-or-java/
** Useful tool for finding PubMed IDs: http://www.ncbi.nlm.nih.gov/entrez/getids.cgi
* CrossRef
** Simply send the DOI to CrossRef and get JSON/XML back
*** http://api.labs.crossref.org/10.1111/j.1558-5646.2009.00626.x.json
*** http://api.labs.crossref.org/10.2307/1935157.xml
*** [http://code.google.com/p/dryad/source/browse/trunk/dryad/dspace/modules/doi/dspace-doi-webapp/src/main/java/org/dspace/doi/DOIServlet.java Java code that includes a lookup]
** [http://labs.crossref.org/site/crossref_metadata_search.html Metadata Search] -- send a text query, receive a list of matching records
** [http://labs.crossref.org/site/quick_and_dirty_api_guide.html OpenURL search]
* Google Scholar - does it have an API?
* [http://www.mendeley.com Mendeley] - [http://dev.mendeley.com/ Mendeley API]
* [http://vivoweb.org/ VIVO]
* [http://bibapp.org/ BibApp]

Thoughts / Questions:
* Is there a way to do most/all of this via JavaScript/AJAX/jQuery? Could it be a simple JavaScript framework you could "drop" into any metadata editing interface?
** Unfortunately, it seems this wouldn't work out. To query external authorities directly from the browser, they would all need to support [http://en.wikipedia.org/wiki/JSON#JSONP JSONP] or similar (and they don't).

== Code ==
* [http://gitref.org/ Quick reference for Git]
* [https://github.com/ryscher/hamr Ryan's really stupid scratch implementation]

=== Draft Matching Algorithm ===
<pre>
function compareRecords(localDubCore, authDubCore)
    recordMatches = []
    for each element-type:
        loc = array of local values
        auth = array of authority values
        // arrays are actually lists of dictionaries
        // a1
        //   0  value="Benson, Arnold", match="", strength=""
        //   1  value="Terrence, D.", match="a2[3]", strength="100%"
        elementMatches = compareElements(loc, auth)
        recordMatches.add(elementMatches)
    return recordMatches

function compareElements(loc, auth)
    output = []
    // nested loops run through the values and assign the strongest match to each element
    for each element in loc
        for each element in auth
            strength = string distance between the two elements
            if strength == 100
                // if the match is perfect, go ahead
                pop each element and add their values to output array
                // output array is also a list of dictionaries
                //   0  loc="Hector", auth="Hector", strength="100"
                //   1  loc="Albert", auth="Alberto", strength="90"
            if strength > auth element's current strength value
                overwrite auth element's strength and match values
            if strength > loc element's current strength value
                overwrite loc element's strength and match values

    // this second set of non-nested loops pulls out the strongest matches
    for each item in auth
        // x = some arbitrary barrier for a decent enough match
        if element strength > x AND matching element is still in the loc list
            pop each element and add their values to output array
    for each item in loc
        if element strength > x AND matching element is still in the auth list
            pop each element and add their values to output array

    // now do cleanup and look for values that have no decent matches
    for each element in loc
        pop element and add to output array without match
        //   x  loc="Heyward", auth="", strength=""
    for each element in auth
        pop element and add to output array without match
        //   x  loc="", auth="Perry", strength=""
    return output
</pre>

== Output Spec ==
* We will use a simple XML output consisting of paired (and possibly unpaired) values.
* The root element will contain an attribute signifying the source of the authority metadata.
* The <match> element will be used to pair values, with a strength attribute to signify the string distance.
* Within each match element will be exactly 2 metadata elements with attributes signifying the source of each value: either the local input or the remote authority data.
* A <nonmatch> element will be used for unpaired values.

=== Sample Output ===
<pre>
<hamr authority="PubMed">
  <match strength="100%">
    <creator src="input">Trojan, Tommy</creator>
    <creator src="authority">Trojan, Tommy</creator>
  </match>
  <match strength="90%">
    <title src="input">Great American Article</title>
    <title src="authority">Great American Article, The</title>
  </match>
  <nonmatch>
    <subject src="input">Medical Stuff</subject>
  </nonmatch>
  <nonmatch>
    <type src="authority">text</type>
  </nonmatch>
</hamr>
</pre>

== Need to do ==
# Implement metadata retrieval from authority ''(done for CrossRef in Ryan's code)''
# Design structure of plugins
## Crosswalk from authority format to simple DC
# Design matching algorithm
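
As a possible starting point for the "Design matching algorithm" item, the draft algorithm and output spec above could be sketched in Python roughly as follows. This is a simplification under stated assumptions, not the agreed design: it pairs values greedily (strongest pair first) instead of following the draft's loops step by step, it uses <code>difflib.SequenceMatcher</code> as a stand-in for "string distance", and the names <code>THRESHOLD</code>, <code>similarity</code>, <code>compare_elements</code>, <code>compare_records</code> and <code>to_hamr_xml</code> are hypothetical.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET

THRESHOLD = 80  # assumed value for the draft's arbitrary barrier "x" (percent)


def similarity(a, b):
    """0-100 similarity score; difflib's ratio stands in for string distance."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def compare_elements(loc, auth, threshold=THRESHOLD):
    """Pair local and authority values greedily, strongest pairs first."""
    candidates = sorted(
        ((similarity(l, a), i, j)
         for i, l in enumerate(loc) for j, a in enumerate(auth)),
        key=lambda c: -c[0],
    )
    used_loc, used_auth, output = set(), set(), []
    for strength, i, j in candidates:
        if strength < threshold:
            break  # remaining candidates are all weaker
        if i in used_loc or j in used_auth:
            continue  # one of the two values is already paired
        used_loc.add(i)
        used_auth.add(j)
        output.append({"loc": loc[i], "auth": auth[j], "strength": strength})
    # Cleanup: values with no decent match are emitted unpaired.
    output += [{"loc": v, "auth": "", "strength": ""}
               for i, v in enumerate(loc) if i not in used_loc]
    output += [{"loc": "", "auth": v, "strength": ""}
               for j, v in enumerate(auth) if j not in used_auth]
    return output


def compare_records(local_dc, auth_dc):
    """Compare two records given as {element_type: [values]} dictionaries."""
    types = dict.fromkeys(list(local_dc) + list(auth_dc))  # keeps order
    return {t: compare_elements(local_dc.get(t, []), auth_dc.get(t, []))
            for t in types}


def to_hamr_xml(record_matches, authority):
    """Serialise the comparison following the output spec above."""
    root = ET.Element("hamr", authority=authority)
    for element_type, matches in record_matches.items():
        for m in matches:
            if m["loc"] and m["auth"]:
                node = ET.SubElement(root, "match", strength=f"{m['strength']}%")
            else:
                node = ET.SubElement(root, "nonmatch")
            if m["loc"]:
                ET.SubElement(node, element_type, src="input").text = m["loc"]
            if m["auth"]:
                ET.SubElement(node, element_type, src="authority").text = m["auth"]
    return ET.tostring(root, encoding="unicode")
```

Calling <code>to_hamr_xml(compare_records(local, authority), "PubMed")</code> with two dictionaries of Dublin Core values returns a single-line XML string with the same structure as the sample output. The greedy pairing was chosen only because it is short; whether it gives the same pairs as the draft loops in every case has not been checked.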