
HAMR: Human/Authority Metadata Reconciliation

2,717 bytes added, 20:29, 9 March 2012


[[HAMR: Human/Authority Metadata Reconciliation]]

Initial design/prototype by: Sean Chen, Tim Donohue, Joshua Gomez, Ranti Junus, Ryan Scherle

A tool for a curator to determine whether the various fields of a metadata record are correct. Takes a metadata record, locates any identifiers (e.g., DOI, PMID). Retrieves a copy of the metadata record from an authoritative source (e.g., CrossRef, PubMed). Displays a human-readable page that compares fields in the initial record with fields in the authoritative record. Each field is color-coded based on how well it matches, so the curator can quickly identify discrepancies.
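As a rough illustration of the "locates any identifiers" step, here is a minimal Python sketch. The regular expressions, the `find_identifiers` name, the record layout (field name mapped to a list of strings), and the sample DOI and PMID are all assumptions made for this example, not part of the prototype.

```python
import re

# Assumed patterns (not from the prototype): a DOI is "10.", a 4-9 digit
# registrant code, "/", and a suffix; a PMID is a run of digits after a
# "pmid:" label.
DOI_RE = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)', re.IGNORECASE)
PMID_RE = re.compile(r'\bpmid:\s*(\d+)\b', re.IGNORECASE)


def find_identifiers(record):
    """Scan every field value and collect anything that looks like a DOI or PMID."""
    found = {"doi": [], "pmid": []}
    for values in record.values():
        for value in values:
            found["doi"].extend(DOI_RE.findall(value))
            found["pmid"].extend(PMID_RE.findall(value))
    return found


# Hypothetical record: field name -> list of values.
record = {
    "title": ["Great American Article"],
    "identifier": ["doi:10.1234/example.5678", "PMID: 12345678"],
}
ids = find_identifiers(record)
```

Each identifier found this way would then decide which authority (CrossRef for a DOI, PubMed for a PMID) is asked for its copy of the record.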

== UI Prototype (uses static data) ==

http://dl.dropbox.com/u/9074989/code4lib/unverified.html

== Basic design ==

Narrowing the focus for an initial usable version:

* Dublin Core (maybe qualified)

* framework that allows multiple authority sources

*** Mapping happens here: see [https://wiki.duraspace.org/display/DSPACE/PubMedPrefill-pmid+dim.xsl pmid-to-dim.xsl] for a sample XSLT crosswalk that translates the PubMed format to qualified Dublin Core (the internal DSpace metadata format)

** More examples of querying PubMed: http://www.my-whiteboard.com/how-to-automate-pubmed-search-using-perl-php-or-java/

** Useful tool for finding PubMed IDs: http://www.ncbi.nlm.nih.gov/entrez/getids.cgi

* CrossRef

** simply send the DOI to CrossRef, and get JSON/XML back
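To make the authority-plugin idea more concrete, the sketch below shows one possible CrossRef plugin in Python: build the lookup URL for a DOI, then crosswalk the JSON response into simple Dublin Core. It assumes CrossRef's current public REST API (`https://api.crossref.org/works/<DOI>`) and its `message` fields (`DOI`, `title`, `author`, `issued`), which may differ from the service available when this page was written. The function names, the simple-DC layout, and the sample response are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def crossref_url(doi):
    # Assumption: CrossRef's public REST API; not the endpoint used by the prototype.
    return "https://api.crossref.org/works/" + quote(doi)


def fetch_crossref(doi):
    """Fetch the raw JSON record for a DOI (needs network access; not called below)."""
    with urlopen(crossref_url(doi)) as resp:
        return json.load(resp)


def crossref_to_dc(response):
    """Crosswalk a CrossRef JSON response to simple DC: element name -> list of values."""
    msg = response.get("message", {})
    dc = {}
    if msg.get("DOI"):
        dc["identifier"] = [msg["DOI"]]
    if msg.get("title"):
        dc["title"] = list(msg["title"])
    creators = [
        ", ".join(part for part in (a.get("family"), a.get("given")) if part)
        for a in msg.get("author", [])
    ]
    creators = [c for c in creators if c]
    if creators:
        dc["creator"] = creators
    # "issued" holds [[year, month, day]]; month and day may be absent.
    parts = (msg.get("issued", {}).get("date-parts") or [[]])[0]
    if parts and parts[0] is not None:
        dc["date"] = ["-".join(f"{p:02d}" for p in parts)]
    return dc


# Hypothetical canned response, shaped like the API's JSON, so the sketch runs offline.
sample = {
    "message": {
        "DOI": "10.1234/example.5678",
        "title": ["Great American Article, The"],
        "author": [{"given": "Tommy", "family": "Trojan"}],
        "issued": {"date-parts": [[2012, 3]]},
    }
}
dc = crossref_to_dc(sample)
```

A PubMed plugin could expose the same two steps (fetch, then crosswalk), with the XSLT crosswalk doing the mapping.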

Thoughts / Questions:

* Is there a way to do most/all of this via JavaScript/AJAX/jQuery? Could it be a simple JavaScript framework you could "drop" into any metadata editing interface?

** Unfortunately, it seems this wouldn't work. To query external authorities directly from the browser, they would all need to support [http://en.wikipedia.org/wiki/JSON#JSONP JSONP] or similar (and they don't)

== Code ==

* [http://gitref.org/ quick reference for Git]

* [https://github.com/ryscher/hamr Ryan's really stupid scratch implementation]

=== Draft Matching Algorithm ===

<pre>
function compareRecords(localDubCore, authDubCore)
    recordMatches = []
    for each element-type:
        loc = array of local values
        auth = array of authority values
        // arrays are actually lists of dictionaries
        // a1
        // 0 value="Benson, Arnold", match="", strength=""
        // 1 value="Terrence, D.", match="a2[3]", strength="100%"
        elementMatches = compareElements(loc, auth)
        recordMatches.add(elementMatches)
    return recordMatches

function compareElements(loc, auth)
    output = []
    // nested loops run through the values and assign the strongest match to each element
    for each element in loc
        for each element in auth
            strength = string distance between the two elements
            if strength = 100
                // if the match is perfect, go ahead and pop each element and add their values to the output array
                // output array is also a list of dictionaries
                // 0 loc="Hector", auth="Hector", strength="100"
                // 1 loc="Albert", auth="Alberto", strength="90"
                pop each element and add their values to output array
            if strength > auth element's current strength value
                overwrite auth element's strength and match values
            if strength > loc element's current strength value
                overwrite loc element's strength and match values

    // this second set of non-nested loops pulls out the strongest matches
    for each item in auth
        // x = some arbitrary barrier for a decent enough match
        if element strength > x AND matching element is still in the loc list
            pop each element and add their values to output array
    for each item in loc
        if element strength > x AND matching element is still in the auth list
            pop each element and add their values to output array

    // now do cleanup and look for values that have no decent matches
    for each element in loc
        pop element and add to output array without match // x loc="Heyward", auth="", strength=""
    for each element in auth
        pop element and add to output array without match // x loc="", auth="Perry", strength=""
    return output
</pre>
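The draft above could be sketched in Python roughly as follows. This is a simplification, not the prototype's code: it scores every local/authority pair, then pairs values greedily from strongest to weakest instead of tracking a running best match per element. `difflib.SequenceMatcher` stands in for the unspecified "string distance", and the `threshold` of 80 stands in for the arbitrary barrier `x`; both are assumptions.

```python
from difflib import SequenceMatcher


def strength(a, b):
    """Similarity of two strings as an integer percentage (0-100), ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def compare_elements(loc, auth, threshold=80):
    """Pair local and authority values, strongest matches first."""
    # Score every local/authority pair and sort so the best scores come first.
    pairs = sorted(
        ((strength(l, a), i, j) for i, l in enumerate(loc) for j, a in enumerate(auth)),
        key=lambda p: -p[0],
    )
    used_loc, used_auth, output = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break  # everything after this is too weak to count as a match
        if i in used_loc or j in used_auth:
            continue  # one side was already claimed by a stronger match
        used_loc.add(i)
        used_auth.add(j)
        output.append({"loc": loc[i], "auth": auth[j], "strength": score})
    # Cleanup: values left without a decent match are reported on their own.
    output += [{"loc": l, "auth": "", "strength": None}
               for i, l in enumerate(loc) if i not in used_loc]
    output += [{"loc": "", "auth": a, "strength": None}
               for j, a in enumerate(auth) if j not in used_auth]
    return output


def compare_records(local_dc, auth_dc):
    """Run compare_elements for every element type present in either record."""
    return {
        element: compare_elements(local_dc.get(element, []), auth_dc.get(element, []))
        for element in sorted(set(local_dc) | set(auth_dc))
    }


result = compare_elements(["Hector", "Albert", "Heyward"], ["Hector", "Alberto", "Perry"])
```

With the names used in the draft's comments, "Hector" pairs exactly, "Albert" pairs with "Alberto" at a strength below 100, and "Heyward" and "Perry" come out as unmatched rows.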

== Output Spec ==

=== Sample Output ===

<pre>
<hamr authority="PubMed">
  <match strength="100%">
    <creator src="input">Trojan, Tommy</creator>
    <creator src="authority">Trojan, Tommy</creator>
  </match>
  <match strength="90%">
    <title src="input">Great American Article</title>
    <title src="authority">Great American Article, The</title>
  </match>
  <nonmatch>
    <subject src="input">Medical Stuff</subject>
  </nonmatch>
  <nonmatch>
    <type src="authority">text</type>
  </nonmatch>
</hamr>
</pre>
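One way to produce output in this shape is sketched below with Python's standard `xml.etree.ElementTree`. The input layout (element name mapped to rows with `loc`, `auth`, and `strength` keys, as in the matching draft's comments) and the `build_hamr_xml` name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def build_hamr_xml(authority, record_matches):
    """Serialize match rows into the <hamr> structure shown in the sample output."""
    root = ET.Element("hamr", authority=authority)
    for element, rows in record_matches.items():
        for row in rows:
            if row["loc"] and row["auth"]:
                # Both sides present: a match with its strength as a percentage.
                parent = ET.SubElement(root, "match", strength=f"{row['strength']}%")
            else:
                # Only one side present: a value with no counterpart.
                parent = ET.SubElement(root, "nonmatch")
            if row["loc"]:
                ET.SubElement(parent, element, src="input").text = row["loc"]
            if row["auth"]:
                ET.SubElement(parent, element, src="authority").text = row["auth"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical rows mirroring the sample output.
matches = {
    "creator": [{"loc": "Trojan, Tommy", "auth": "Trojan, Tommy", "strength": 100}],
    "title": [{"loc": "Great American Article",
               "auth": "Great American Article, The", "strength": 90}],
    "subject": [{"loc": "Medical Stuff", "auth": "", "strength": None}],
    "type": [{"loc": "", "auth": "text", "strength": None}],
}
xml_text = build_hamr_xml("PubMed", matches)
```

The returned string is unindented; a display layer could pretty-print it or transform it into the color-coded comparison page.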

== Need to do ==

# Implement metadata retrieval from authority ''(done for CrossRef in Ryan's code)''
# Design structure of plugins
## crosswalk from authority format to simple DC

# Design matching algorithm

