Changes

2010talks Submissions

1 byte removed, 20:50, 13 November 2009

Matching Dirty Data - Yet another wheel

* Jeff Sherwood, University of Washington Libraries, jeffs3 at u washington edu

Regular expressions are a powerful tool for identifying matching data between similar files. When one or both files contain inconsistent data because of differing character encodings or miskeying, using regular expressions to find matches becomes impractically complex.

The Levenshtein distance (LD) algorithm is a basic sequence comparison technique that can be used to measure word similarity more flexibly. Using the LD to quantify the difference between two strings eliminates the need to identify, and encode in regex patterns, every way in which otherwise matching strings might be inconsistent. Instead, a similarity threshold is tuned to identify close matches while eliminating false positives.
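
As a rough illustration of the idea, here is a minimal Python sketch of LD-based matching. It is not the implementation from the talk: the function names, the normalization by the longer string's length, and the 0.8 threshold are all assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to use less memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def close_matches(needle: str, candidates: list[str], threshold: float = 0.8) -> list[str]:
    """Return candidates whose similarity to needle meets the (hypothetical) threshold."""
    return [c for c in candidates if similarity(needle, c) >= threshold]


if __name__ == "__main__":
    names = ["Levenstein", "Levenshtien", "Hamming"]
    print(close_matches("Levenshtein", names))
```

Raising the threshold cuts false positives at the cost of missing more heavily miskeyed entries, so in practice it would need to be tuned against a sample of the actual data.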

Younga3