85
edits
Changes
→Matching Dirty Data - Yet another wheel
* Jeff Sherwood, University of Washington Libraries, jeffs3 at u washington edu
Regular expressions are is a powerful tool to identify matching data between similar files. When one or both of these files has inconsistent data due to differing character encodings or miskeying, the use of regular expressions to find matches becomes impractically complex.
The Levenshtein distance (LD) algorithm is a basic sequence comparison technique that can be used to measure word similarity more flexibly. Employing the LD to calculate difference eliminates the need to identify and code into regex patterns all of the ways in which otherwise matching strings might be inconsistent. Instead, a similarity threshold is tuned to identify close matches while eliminating false positives.