| Deduplication can mean different things in different contexts.
Here we look at its role in identifying database records that,
according to programmatic rules and algorithms, can be treated
as the same.
Deduplication of name and address data predates computers. One
of the earliest phonetic algorithms was Soundex, which was used
to index US census data. Soundex was developed by Robert Russell
and Margaret Odell and patented in 1918. The Soundex code for a
name consists of a letter followed by three digits: the letter
is the first letter of the name, and the digits encode the
remaining consonants. Vowels are dropped, except when a vowel is
the first letter of the name. Similar-sounding consonants share
the same digit; for example, B, F, P and V are all encoded as 1.
A similar algorithm called "Reverse Soundex" prefixes the last
letter of the name instead of the first.
Such algorithms allow names to be treated as a match even when
they are not spelt identically. The method used by Soundex
is based on the six phonetic classifications of human speech
sounds (bilabial, labiodental, dental, alveolar, velar, and
glottal), which in turn are based on where you put your lips
and tongue to make the sounds.
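The encoding described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: it assumes the commonly described American Soundex rules, including details the text does not spell out (zero-padding short codes, collapsing adjacent letters that share a digit, and treating H and W as transparent between such letters). The names `soundex`, `_GROUPS` and `_CODES` are chosen here for illustration only.

```python
# Consonant groups and their digits, assuming the commonly described
# American Soundex variant. Vowels, H, W and Y have no digit.
_GROUPS = {
    "BFPV": "1",
    "CGJKQSXZ": "2",
    "DT": "3",
    "L": "4",
    "MN": "5",
    "R": "6",
}
_CODES = {letter: digit for letters, digit in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" if name has no A-Z letters."""
    # Keep only the 26 unaccented letters; everything else is ignored.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    code = letters[0]                      # the first letter is kept as-is
    previous = _CODES.get(letters[0], "")  # its digit still suppresses a repeat

    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only for consonants that differ from the previous group.
        if digit and digit != previous:
            code += digit
        # H and W do not break a run of same-digit letters; vowels do.
        if ch not in "HW":
            previous = digit

    # Pad with zeros or truncate so the result is one letter plus three digits.
    return (code + "000")[:4]
```

Under these assumptions, `soundex("Robert")` and `soundex("Rupert")` both give `R163`, so the two spellings would be treated as a match even though they differ.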
If you are considering using Soundex for a commercial system,
you might want to think again. Soundex is a fairly poor algorithm
for fuzzy name comparison and returns a high number of false
positive matches (such as Wilson and Wilkins, or Brady and Broad).
The problem was significant enough that in 1970 New York State
commissioned a study of phonetic coding, which produced a
derivative of Soundex called the New York State Identification
and Intelligence System (NYSIIS). Its accuracy has been reported
as about 2.7% better than Soundex on average. Soundex, and
therefore NYSIIS, is limited to the 26 letters of the Latin
alphabet.
|