Continuing on from last week’s exploration of duplicates (and why we should care about duplication in our systems), this week we will explore some of the different methodologies employed to deal with duplication – the merits of “fuzzy logic”, and the age-old conundrum of “over vs. under” deduping.

There are a number of approaches to take when seeking out duplicates. More often than not it comes down to two main points:

  1. What fields are available within the data, and
  2. What the data will be used for – i.e. the business requirements of the data.

However, other points for consideration include:

  • Data Consistency/Normalisation (or, more importantly, the lack thereof)
  • Data Entry errors, such as keying the street number before the unit number
  • Misspelled information, such as ‘Jonse’ instead of ‘Jones’

Where the data is consistent and its final use is for telemarketing purposes, a straight dedupe on the phone number will often suffice. This may still leave multiple records from the same household within the data, but for some this will be acceptable, and will provide a greater opportunity for contact.
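
As a rough illustration only (not a prescription), the Python sketch below dedupes a file on a normalised phone number, keeping the first record seen for each number. The field name “phone” and the file name are hypothetical.

    # Minimal sketch of a straight phone-number dedupe, keeping the first
    # record seen for each number. Field and file names are illustrative only.
    import csv

    def phone_key(row):
        # Strip everything but digits so "(02) 9687 4666" and "0296874666"
        # produce the same key.
        return "".join(ch for ch in row["phone"] if ch.isdigit())

    def dedupe_by_phone(rows):
        seen, unique = set(), []
        for row in rows:
            key = phone_key(row)
            if key and key not in seen:
                seen.add(key)
                unique.append(row)
        return unique

    with open("prospects.csv", newline="") as f:
        deduped = dedupe_by_phone(csv.DictReader(f))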

If the data is to be used for mailing, then a concatenation of the address elements will often allow for a rudimentary dedupe. The usual “catch” is those pesky ‘vanity suburbs’! E.g. Balmoral Beach, Clifton Gardens, Beauty Point and Spit Junction are all vanity suburbs of Mosman NSW 2088.
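
To make that concrete, here is a hedged sketch of such a concatenated address key; the field names are illustrative. Note how a vanity suburb defeats it unless the locality is normalised first.

    # Rudimentary mailing dedupe key built by concatenating address elements.
    # Field names are illustrative. Without a locality lookup, "BALMORAL BEACH"
    # and "MOSMAN" yield different keys even though both sit in postcode 2088 --
    # the vanity-suburb catch mentioned above.
    def address_key(row):
        parts = (row.get(field, "") for field in
                 ("unit", "street_number", "street", "suburb", "state", "postcode"))
        # Upper-case and strip whitespace so trivial formatting differences
        # don't create "distinct" records.
        return "|".join(p.strip().upper() for p in parts)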

All of these approaches are fine, and can even be done in Microsoft Excel 2007 and above. Be warned, though: the slightest discrepancy in the data will be treated as a distinct record, and thus not removed from the list. Furthermore, you’re unable to direct Excel to the record you’d prefer to keep in the final dataset.
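
If you have outgrown Excel, a scripted approach gives you that control. The pandas sketch below (column names are hypothetical, and it assumes an ISO-style date column) keeps the most recently updated record for each phone key, rather than whichever one happens to appear first.

    # Sketch: choosing which record survives the dedupe -- something Excel's
    # Remove Duplicates won't let you do. Column names are illustrative.
    import pandas as pd

    df = pd.read_csv("prospects.csv", dtype=str)
    df["phone_key"] = df["phone"].str.replace(r"\D", "", regex=True)

    kept = (
        df.sort_values("last_updated", ascending=False)   # preferred record first
          .drop_duplicates(subset="phone_key", keep="first")
    )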

Another method of deduping data is centred on “fuzzy logic”. The whole idea behind this approach is to find records that are approximately the same, rather than requiring an exact match. The benefits show when information has been incorrectly keyed, or when we are dealing with initials versus full names, for example.
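
As a simple, hedged illustration of the idea (using only Python’s standard library, not any particular commercial engine), a similarity ratio can flag near-matches like the ‘Jonse’/‘Jones’ example above:

    # Toy fuzzy match using the standard library. Pairs scoring above a chosen
    # threshold are flagged as potential duplicates for review.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    print(similarity("Jones", "Jonse"))         # ~0.8 -- a likely mis-key
    print(similarity("J Smith", "John Smith"))  # initials vs full name still score well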

One of the long-standing industry benchmarks in this field is our Twins product. As mentioned last week, this application comes in both desktop and Server API versions, and allows for the effective matching of “like” datasets.

One of the tasks performed by the Twins software is the creation of “Match Keys”. Match keys are essentially alpha-numeric strings of characters, derived from the inputs you send to the application, and are proprietary to DataTools. By creating these, you are able to go back over the data at any point in the future, or wash additional datasets against it, without having to re-process all the data in question. This is a highly efficient approach: so long as the data is not changed after the keys are created, the match keys can be re-used time and time again.
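
As a generic illustration of the match-key concept only (the actual DataTools keys are proprietary and derived quite differently), the sketch below derives a stable key from normalised inputs, so a later file can be washed against the stored keys without re-matching the originals.

    # Generic illustration of the match-key idea -- NOT the DataTools/Twins
    # algorithm. A key is derived once from normalised inputs; later datasets
    # can be washed against the stored keys without re-processing the originals.
    import hashlib

    def match_key(name, address, postcode):
        # Normalise the inputs so formatting differences don't change the key.
        raw = "|".join(s.strip().upper() for s in (name, address, postcode))
        return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]

    stored_keys = {match_key("John Jones", "12 Example St", "2088")}

    # Months later, a new record arrives -- no need to re-match the old file.
    is_duplicate = match_key("JOHN JONES", " 12 example st", "2088") in stored_keys  # True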

The final question you’ll need to answer, before you go ahead and dedupe your data, is “under” or “over”? And in this case, we’re not referring to a limbo pole! Do you want to under-dedupe the file and leave an element of duplication in place, or over-dedupe it and sacrifice a few good records to the gods?

The inherent challenge with automated deduping technology is that it forces a trade-off: you either leave duplicates within the system, or you sacrifice good data to the process. We generally recommend that businesses use both approaches, but in different circumstances.

For example, a prospect universe can generally sustain a level of duplication at the repository level – not always, but generally speaking. If you’re using Match Keys, you can then perform additional deduping on output, such as a tight Address or Phone match for a Mailing or Telemarketing campaign respectively. This has the desired net effect for the channel in question, and ensures each contact receives only one communication.
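
A hedged sketch of that output-stage step, with illustrative field names: the repository keeps its duplicates, but each campaign extract is deduped on the key that matters for its channel.

    # Channel-specific dedupe on output: tight phone key for telemarketing,
    # tight address key for mailing. Field names are illustrative only.
    def dedupe_for_channel(rows, channel):
        key_fns = {
            "telemarketing": lambda r: "".join(c for c in r["phone"] if c.isdigit()),
            "mailing": lambda r: "|".join(r[f].strip().upper() for f in
                                          ("unit", "street_number", "street", "postcode")),
        }
        key_fn = key_fns[channel]
        seen, out = set(), []
        for row in rows:
            key = key_fn(row)
            if key not in seen:
                seen.add(key)
                out.append(row)
        return out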

When it comes time to remove prospects that already exist in your customer universe, you’ll often want to err on the side of “over” deduping. This ensures that no basic acquisition-type campaigns are sent to existing customers – which is both embarrassing for the person handling the response to the marketing, and a waste of money and time for all concerned. “Over” deduping can be performed at different levels, such as phone or address, and it is generally accepted practice to include a number of other data elements in these match methods to reduce erroneous matches.
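
A sketch of that suppression step, erring deliberately on the side of “over” deduping: a prospect is dropped if either its phone key or its address key matches a customer record. Field names are, as before, illustrative.

    # "Over" dedupe of prospects against customers: suppress a prospect if
    # EITHER its phone or its address matches a customer record. Better to
    # lose a few prospects than to send an acquisition offer to a customer.
    def phone_key(r):
        return "".join(c for c in r["phone"] if c.isdigit())

    def addr_key(r):
        return "|".join(r[f].strip().upper() for f in ("street_number", "street", "postcode"))

    def suppress_customers(prospects, customers):
        customer_keys = {phone_key(c) for c in customers} | {addr_key(c) for c in customers}
        return [p for p in prospects
                if phone_key(p) not in customer_keys and addr_key(p) not in customer_keys]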

At the end of the day, though, the data is yours, and yours to do with as you wish. There are many different ways to ensure that your database is “clean” and duplicate-free, and as long as you’re doing something to aid your cause, you’ll be on the right track.

For a more thorough understanding of the Twins product, please check out the DataTools site, or contact one of our friendly sales staff on (02) 9687 4666.