Blog author: Chuck Schardong, Senior Consultant, Utopia, Inc.
Applying data quality routines to name and address data can be a challenge when the source system does not provide reliable (or known) attribute metadata. For example, a column field called “NAME” that contains any combination of business and personal name data…is “Howard Johnson” the company or the a person? Or when that “NAME” field also contains addresses or phone numbers or dates, etc. Software solutions and best-practice methodology (most effective when combined) can often help achieve 95% accuracy or more in identifying these challenging conditions.
However, there is a rare but historic data quality problem that I have recently identified in several consecutive projects. In my opinion, it is one of the most rogue conditions – one that packaged software solutions simply were not designed to handle. These data quality products excel when all of the data to be cleansed is contained within a single column field. However many legacy applications have character limitations which often are not long enough to handle all of the necessary data the user is required to enter. So users get creative and manually truncate the data…often putting the remaining characters in a completely separate column field. To illustrate a couple of common scenarios:
A bank teller opening an account for “John, Mary, Peter, Taylor, Jennifer Smith Family Trust” may only have 25 characters in the form. So the user starts getting creative and could en
ter the data as follows:
John Mary Peter Taylor
Jennifer Smith Family TrustScenario 2
An entry clerk needs to enter the business name “International Processing Lab Inc” into a 25 character form. So the user may enter the data as follows:Name1
Lab IncScenario 3
A shipping supervisor has to enter an order for “Practical Everyday Solutions LLC” into a 20 character form. So the user gets a little lazy and splits the name as follows:Name1
Practical Everyday S
The data quality challenge, therefore, is to determine when each discrete field contains a discrete full name or identify/correct scenarios where an entity is split across multiple column fields. Ideally the users would put some type of indicator in the data to mark where the split occurred, but it is exceedingly rare in my observations.
As I stated earlier, my personal experience is that data quality products or services cannot handle these split conditions with their out-of-the-box rules. Best practice methodologies can be used to identify and resolve these scenarios so long as the data quality products allow the configuration and customization of their data dictionaries and pattern files. I have personally implemented successful solutions utilizing the SAP® BusinessObjects™ Data Services software.
Not saying it is easy, but it is
I’m curious about other scenarios others may have seen or other unique approaches to identify and resolve these conditions?
Catch more great blogs on our blog site bit.ly/btyHox
An email has been sent to: