What counts as clean data? It depends…

We’ve all been told, ad nauseam, that it’s important to have clean data, and if you’ve looked into it, you know what it can cost to have someone clean up your dirty data. But what does it mean to say your data is clean? In this post, I provide a definition and discuss why someone else’s clean data might not be clean for you.

In the simplest terms, to say that data is clean is to say that it’s up-to-date enough, complete enough, and in the right format for how you need to use it. It’s that “how you need to use it” part – more formally called your use cases and processes – that means that someone else’s clean data might not be clean for you.

Example: Outdated dirty data

Let’s say you bought a mailing list for a direct mail marketing campaign. The company selling the list will usually verify that the physical addresses are valid according to the postal service, but they’re not doing anything to verify that the business name or individual’s name associated with each address is correct. So, in one sense, that list is clean: if you sent postal mail to those addresses, it would arrive, because each physical location exists. In another sense, if you’re trying to drum up new business using this list, how likely are you to succeed if the recipient sees that the mail is addressed to someone else, even if they are the “current resident”?

[Image: out-of-date mailing list sending to the wrong recipient name]

This data is dirty, because it’s not up-to-date for how you want to use it. You probably have a good number of clean entries, but you’ll have some dirty duds as well. The impact of this dirty data on your budget and overall campaign success might be low, particularly if you’re cold contacting the people on this list, but it should give you a starting point for thinking about clean data and use cases.
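If your list records when each name was last confirmed, a rough sketch of an "up-to-date enough" check might look like the following. The `name_verified_on` field, the sample entries, and the one-year threshold are all assumptions made for illustration; your own use case decides what counts as too old.

```python
from datetime import date, timedelta

def split_by_freshness(entries, today, max_age_days=365):
    """Separate entries whose name was confirmed recently from stale ones."""
    cutoff = today - timedelta(days=max_age_days)
    fresh, stale = [], []
    for entry in entries:
        verified = entry.get("name_verified_on")  # hypothetical field name
        if verified is not None and verified >= cutoff:
            fresh.append(entry)
        else:
            # Never verified, or verified too long ago for this use case.
            stale.append(entry)
    return fresh, stale

# Made-up sample data, for illustration only.
mailing_list = [
    {"name": "A. Lopez", "address": "12 Oak St", "name_verified_on": date(2024, 3, 1)},
    {"name": "B. Chen", "address": "98 Elm Ave", "name_verified_on": date(2019, 6, 15)},
    {"name": "C. Patel", "address": "5 Pine Rd", "name_verified_on": None},
]

fresh, stale = split_by_freshness(mailing_list, today=date(2024, 9, 1))
```

Changing `max_age_days` is how the same list becomes "clean" for one campaign and "dirty" for another.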

Example: Incomplete dirty data

What if you’re using your address list to assign different salespeople to client leads based on region? You’d first need to make sure that you’ve supplemented your address list with region information, because it’s incomplete without this classification. Once you’re recording regions, if you forget to assign a region to a client lead, you’ve made your data dirty by making it incomplete in a different way.

[Image: incomplete / unfilled data]

You might lose that sale, because the lead is never properly assigned to someone for follow-up. This type of dirtiness can also hurt your metrics and analyses, because records with holes in them cannot be properly measured and assessed. So it’s junk, not information.

To alleviate this issue, you can make sure that the information you need is required by the system, so that no one can leave it blank.
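Most systems let you mark fields as required in their settings, but if you are handling records yourself, the idea can be sketched roughly like this. The field names, the `save_lead` helper, and the sample leads are hypothetical, chosen only to match the sales-region example.

```python
# Hypothetical list of fields this use case cannot do without.
REQUIRED_FIELDS = ("name", "address", "region")

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing

def save_lead(record, store):
    """Add a lead to the store only if every required field is filled in."""
    missing = missing_fields(record)
    if missing:
        raise ValueError("Missing required fields: " + ", ".join(missing))
    store.append(record)

leads = []
save_lead({"name": "Acme Co", "address": "12 Oak St", "region": "Northeast"}, leads)

try:
    # The region is blank, so this lead is rejected instead of silently saved.
    save_lead({"name": "Bolt LLC", "address": "98 Elm Ave", "region": " "}, leads)
except ValueError as err:
    rejected = str(err)
```

Rejecting the record at entry time means the gap is caught while someone can still fill it in, rather than months later in a report.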

Example: Wrong format dirty data

What if you’re using your address list to match up to someone else’s database, like US census tracts? To get the census tract information you need, you have to enter each address in the format the US Census Bureau expects. If you know you need to use addresses this way, you can support the process by making sure your address information maps cleanly to the US Census Bureau’s form fields.

[Image: mapping data to form fields]

(If you’re curious about this example, you can find the form in question at https://geocoding.geo.census.gov/geocoder/geographies/address.)
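To make the mapping idea concrete, here is a minimal sketch that translates one record into the kind of fields an address lookup expects and builds a lookup URL without sending it. The left-hand column names are hypothetical stand-ins for your own system. The right-hand names (`street`, `city`, `state`, `zip`) and the `benchmark`, `vintage`, and `format` values reflect my understanding of the geocoder’s query parameters and should be treated as assumptions to check against the Census Bureau’s own documentation.

```python
from urllib.parse import urlencode

# Hypothetical internal column names mapped to the assumed geocoder field names.
FIELD_MAP = {
    "street_line": "street",
    "city_name": "city",
    "state_code": "state",
    "postal_code": "zip",
}

GEOCODER_URL = "https://geocoding.geo.census.gov/geocoder/geographies/address"

def to_census_fields(record):
    """Rename our columns to the geocoder's field names, skipping blank values."""
    out = {}
    for ours, theirs in FIELD_MAP.items():
        value = str(record.get(ours, "")).strip()
        if value:
            out[theirs] = value
    return out

def build_lookup_url(record, benchmark="Public_AR_Current", vintage="Current_Current"):
    """Build a lookup URL for one address (assumed parameters; no request is sent)."""
    params = to_census_fields(record)
    params.update({"benchmark": benchmark, "vintage": vintage, "format": "json"})
    return GEOCODER_URL + "?" + urlencode(params)

# Made-up internal record, for illustration only.
record = {
    "street_line": "4600 Silver Hill Rd",
    "city_name": "Washington",
    "state_code": "DC",
    "postal_code": "20233",
}
url = build_lookup_url(record)
```

If your system stores an address as one free-text blob, there is nothing to map from, which is exactly the “wrong format” kind of dirty.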

Your Turn

As you look at your own data, processes, and use cases, you should ask yourself use-specific questions. There are templates available for defining use cases, but it can also be useful to do something called “starting with the end in mind.” Looking at your metrics and other reporting needs, ask yourself and the others who interact with your data:

  • What information do you need out of the system?
  • What do you think that information means, that is, what do you plan to do based on that information?

The answers to these questions help determine what data you need in order for your data to count as clean, which fields should be required, and how up-to-date that information needs to be in order to be useful.

This blog post is a reworking of the transcript of a video on the Blou Designs YouTube Channel. If you would like to “watch” this post, you can find it at: https://youtu.be/GPosdG8YkyU

Author: Barbara

Barbara is the Managing Member and Primary Consultant of Blou Designs LLC
