Cleaning Data – Why Bother?

The purpose of cleaning data is straightforward – to make it fit for purpose.

The level of what fitness for purpose means obviously leaves room for interpretation but I always look at it from a completely mechanical viewpoint because when it comes down to it we’re talking about data that’s being used by computer systems. Even if data values look the same, sometimes they just aren’t. Take the following four string values :

” Liverpool”, “Liverpool “, “Liverpool”, ” Liverpool ”

Yes it’s true that each one says the same thing to a human reader but to a computer they are different. Each value has a number of spaces within it and a computer system will not equate any of them to be the same. Now take a look at the following four values :

“01/06/2016”, “06-01-2016”, “1-JUN-2016”

Given the previous example you could say that each is different but it’s quite obvious that actually each value could be treated the same as, strangely, a computer system may interpret each as a correct Date value and equate all three values as being the same.

So we have two problems that you need to be aware of :

  1. What is the actual underlying value
  2. How is that value being represented and/or displayed to the human eye and is it different to how a computer is interpreting that value.

Therefore it can be necessary to be very specific when enquiring upon data to ensure that what you see is the real value, or an interpretation of it that is based upon an expected value. Take an age old problem that can catch people out – decimal number values in Microsoft Excel :

Enter a value of say “15.999” into a cell then change the number format to remove any decimal places shown. Of course, as those of you with some experience of Excel will know, the number will show as “16” on your screen. Perform any calculations using that cell however and the value being used in your calculation will be “15.999”. However if you then chose to export your file you need to be aware that performing this action in Excel can change your underlying values for formatted values. Any calculations will be based upon real values but formatted values can be exported exactly as shown on the screen which may not be what you required to happen!

So the whole point of cleaning data is to ensure that it’s fit for purpose. The next few pages suggest some starting points for different data types.

Next – Cleaning Strings


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s