Cleaning Data – Numbers


Numbers can be stored in several ways, each subject to its own restrictions. For example, there are several types of Integer (whole number), which differ in how large a value they can hold in the positive and negative directions. There are four basic types of Integer (the sizes shown are typical, but the exact size of each type can vary by language and platform) :

Short. Uses 2 bytes. Allows values from -32,768 to 32,767
Unsigned Short. Uses 2 bytes. Allows values from 0 to 65,535
Long. Uses 4 bytes. Allows values from -2,147,483,648 to 2,147,483,647
Unsigned Long. Uses 4 bytes. Allows values from 0 to 4,294,967,295
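
The ranges above follow directly from the number of bytes used. As a rough sketch in Python (the function names `int_range` and `fits` are made up for this illustration and are not from any library), you can work out the range for a given size and check whether a value will fit before you try to store it:

```python
def int_range(bits: int, signed: bool = True) -> tuple:
    """Return the (lowest, highest) value an integer of this width can hold."""
    if signed:
        # One bit is used for the sign, so the range is split around zero.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def fits(value: int, bits: int, signed: bool = True) -> bool:
    """True if value can be stored in the target type without overflowing."""
    low, high = int_range(bits, signed)
    return low <= value <= high


print(int_range(16))                   # (-32768, 32767)       i.e. Short
print(int_range(32, signed=False))     # (0, 4294967295)       i.e. Unsigned Long
print(fits(40000, 16))                 # False, too big for a Short
print(fits(40000, 16, signed=False))   # True, fits an Unsigned Short
```

A check like this is most useful when the target system has fixed-size types, such as a database column or a binary file format.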

Then there are floating point types which allow numbers with decimal places. There are three basic types :

Float. Uses 4 bytes. Allows magnitudes from 1.2E-38 to 3.4E+38. About 6 significant digits
Double. Uses 8 bytes. Allows magnitudes from 2.2E-308 to 1.8E+308. About 15 significant digits
Long Double. Uses 10 bytes. Allows magnitudes from 3.4E-4932 to 1.1E+4932. About 19 significant digits
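
To get a feel for what the difference in precision means in practice, here is a small Python sketch. Python's own `float` is a Double, so the helper `as_float32` (a name invented for this example) uses the standard `struct` module to squeeze a value into a 4 byte Float and read it back:

```python
import struct
import sys


def as_float32(value: float) -> float:
    """Round-trip a Python float (a Double) through a 4 byte Float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


original = 0.123456789012345
print(original)              # the Double keeps all 15 digits
print(as_float32(original))  # the Float only agrees for roughly the first 7 digits

# The limits of Python's float (a Double) on this machine.
print(sys.float_info.max)    # largest representable value
print(sys.float_info.dig)    # decimal digits that can be relied upon
```

Values that are exactly representable in binary, such as 0.5, survive the round trip unchanged. Most decimal fractions, such as 0.1, do not.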

Cleaning numbers can be slightly tricky. Not only do you need to consider the target numeric types you require, but you also need to decide what to do if any of the following values appear in your source data, as seen in Excel or a text editor for example :

  1. ‘blank’ values i.e. no value present, shown as a blank
  2. ‘NULL’ or ‘null’ i.e. a specific null value
  3. ‘NaN’ i.e. “Not A Number”
  4. Scientific expressions of numeric values i.e. a representation of a very small or very large real number using “E” notation e.g. 1.2E-10, which represents the value 0.00000000012.
  5. Infinity values i.e. a value in a numeric field that is used to denote infinity e.g. ‘inf’.
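
A first step in dealing with these values is simply to find out which of them you have. The following is a minimal Python sketch, assuming each field arrives as a string; the function name `classify` and the category labels are my own and only for illustration. Note that Python's built-in `float()` happily accepts ‘NaN’, ‘inf’ and “E” notation, so those cases have to be tested for explicitly:

```python
import math


def classify(raw: str) -> str:
    """Label a raw field as one of the special cases listed above."""
    text = raw.strip()
    if text == "":
        return "blank"
    if text.lower() == "null":
        return "null"
    try:
        number = float(text)
    except ValueError:
        return "invalid"      # not a number at all, e.g. 'abc'
    if math.isnan(number):
        return "nan"
    if math.isinf(number):
        return "infinity"     # includes values too large for a Double, e.g. '1e400'
    if "e" in text.lower():
        return "scientific"
    return "number"


for field in ["", "NULL", "NaN", "1.2E-10", "inf", "42", "abc"]:
    print(repr(field), "->", classify(field))
```

Running something like this over a whole column and counting the labels gives you a quick profile of how dirty the data is before you decide on a rule for each case.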

If you are working on a one-off cleaning task then this is not much of an issue, but if you need to automate the cleaning then you will need to either improve the data quality of your source or build a transformation rule that deals with all possible special values.

The issue with special values, of course, is that in some programming languages a value that cannot be converted to the expected numeric type will cause an error. Because these special values arrive as what are essentially strings, how you deal with them will determine the success of any cleaning operation.
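
As an example of such a transformation rule, here is one possible sketch in Python. It assumes that every special value should be replaced by a single default (here `None`), which may not be what you want for your data; the function name `to_float` is made up for this illustration:

```python
import math
from typing import Optional


def to_float(raw: str, default: Optional[float] = None) -> Optional[float]:
    """Convert a raw field to a float, mapping every special value to default."""
    text = raw.strip()
    if text == "" or text.lower() == "null":
        return default                  # blank and NULL
    try:
        number = float(text)            # also handles 'E' notation
    except ValueError:
        return default                  # anything that is not numeric
    if math.isnan(number) or math.isinf(number):
        return default                  # NaN and infinity
    return number


print(to_float("12.5"))      # 12.5
print(to_float("NULL"))      # None
print(to_float("inf", 0.0))  # 0.0
```

Catching the error inside the rule means one bad field no longer stops the whole run, although you will probably also want to log what was replaced.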

Cleaning Numbers :

Given the above, consider the following when cleaning numeric data :

  1. Determine the source and target numeric formats. When translating from one type of number to another, make sure you are not trying to force in values outside the range that the target type allows. You will also lose precision when converting Real numbers into Integers i.e. you will lose any value after the decimal point.
  2. Ensure that fields contain only characters that belong to numeric formats. Characters such as ‘.’ (or ‘,’ in some parts of the world) should be allowed for real numbers, as should the sign indicators ‘+’ and ‘-’. If scientific numbers may be encountered then the ‘E’ or ‘e’ characters can be allowed. Format may also be important, as sometimes the ‘-’ appears at the end of a numeric field to indicate that the number is negative. Negative numbers can also be surrounded by brackets – this is a common format used by Accountants in Microsoft Excel e.g. (5000) means -5000.
  3. Ensure correct range. If you are reading percentage data, for example, you may only be expecting values from 0 to 100, perhaps with decimal places, so a range check may be required. Some of your percentage data might be suffixed with a ‘%’ symbol, or even expressed as a value between 0 and 1, where 0.2 for example actually means 20%.
  4. If you perform calculations when transforming numeric data, be aware of implicit rounding that can introduce unwanted errors. This can occur when you store the result of a calculation that naturally produces tiny fractional values in a numeric type that does not have the precision required to hold it. The rounding happens implicitly, due to the precision of the numeric types involved, and you may get undesirable results. It is good practice to perform calculations involving multiplication and division at the highest level of precision possible and to round only as the last step.
  5. Be aware of Proportionality in repeating groups. If you have data spread over multiple rows, and you are calculating proportions between these rows, be prepared to adjust the calculated values so that the proportions, or their total, remain correct. For example – if a number of rows of data are supposed to add up to 100% then rounding each row separately may leave a total of 99.9% or 100.1%, and you may need to adjust accordingly.
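
Points 2 and 3 can be sketched in Python along the following lines. This is only one way of doing it, and the function names `parse_number` and `parse_percent`, along with their parameters, are invented for this example. It also assumes that you know in advance which decimal separator your source uses, and that an unsuffixed value between 0 and 1 is a fraction only when you say so, because a bare ‘1’ could mean either 1% or 100%:

```python
def parse_number(raw: str, decimal_sep: str = ".") -> float:
    """Parse a number that may use brackets or a trailing sign."""
    text = raw.strip()
    negative = False
    # Accounting style: (5000) means -5000.
    if text.startswith("(") and text.endswith(")"):
        negative = True
        text = text[1:-1].strip()
    # Trailing sign: 5000- means -5000.
    if text.endswith("-"):
        negative = True
        text = text[:-1].strip()
    elif text.endswith("+"):
        text = text[:-1].strip()
    # Drop the thousands separator, then normalise the decimal separator.
    thousands_sep = "," if decimal_sep == "." else "."
    text = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    value = float(text)  # raises ValueError for anything still not numeric
    return -value if negative else value


def parse_percent(raw: str, fractions_allowed: bool = False) -> float:
    """Return a percentage from 0 to 100, accepting '20%', '20' or (optionally) '0.2'."""
    text = raw.strip()
    has_symbol = text.endswith("%")
    if has_symbol:
        text = text[:-1]
    value = parse_number(text)
    if fractions_allowed and not has_symbol and 0 <= value <= 1:
        value *= 100  # assumption: 0.2 means 20%
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {raw!r}")
    return value


print(parse_number("(5000)"))                     # -5000.0
print(parse_number("5000-"))                      # -5000.0
print(parse_number("1.234,56", decimal_sep=","))  # 1234.56
print(parse_percent("20%"))                       # 20.0
```

Calling `parse_percent("150%")` raises a `ValueError`, which is the range check doing its job.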
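
For points 4 and 5, one common way to keep rounded proportions adding up to exactly 100% is the “largest remainder” method: calculate at full precision, round every row down, then hand the leftover units to the rows that lost the most. The Python sketch below is my own illustration of that idea rather than the only way to do it. It assumes the values are not negative, and the function name `proportions` is made up:

```python
import math
from fractions import Fraction


def proportions(values, places: int = 1):
    """Percentages for each value, rounded to `places`, that total exactly 100."""
    exact_values = [Fraction(v) for v in values]  # exact arithmetic, no implicit rounding
    total = sum(exact_values)
    if total <= 0:
        raise ValueError("values must have a positive total")
    scale = 10 ** places
    # Work in whole units of the smallest step, e.g. tenths of a percent.
    exact = [v * 100 * scale / total for v in exact_values]
    units = [math.floor(x) for x in exact]
    shortfall = 100 * scale - sum(units)
    # Give one extra unit to the rows with the largest discarded remainders.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - units[i], reverse=True)
    for i in order[:shortfall]:
        units[i] += 1
    return [u / scale for u in units]  # rounding to a float happens only here


rows = [1, 1, 1]
naive = [round(v / sum(rows) * 100, 1) for v in rows]
print(naive)              # [33.3, 33.3, 33.3], which totals only about 99.9
print(proportions(rows))  # [33.4, 33.3, 33.3]
```

Which row receives the adjustment is a business decision as much as a technical one, so treat the tie-breaking here (earlier rows first) as a placeholder.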

Next – The State Of Data (Coming soon)

