Cleaning Data – Strings

A string, as it is known, is a series of alphanumeric characters and will most commonly include those letters, numbers and punctuation present in sentences for any particular human language. It can however include what are known as special characters.

Cleaning Strings :

  1. Trim “spaces” from the beginning and end of a string. It is useful to note here that sometimes what looks like a space may not actually BE a space but a different character and it’s just that what is shown on screen is a space rather than the underlying character code. Confused? Well for those that don’t know, every letter or number or special character has a “code number” and it just so happens that this code number for a “space” that you would type into your computer using the space bar on your keyboard is 32. If you use something the “TRIM” function in Excel to remove spaces from a string value what you are actually asking is to remove characters with code 32 at the beginning and end of your string value. If the character, that is SHOWN on your screen is being represented as a space but actually isn’t, and has a different code number, then it will not be removed.
  2. Remove Special Characters. It is sometimes easier to decide which characters are allowed and remove those you don’t want. This is easy enough if you are working with data expressed in English but can get more complicated when you need to take account of other Languages that use for example umlauts (those characters with two dots above the letter like “ü”) as in the German language. It is possible to create replacement lists for characters so as to provide an “equivalent” value (e.g replace “â” or “ã” or “ä” or “å” with a straight “a”) but it really depends on what you need to achieve.
  3. Homogenise Values. Some string values within the same column of a file will actually represent a particular value that is a member of a larger set of possible values. For example a value such as “United Kingdom” represents a geographical area. Sometimes values such as these will have been typed into a computer system rather than being selected so it will possibly have many spelling mistakes. While it is accepted here that it can be difficult to provide a means to resolve every possible permutation of spelling mistake the objective would still be the same.  To make data fit for purpose, and to enable it to be used, the mistakes would need to be corrected. Sometimes this correction can be achieved through programmatic means but on many occasions the only way to deal with this problem is to get a complete list of values entered and then provide a replacement value for someone to manually correct or correct via a macro etc
  4. Homogenise Meaning. This subject becomes a little more complicated but essentially the purpose of homogeneity of meaning is to “clean” or provide a mutually compatible value for two different string values. For example where two strings use different combinations of values to mean the same object such as in city names like “Zurich” which can also be spelt “Zuerich” or “Zürich”. Another example is in ball bearing product names as described in this article about TAMR.

Next – Cleaning Numbers

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s