Blog

First Build

This evening I finally got to the first step of building my Pi3 Cluster. Over the last few weeks I’ve examined a lot of pictures and after a couple of false starts I think I have everything required.

IMG_6375

I bought from a number of sources :

Perspex (see through acrylic) from www.directplastics.co.uk

Blinkt! LED Indicators from shop.pimoroni.com/products/blinkt

Aluminium spacer kit from www.modmypi.com

4mm Aluminium spacers from uk.mouser.com

Corsair CX600 PC Power Supply from PC World

40 pin GPIO Connector Header Extender 90 Degree Angle from www.modmypi.com

Aluminium Heat Sink kit for Pi3 from a seller on eBay

Cat 6 Patch Cables from a seller on eBay

For each Pi3 I inserted the Blinkt! LED strip onto one side of the 90 degree GPIO Extender and then the other side of the GPIO strip onto the board. Then I attached two of the heat sinks onto the relevant size chips on the Pi3 board.

Next I took my precut perspex, the size of which I had figured out using a piece of card, and then using a small drill bit and an Archimedes drill I made 4 holes into my perspex.

IMG_6381

Once these small holes were made in the appropriate locations I drilled the proper sized holes. I managed to break a few pieces but on the third attempt I had success.

IMG_6382

Once I was happy with the location of the Pi3 mounting holes I drilled the three outer spacer holes. Then I started putting together the stack of Pi3s. First one…

IMG_6383

Then the rest…

IMG_6385

Then it was just a case of connecting each Pi3 to my network switch (100Mbps for now) with a network patch cable and to the power supply with a Micro-USB power cable.

IMG_6386

The next job is to figure out the software required for each MicroSD card. Should I go for a custom development under Windows 10 IoT? Or maybe just go for Docker/Kubernetes? Or should I start with Hadoop?

Cleaning Data – Numbers

Numbers

Numbers can manifest themselves in a number of ways and be subject to various restrictions. For example, there are a number of different types of Integer (whole number) where the difference is how large a value can be in both a positive and negative direction. There are four basic types of Integer :

Short. Uses 2 bytes. Allows values from -32,768 to 32,767
Unsigned Short. Uses 2 bytes. Allows values from 0 to 65,535
Long. Uses 4 bytes. Allows values from -2,147,483,648 to 2,147,483,647
Unsigned Long. Uses 4 bytes. Allows values from 0 to 4,294,967,295
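
The ranges above follow directly from the number of bytes used. Here is a minimal Python sketch of that arithmetic, assuming the usual two's complement storage for signed values. The function names integer_range and fits are my own, purely for illustration.

```python
def integer_range(num_bytes, signed=True):
    """Return (lowest, highest) for an integer stored in num_bytes bytes."""
    bits = num_bytes * 8
    if signed:
        # One bit pattern in two is used for the negative half of the range
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def fits(value, num_bytes, signed=True):
    """True if value can be stored in the target type without overflowing."""
    low, high = integer_range(num_bytes, signed)
    return low <= value <= high


print(integer_range(2))              # Short
print(integer_range(4, False))       # Unsigned Long
print(fits(40000, 2))                # does 40,000 fit into a Short?
print(fits(40000, 2, signed=False))  # ...and into an Unsigned Short?
```

A check like fits() is the sort of thing worth running before forcing a column of values into a smaller integer type.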

Then there are floating point types which allow numbers with decimal places. There are three basic types :

Float. Uses 4 bytes. Allows values from 1.2E-38 to 3.4E+38. About 6 significant digits
Double. Uses 8 bytes. Allows values from 2.3E-308 to 1.7E+308. About 15 significant digits
Long Double. Uses 10 bytes. Allows values from 3.4E-4932 to 1.1E+4932. About 19 significant digits
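
To see the precision difference in practice, here is a small Python sketch. A Python float is an 8 byte Double, so the sketch uses the standard struct module to squeeze a value through a 4 byte Float and back. The helper name as_float32 is my own.

```python
import struct
import sys


def as_float32(value):
    """Round-trip a Python float (8 byte Double) through a 4 byte Float."""
    return struct.unpack("f", struct.pack("f", value))[0]


original = 0.123456789012345
narrowed = as_float32(original)

print(original)            # the Double keeps all the digits typed in
print(narrowed)            # the Float only agrees for roughly the first 6 or 7
print(sys.float_info.max)  # largest value a Double can hold
```

The two printed values agree for the first few digits and then drift apart, which is exactly the loss to expect when a Double is written into a Float column.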

Cleaning numbers can be a slightly tricky affair. Not only do you need to consider the target numeric types that you may require, there’s also the problem of what to do if any of the following values appear in your source data when you look at it in, say, Excel or a text editor :

  1. ‘blank’ values i.e no value present and shown as a blank
  2. ‘NULL’ or ‘null’ i.e a specific null value
  3. ‘NaN’ i.e “Not A Number”
  4. Scientific expressions of numeric values i.e a representation of a small or large real number using “E” notation e.g 1.2E-10 which represents the value of 0.00000000012.
  5. Infinity value i.e a value in a numeric that is used to denote the value of infinity e.g ‘inf’.

If you are working with a single cleaning transaction (i.e a one off cleaning task) then this is not so much of an issue but if this is something that you need to automate then you will need to either improve the data quality of your source or build a transformation rule that deals with all possible special values.

The issue with special values of course is that in some programming languages an invalid numeric value will cause an error. Therefore deciding how to deal with what are essentially string values will determine the success of any cleaning operation.
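
As a rough illustration, here is a minimal Python sketch of a transformation rule for the special values listed above. The names MISSING_MARKERS and to_float_or_none are hypothetical, and turning infinity into "no value" is just one possible policy, not the only sensible one.

```python
import math

# Text that should be treated as "no value" (compared in lower case)
MISSING_MARKERS = {"", "null", "nan"}


def to_float_or_none(text):
    """Convert raw text to a float, returning None for anything unusable."""
    cleaned = text.strip()
    if cleaned.lower() in MISSING_MARKERS:
        return None
    try:
        # float() already understands "E" notation as well as "inf"
        value = float(cleaned)
    except ValueError:
        return None  # not a number at all
    if math.isinf(value) or math.isnan(value):
        return None  # assumption: infinity is not wanted downstream
    return value


for raw in ["", "NULL", "NaN", "1.2E-10", "inf", "abc", "42"]:
    print(repr(raw), "->", to_float_or_none(raw))
```

Whether a bad value should become None, zero, or a rejected row depends on what the data is for, so the return value here is only a placeholder for that decision.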

Cleaning Numbers :

Taking account of the information described above, the following should be taken into account when cleaning numeric data :

  1. Determine the source and target numeric formats. Where you are translating from one type of number to another you must be sure that you are not trying to force values greater than the largest value allowed. You will also lose precision when converting Real numbers into Integers i.e you will lose any value after a decimal point.
  2. Ensure the content of fields contains only characters pertaining to numeric formats. Therefore characters such as ‘.’, or ‘,’ in some parts of the world, for real numbers should be allowed as should indicators of positive or negative numbers i.e ‘+’ and ‘-’. If scientific numbers may be encountered then the ‘E’ or ‘e’ characters can be allowed. Format may also be important as sometimes the ‘-’ may appear at the end of a numeric field to indicate the number is negative. Also be aware that sometimes negative numbers can be surrounded by brackets to indicate they are negative – this is a common format used by Accountants in Microsoft Excel e.g (5000) means -5000.
  3. Ensure correct range. If you are reading percentage data for example then it may be that you are only expecting values from 0 to 100, perhaps with decimal places. A range check may be required. Perhaps some of your percentage data might be suffixed with a ‘%’ symbol or even expressed as a value between 0 and 1 where 0.2 for example actually means 20%.
  4. If you are performing calculations when transforming numeric data you must be aware of implicit rounding that can introduce unwanted errors. This can occur when you store, or record the results of, a calculation that naturally produces values with tiny fractional parts into a numeric type that does not have the precision required to hold the result. Rounding will occur implicitly due to the precision of the numeric types involved and you may get undesirable results. It is always good practice to perform calculations involving multiplication and division to the highest level of precision possible and to only round as a last step.
  5. Be aware of Proportionality in repeating groups. If you have data which exists on multiple rows, and you are perhaps calculating proportions between these rows, be prepared to adjust the values calculated to ensure correct proportionality or the total thereof. For example – if a number of rows of data are supposed to add up to 100% then you may need to adjust accordingly.
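
Some of the points above can be sketched in Python. This is only an illustration under my own assumptions: the function names (parse_number, parse_percentage, rebalance_to_total) are hypothetical, the separators are passed in rather than detected, and pushing the rounding residue onto the largest row is just one way of restoring a total.

```python
from decimal import Decimal, ROUND_HALF_UP


def parse_number(text, decimal_sep=".", thousands_sep=","):
    """Parse a number written in one of several common display formats."""
    cleaned = text.strip()
    negative = False
    if cleaned.startswith("(") and cleaned.endswith(")"):
        negative = True                  # accountancy style: (5000)
        cleaned = cleaned[1:-1].strip()
    if cleaned.endswith("%"):
        cleaned = cleaned[:-1].strip()   # drop a percentage suffix
    if cleaned.endswith("-"):
        negative = True                  # trailing minus: 5000-
        cleaned = cleaned[:-1].strip()
    elif cleaned.startswith("-"):
        negative = True
        cleaned = cleaned[1:].strip()
    elif cleaned.startswith("+"):
        cleaned = cleaned[1:].strip()
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    value = float(cleaned)  # raises ValueError if non-numeric text is left
    return -value if negative else value


def parse_percentage(text, stored_as_fraction=False):
    """Return a percentage on the 0 to 100 scale, with a range check."""
    value = parse_number(text)
    if stored_as_fraction and "%" not in text:
        value *= 100                     # 0.2 really means 20%
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {text!r}")
    return value


def rebalance_to_total(values, total=Decimal("100"), places=Decimal("0.01")):
    """Scale rows to a total, rounding last and fixing up any residue."""
    shares = [Decimal(str(v)) for v in values]
    grand = sum(shares)
    rounded = [(s / grand * total).quantize(places, rounding=ROUND_HALF_UP)
               for s in shares]
    residue = total - sum(rounded)       # whatever rounding lost or gained
    rounded[rounded.index(max(rounded))] += residue
    return rounded


print(parse_number("(5000)"))
print(parse_number("1.234,5", decimal_sep=",", thousands_sep="."))
print(parse_percentage("20%"))
print(rebalance_to_total([1, 1, 1]))
```

rebalance_to_total uses Decimal rather than float so that the division is carried at high precision and rounding only happens once, at the end.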

Next – The State Of Data (Coming soon)

Cleaning Data – Strings

A string, as it is known, is a series of alphanumeric characters and will most commonly include those letters, numbers and punctuation present in sentences for any particular human language. It can however include what are known as special characters.

Cleaning Strings :

  1. Trim “spaces” from the beginning and end of a string. It is useful to note here that sometimes what looks like a space may not actually BE a space but a different character and it’s just that what is shown on screen is a space rather than the underlying character code. Confused? Well for those that don’t know, every letter or number or special character has a “code number” and it just so happens that the code number for a “space” that you would type into your computer using the space bar on your keyboard is 32. If you use something like the “TRIM” function in Excel to remove spaces from a string value what you are actually asking is to remove characters with code 32 at the beginning and end of your string value. If the character that is SHOWN on your screen looks like a space but actually isn’t, and has a different code number, then it will not be removed. A common culprit is the non-breaking space, which has code 160.
  2. Remove Special Characters. It is sometimes easier to decide which characters are allowed and remove those you don’t want. This is easy enough if you are working with data expressed in English but can get more complicated when you need to take account of other Languages that use for example umlauts (those characters with two dots above the letter like “ü”) as in the German language. It is possible to create replacement lists for characters so as to provide an “equivalent” value (e.g replace “â” or “ã” or “ä” or “å” with a straight “a”) but it really depends on what you need to achieve.
  3. Homogenise Values. Some string values within the same column of a file will actually represent a particular value that is a member of a larger set of possible values. For example a value such as “United Kingdom” represents a geographical area. Sometimes values such as these will have been typed into a computer system rather than being selected from a list, so they may contain many spelling mistakes. While it is accepted here that it can be difficult to provide a means to resolve every possible permutation of spelling mistake the objective would still be the same. To make data fit for purpose, and to enable it to be used, the mistakes would need to be corrected. Sometimes this correction can be achieved through programmatic means but on many occasions the only way to deal with this problem is to get a complete list of the values entered and then provide a replacement value for each, to be applied manually or via a macro etc.
  4. Homogenise Meaning. This subject becomes a little more complicated but essentially the purpose of homogeneity of meaning is to “clean” or provide a mutually compatible value for two different string values. For example where two strings use different combinations of values to mean the same object such as in city names like “Zurich” which can also be spelt “Zuerich” or “Zürich”. Another example is in ball bearing product names as described in this article about TAMR.
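
The four steps above might look something like this in Python. It is a minimal sketch using only the standard library. The lists KNOWN_COUNTRIES and CITY_ALIASES and the 0.8 similarity cutoff are made up for illustration, and real reference data would normally come from somewhere more authoritative.

```python
import difflib
import unicodedata

KNOWN_COUNTRIES = ["United Kingdom", "United States", "Germany"]  # hypothetical
CITY_ALIASES = {"Zuerich": "Zurich", "Zürich": "Zurich"}          # hypothetical


def trim_all_spaces(text):
    """Trim any Unicode whitespace, not just the ordinary space (code 32)."""
    return text.strip()


def strip_accents(text):
    """Replace accented letters with their plain equivalents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def homogenise_value(value, allowed=KNOWN_COUNTRIES, cutoff=0.8):
    """Map a possibly misspelt value onto the closest allowed value."""
    matches = difflib.get_close_matches(value.strip(), allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None  # None means "needs a human"


def homogenise_meaning(value, aliases=CITY_ALIASES):
    """Map different spellings of the same thing onto one agreed value."""
    return aliases.get(value, value)


padded = "\u00a0Liverpool\u00a0"  # non-breaking spaces (code 160) either side
print(repr(padded.strip(" ")))    # only code 32 removed, so nothing changes
print(repr(trim_all_spaces(padded)))
print(strip_accents("âãäå"))
print(homogenise_value("Untied Kingdom"))
print(homogenise_meaning("Zuerich"))
```

Fuzzy matching like this will never catch everything, so values that come back as None still need to go onto a list for someone to review.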

Next – Cleaning Numbers

 

Cleaning Data – Why Bother?

The purpose of cleaning data is straightforward – to make it fit for purpose.

The level of what fitness for purpose means obviously leaves room for interpretation but I always look at it from a completely mechanical viewpoint because when it comes down to it we’re talking about data that’s being used by computer systems. Even if data values look the same, sometimes they just aren’t. Take the following four string values :

“ Liverpool”, “Liverpool ”, “Liverpool”, “ Liverpool ”

Yes it’s true that each one says the same thing to a human reader but to a computer they are different. Three of the values have spaces at the beginning, the end, or both, and a computer system will not treat any of the four as equal to another. Now take a look at the following three values :

“01/06/2016”, “06-01-2016”, “1-JUN-2016”

Given the previous example you might expect each of these to be treated as different too. Strangely, though, a computer system may interpret each one as a valid Date value and, depending on whether it reads the day or the month first, equate all three as being the same date.
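
Both examples can be reproduced in a few lines of Python. This is only a sketch: the format strings are my assumption about how each date was written (day first for the value with slashes, month first for the value with dashes), which is exactly the kind of thing you would need to confirm with whoever owns the data.

```python
from datetime import datetime

towns = [" Liverpool", "Liverpool ", "Liverpool", " Liverpool "]
print(len(set(towns)))                  # distinct values as stored
print(len({t.strip() for t in towns}))  # distinct values once trimmed

# Each text value paired with the layout I am assuming it was written in
dated = [
    ("01/06/2016", "%d/%m/%Y"),
    ("06-01-2016", "%m-%d-%Y"),
    ("1-JUN-2016", "%d-%b-%Y"),
]
parsed = {datetime.strptime(text, fmt).date() for text, fmt in dated}
print(parsed)                           # the set of distinct dates found
```

The strings look different but collapse to a single value once they are interpreted, while the town names look the same but only collapse once they are cleaned.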

So we have two problems that you need to be aware of :

  1. What is the actual underlying value
  2. How is that value being represented and/or displayed to the human eye and is it different to how a computer is interpreting that value.

Therefore it can be necessary to be very specific when enquiring upon data to ensure that what you see is the real value, or an interpretation of it that is based upon an expected value. Take an age old problem that can catch people out – decimal number values in Microsoft Excel :

Enter a value of say “15.999” into a cell then change the number format to remove any decimal places shown. Of course, as those of you with some experience of Excel will know, the number will show as “16” on your screen. Perform any calculations using that cell however and the value being used in your calculation will be “15.999”. If you then choose to export your file, though, you need to be aware that the export can replace your underlying values with the formatted values. Calculations are always based upon the real values but formatted values can be exported exactly as shown on the screen, which may not be what you wanted to happen!
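
The same effect is easy to show outside of Excel. In this small Python sketch the variable names are my own, and formatting to zero decimal places stands in for the Excel number format.

```python
stored = 15.999               # the real underlying value
displayed = f"{stored:.0f}"   # what the screen shows with no decimal places

print(displayed)              # the formatted text
print(stored * 2)             # a calculation on the real value
print(float(displayed) * 2)   # the same calculation after exporting the text
```

Once the formatted text is what gets written out, the original 15.999 is gone and every later calculation quietly starts from 16.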

So the whole point of cleaning data is to ensure that it’s fit for purpose. The next few pages suggest some starting points for different data types.

Next – Cleaning Strings

C# vs Python Construct Comparison

This post shows some major C# and Python programming constructs side by side for comparison. Disregarding the fact that C# is a compiled language and Python is interpreted you can see there are many similarities. As a friend of mine says “the only reason we program in one language or another is because of the libraries we want to use”.

This is a working list and more side by side comparisons will be added as time goes on!

CSharpPythonComparison

Chains And Ladders – Part 1

Chains and ladders, sometimes known as development triangles, is a modelling technique, and hence a prediction technique, for process data. It lends itself well to business planning where the focus is placed on deterministic targets and subsequent monitoring of actual versus expected results. That’s exactly what I built these models to achieve previously with great success – or anti-success should I say? More on that later in this series of posts though.

Creating The Right Environment

There are a number of elements to my push into the world of Data Science but ultimately the idea is to make Data Science my job. To understand how I am to get there I’ve needed to start looking at what it is that is missing from my experience and knowledge and then start to fill the gaps. I need to learn the whole piece and then create the environment required. This means equipment and tools. I’ve made a good start on building the cluster and I’ve also looked into the tools. The obvious requirements are development environments for both R and Python programming so I’ve picked those up and started working with them. I do need to get an IDE for Python though and I think I’ve decided on Komodo with the ActivePython For Data Science add-ons. In addition I need data storage and a DB Server that has enough storage capacity for my ongoing needs. I’ve purchased rather a lot of DB space for projects over the last 18 months but to be fair it’s pretty costly compared with the option of picking up a Synology NAS and running MariaDB (MySQL) on it. I picked up a Synology DS216J and installed two 3TB volumes on it for now but I will probably need to go for a larger Synology NAS at some point. The good thing about the ActivePython add-ons is that a DB connector for MySQL is included.

So…In the not too distant future I will have a working Python development environment, Database space available for project work and a Pi3 Cluster for extra computation power – although some may find that last part funny! Seriously – Pi3s aren’t to be sniffed at especially when running massively parallel processes – that’s exactly where I’ll be going with it.

The other part of the equation here is Data Science Education…I need to learn the actual techniques which requires finding the right course to start out with…that’s the next task.

R & Python – A First Peek

I signed up to the Johns Hopkins Data Science course maybe a year or so ago then did the first module (The Data Scientist’s Toolbox) and got a distinction. Since then I procrastinated a little but I’m halfway through the R Programming module at the moment.

Programming…it’s second nature to me really. I was a software developer after all so learning a new language is in many respects an exercise in just doing what I want to do and picking the language up as I go. Every program is after all built from Sequence, Selection and Iteration, and programming constructs in any programming language are very similar to each other.

Therefore getting experience in either R or Python for that matter is just a case of picking something to do and getting on with it. What I’ve learnt really in my first few weeks of R Programming is the power contained in how data structures are built, what you can do with them, and in the extension libraries. Everything else just seems to be all about learning the actual Data Science techniques and what to apply them to. There’s also the question of Python and what that entails. I took a bit of a deeper dive into the detail today on Python in my lunch break and discovered that as far as the actual Programming Language is concerned there’s not much to it. I read through the Python Tutorial and as far as I can see I’m in the same boat as learning R – the real power is actually in the extension libraries and which techniques to apply to real problems. This explains why Data Science jobs always mention things like pandas, NumPy and SciPy.

So I really don’t see myself having TOO much trouble getting to grips with the R or Python languages so I just need to get myself the relevant IDE and any other libraries and just get on with it. I do have a slight grievance in a sense in that I’ve been working with C# on my own projects for a few years and being able to develop systems in Visual Studio is something I’m familiar with and I’m a little reluctant to drop it. Maybe I don’t need to? Maybe I can work with Python and R under a C# umbrella? Perhaps I should look at building some libraries myself. The obvious choice would be to build a Genetic Algorithm library to start with since I worked heavily on these at college and I think I still have the code somewhere. I’ve read, and still own, the Goldberg Book and it just so happened I came across it a couple of days back. I could write it in C# and then look at building examples of use in Python to see where that gets me.