Blog

First Build

This evening I finally got to the first step of building my Pi3 cluster. Over the last few weeks I've examined a lot of pictures, and after a couple of false starts I think I have everything required.

[Photo: all the components gathered together]

I bought the parts from a number of sources:

Perspex (see through acrylic) from www.directplastics.co.uk

Blinkt! LED Indicators from shop.pimoroni.com/products/blinkt

Aluminium spacer kit from www.modmypi.com

4mm Aluminium spacers from uk.mouser.com

Corsair CX600 PC Power Supply from PC World

40 pin GPIO Connector Header Extender 90 Degree Angle from www.modmypi.com

Aluminium Heat Sink kit for Pi3 from a seller on eBay

Cat 6 Patch Cables from a seller on eBay

For each Pi3 I pushed the Blinkt! LED strip onto one side of the 90-degree GPIO extender and then pushed the other side of the extender onto the board's GPIO header. Then I attached two of the heat sinks onto the appropriately sized chips on the Pi3 board.

Next I took my precut perspex, the size of which I had figured out using a piece of card, and, using a small drill bit in an Archimedes drill, made four pilot holes in it.

[Photo: drilling the pilot holes in the perspex]

Once these pilot holes were made in the appropriate locations I drilled them out to the proper size. I managed to break a few pieces, but on the third attempt I had success.

[Photo: the drilled perspex plate]

Once I was happy with the location of the Pi3 mounting holes I drilled the three outer spacer holes. Then I started putting together the stack of Pi3s. First one…

[Photo: the first Pi3 mounted]

Then the rest…

[Photo: the completed stack of Pi3s]

Then it was just a case of connecting each Pi3 to my network switch (100 Mbps for now) with the Cat 6 patch cables, and to the power supply with Micro-USB power cables.

[Photo: the cluster cabled up]

The next job is to figure out the software required for each MicroSD card. Should I go for a custom development under Windows 10 IoT? Or maybe just go for Docker/Kubernetes? Or should I start with Hadoop?


October 6th Update

Another update – it has been almost three and a half weeks since the last one.

Work has been rather busy since we're in UAT for a large software development within the London Insurance Market. I'm right in the middle of the action, plus it's happening just when I'm also doing the Southampton Data Science Academy Fundamentals Of Data Science course. So I'm working with detailed transaction/message data by day and studying Data Science by night. Here's a summary of progress:

  1. Raspberry Pi Cluster – nothing done since last time.
  2. The Fundamentals Of Data Science course run by the Southampton Data Science Academy. I'm in week 4 of the 6 week course and have finished two of the three assignments. I will have passed both with at least 80%. I know this because the assignments are automatically assessed (they're built in Jupyter Notebooks); the exceptions are a prose question on Assignment 1, which I may or may not get marks for, and the final question on Assignment 2, which I haven't figured out the answer to. I may still have a go at it, as you can re-submit an assignment as many times as you want but only the final submission is used for your result.
  3. The Data Science With Python online course with Datacamp. I've completed 6 of the 20 modules and I'm halfway through module 7 – there are 13 more to go after that! Modules completed are:
    1. Intro To Python For Data Science
    2. Intermediate Python For Data Science
    3. Python Data Science Toolbox (Part 1)
    4. Python Data Science Toolbox (Part 2)
    5. Importing Data in Python (Part 1)
    6. Importing Data in Python (Part 2)
  4. Distributed Python "Describe" exploratory analysis engine – not started.
  5. Distributed Python Genetic Algorithm – not started.
  6. UK Property Market Analysis showcase DB. The database required for analysing the UK Property Market is being built as I type! I wrote a Python program in Jupyter Notebooks that scans all the CSV files downloaded from the Land Registry, loops through each of them, and inserts rows into my UKPropertySales table on my local MariaDB (which lives on my NAS) – there's a sketch of roughly what that loader looks like at the end of this post. This means I will have a DB of every property sale in the UK from January 1995 until August 2017. The next step will be to write a Python program that analyses the property sales and produces a set of descriptive analyses for every Postcode Area/Sector in the UK.
  7. UK Property Market Analysis showcase. I've also been pulling apart the vector file to figure out how I can link any analysis by postcode to the vector shapes for each Postcode Area/District. I started by purchasing an app called BoxySVG, which allows access to the metadata for each shape, and began manually adding the Postcode Area/District label to the Title field in the metadata. This allows a shape to be linked directly to any Postcode Area/District. I then looked at how I could take the vector co-ordinates for each shape, and for each Postcode Area/District label, from the SVG file and figure out automatically which label belongs to which shape, writing the label value into the metadata to save myself a huge job. I sometimes start things manually because I believe that by getting your hands dirty in the small details for a while you often discover the patterns you need to develop a better solution – you end up knowing what you're dealing with in a much more intimate way. After working on this for a while I switched to looking at the property sales data for a single Postcode Area/District. I chose one that I had lived in previously, just for the extra interest. I normalised the average property price by adjusting according to the CDKO index described below. It was when I built a chart in Excel of the property sales for that area that I discovered that, on a monthly basis, a Postcode Area/District does not have many individual sales. This would cause a big problem when it came to plotting the data on the UK postcode map, so at this point I decided to move up from plotting at District level to Sector level, e.g rather than plot at the level of "AL1 A" I will plot at "AL1". Doing this gives a much better picture of average property sale prices. There are also far fewer Postcode Area/Sectors to deal with, which will make processing of the visual plots much quicker. Something else I looked into was how I can turn the postcode area shapes into 3D shapes rather than flat 2D ones. Anyway, here is the chart that is the foundation of the analysis to be undertaken per Postcode Area/Sector (a sketch of the normalisation calculation follows it). The blue line is the average property sale price and the red line is a 12 month rolling average. The index used to normalise the average sale price is labelled CDKO ("Long term indicator of prices of consumer goods and services (Jan 1974=100)") and was downloaded from the data.gov.uk website. This gives an indication of the relative price of property across the years.

[Chart: average property sale price (blue) and 12 month rolling average (red), normalised by the CDKO index]
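For anyone curious, the core of that normalisation is only a few lines of pandas. This is a minimal sketch rather than my actual notebook code – the file names and column names (date, price, cdko) are assumptions:

```python
import pandas as pd

# Monthly average sale price for one postcode area (file/column names assumed).
sales = pd.read_csv("AL1_sales.csv", parse_dates=["date"])
monthly = sales.set_index("date")["price"].resample("M").mean().to_frame("avg_price")

# CDKO index ("Long term indicator of prices of consumer goods and services",
# Jan 1974 = 100) from data.gov.uk, assumed here to be one value per month end.
cdko = pd.read_csv("cdko.csv", parse_dates=["date"]).set_index("date")["cdko"]

# Normalise by the index so prices are comparable across the years,
# then smooth the sparse monthly numbers with a 12-month rolling average.
monthly["normalised"] = monthly["avg_price"] / cdko.reindex(monthly.index)
monthly["rolling_12m"] = monthly["normalised"].rolling(12).mean()
```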

So…lots more progress but not enough time to write something more substantial than an update!
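As promised in item 6 above, here's roughly what the Land Registry CSV loader looks like. A sketch only: the directory name and connection string are placeholders, and I'm assuming the CSV columns already match the UKPropertySales table:

```python
import glob
import pandas as pd
from sqlalchemy import create_engine

# Connection string is a placeholder - swap in your own NAS host and credentials.
engine = create_engine("mysql+pymysql://user:password@nas.local/property")

# Loop through every downloaded Land Registry CSV and append its rows
# to the UKPropertySales table on the MariaDB instance.
for path in glob.glob("landregistry/*.csv"):
    df = pd.read_csv(path)
    df.to_sql("UKPropertySales", engine, if_exists="append", index=False)
```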

September 13th Update

Recap

I decided this evening that I ought to post an update on progress! It has been a couple of months since the last post and lots of progress has been made. Briefly, I have figured out and done the following:

  1. Raspberry Pi cluster power supply issue. This involved purchasing a 600W PC power supply and some USB power distribution boards (PDBs). The power supply has been hacked and connected to the USB PDBs, each of which in turn powers 4 Raspberry Pis. The power supply has enough juice to run 12 Raspberry Pis simultaneously, plus the two Gigabit network switches required to connect the RPis to my local network router.
  2. Set up a MariaDB database on my local Network Attached Storage (NAS). This database will be used as needed for Raspberry Pi cluster software projects.
  3. Signed up for, and started on 4th September, the Fundamentals Of Data Science course by the Southampton Data Science Academy.
  4. Signed up for, and started on 5th September, the Data Science With Python online course with Datacamp.
  5. Downloaded Anaconda and got started with Jupyter Notebooks. These will be two important packages/tools to be fluent with as I get into my Data Science career.
  6. Decided upon two initial, and interlinked, Data Science software projects to pursue. Firstly to build a distributed Python analysis engine based upon my previous automated analysis tool that I called “Describe”. Secondly to build a distributed Genetic Algorithm processing cluster for function optimisation. Both projects will be written mainly in Python but also using the Microsoft stack where appropriate.
  7. Using skills gained from the Datacamp course I've managed to connect to my local MariaDB database and to the remote MS-SQL databases that I rent through Fasthosts. This enables me to make use of my SQL and data analysis skills, particularly with Microsoft's T-SQL: I will be able to build complex queries using SQL Server Management Studio and then make use of them in Python.
  8. Decided upon an initial Data Science showcase analysis that uses the knowledge of the UK Residential Property Market that I gained when working in Risk Management. This analysis makes use of Python skills I’ve learnt doing the Datacamp course mentioned above. So far I have built :

a) A URL data extraction program, using Jupyter Notebooks / Python, that loops through all UK Postcode Districts and downloads the property sales transaction data from 1st January 1995 to the present. It fetches a file for each District and renames the file to the appropriate District postcode (there's a sketch of this kind of download loop after this list).

b) A Python program that pulls the data for a single UK Postcode District into a pandas dataframe, enriches it with some extra columns, then creates a new dataframe that groups the property sales transactions by Year/Month and averages the sales for each Year/Month so that the data can be plotted.

c) Downloaded UK Retail Price Index (RPI) and Average Earnings Index (AEI) data.

d) Purchased a UK Postcode Districts and Sectors SVG vector map which will be used to present the Property sales and RPI/AEI data that I’ve downloaded in interesting ways! I’ve figured out how to edit the SVG file, and how I will be able to plot data geographically on the map.
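To give a flavour of (a), the download loop boils down to something like the following. The URL template and the three districts are placeholders, not the real Land Registry endpoint:

```python
import requests

# Placeholder district list and URL template - the real endpoint and the
# full list of UK Postcode Districts are not reproduced here.
districts = ["AL1", "AL2", "AL3"]
URL = "https://example.gov.uk/price-paid?district={district}&from=1995-01-01"

for district in districts:
    response = requests.get(URL.format(district=district), timeout=60)
    response.raise_for_status()
    # Save each file named after its District postcode, as described above.
    with open(f"{district}.csv", "wb") as f:
        f.write(response.content)
```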

That’s about it I think although there’s plenty of research, reading and thought that’s been going on behind all of the above.

The next 5 weeks will be all about completing the two Data Science courses that I’m taking plus completing the initial showcase.

Here’s to progress!

Cleaning Data – Numbers

Numbers

Numbers can manifest themselves in many forms and be subject to various restrictions. For example, there are several different types of integer (whole number), where the difference is how large a value can be in both the positive and negative directions. Taking the classic C-style sizes (they vary by language and platform), there are four basic types of integer:

Short. Uses 2 bytes. Allows values from -32,768 to 32,767
Unsigned Short. Uses 2 bytes. Allows values from 0 to 65,535
Long. Uses 4 bytes. Allows values from -2,147,483,648 to 2,147,483,647
Unsigned Long. Uses 4 bytes. Allows values from 0 to 4,294,967,295

Then there are floating point types, which allow numbers with decimal places. There are three basic types:

Float. Uses 4 bytes. Allows values from 1.2E-38 to 3.4E+38. About 6 significant decimal digits
Double. Uses 8 bytes. Allows values from 2.3E-308 to 1.7E+308. About 15 significant decimal digits
Long Double. Uses 10 bytes. Allows values from 3.4E-4932 to 1.1E+4932. About 18–19 significant decimal digits
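If you'd rather check these limits than memorise them, NumPy's fixed-width types expose them directly. A quick sketch (note that NumPy's names differ from the C-style names above, e.g int16 rather than Short):

```python
import numpy as np

# Integer limits for the fixed-width equivalents of the four types above.
for t in (np.int16, np.uint16, np.int32, np.uint32):
    info = np.iinfo(t)
    print(f"{t.__name__}: {info.min} to {info.max}")

# Floating point limits plus the approximate significant decimal digits.
for t in (np.float32, np.float64):
    info = np.finfo(t)
    print(f"{t.__name__}: {info.tiny:.1e} to {info.max:.1e}, ~{info.precision} digits")
```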

Cleaning numbers can be a slightly tricky affair: not only do you need to consider the target numeric types that you require, but there is also the problem of what to do if you have any of the following values in your source data (as seen in Excel or a text editor, for example):

  1. ‘blank’ values i.e no value present and shown as a blank
  2. ‘NULL’ or ‘null’ i.e a specific null value
  3. ‘NaN’ i.e “Not A Number”
  4. Scientific expressions of numeric values i.e a representation of a small or large real number using “E” notation e.g 1.2E-10 which represents the value of 0.00000000012.
  5. Infinity value i.e a value in a numeric that is used to denote the value of infinity e.g ‘inf’.

If you are working on a single cleaning transaction (i.e a one-off cleaning task) then this is not so much of an issue, but if this is something you need to automate then you will need to either improve the data quality of your source or build a transformation rule that deals with all possible special values.

The issue with special values, of course, is that in some programming languages converting an invalid value to a numeric type will cause an error. Therefore, deciding how to deal with what are essentially string values will determine the success of any cleaning operation.
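To illustrate, here is a minimal Python sketch of a conversion routine that maps all of the special values listed above to a single "missing" result (None) instead of raising an error:

```python
import math

def parse_number(raw):
    """Convert a raw field to a float, returning None for special values."""
    if raw is None:
        return None                       # genuinely missing value
    s = str(raw).strip()
    if s == "" or s.lower() in ("null", "nan"):
        return None                       # blanks plus NULL/null/NaN markers
    try:
        value = float(s)                  # float() already accepts "1.2E-10"
    except ValueError:
        return None                       # not a number at all
    if math.isinf(value) or math.isnan(value):
        return None                       # "inf" parses but is not usable
    return value
```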

Cleaning Numbers:

Given the information described above, the following should be taken into account when cleaning numeric data:

  1. Determine the source and target numeric formats. Where you are translating from one type of number to another, you must be sure that you are not trying to force in values greater than the largest value allowed. You will also lose precision when converting real numbers into integers, i.e you will lose any value after the decimal point.
  2. Ensure the content of fields contains only characters pertaining to numeric formats. Characters such as '.' (or ',' in some parts of the world) should be allowed for real numbers, as should indicators of positive or negative numbers, i.e '+' and '-'. If scientific numbers may be encountered then the 'E' or 'e' characters can also be allowed. Format may matter too: sometimes the '-' appears at the end of a numeric field to indicate the number is negative, and sometimes negative numbers are surrounded by brackets – a common format used by accountants in Microsoft Excel, e.g (5000) means -5000. Both conventions are handled in the sketch after this list.
  3. Ensure correct range. If you are reading percentage data for example then it may be that you are only expecting values from 0 to 100, perhaps with decimal places. A range check may be required. Perhaps some of your percentage data might be suffixed with a ‘%’ symbol or even expressed as a value between 0 and 1 where 0.2 for example actually means 20%.
  4. If you are performing calculations when transforming numeric data, you must be aware of implicit rounding that can introduce unwanted errors. This can occur when you store the result of a calculation that naturally produces tiny fractional values into a numeric type that does not have the precision required to hold it. Rounding will occur implicitly due to the precision of the numeric types involved and you may get undesirable results. It is always good practice to perform calculations involving multiplication and division at the highest level of precision possible and to round only as a last step.
  5. Be aware of Proportionality in repeating groups. If you have data which exists on multiple rows, and you are perhaps calculating proportions between these rows, be prepared to adjust the values calculated to ensure correct proportionality or the total thereof. For example – if a number of rows of data are supposed to add up to 100% then you may need to adjust accordingly.
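Pulling points 2 and 3 together, here is a sketch of a cleaner that copes with the formats mentioned above (trailing minus, accountants' brackets, '%' suffixes and thousands separators). A starting point rather than a complete rule set:

```python
def clean_numeric_string(s):
    """Normalise common numeric formats to a float. A sketch, not exhaustive."""
    s = s.strip()
    negative = False
    if s.startswith("(") and s.endswith(")"):   # accountants' format: (5000) = -5000
        negative, s = True, s[1:-1]
    if s.endswith("-"):                          # trailing minus: "5000-" = -5000
        negative, s = True, s[:-1]
    percent = s.endswith("%")                    # "20%" = 0.2
    if percent:
        s = s[:-1]
    s = s.replace(",", "")                       # strip thousands separators
    value = float(s)
    if negative:
        value = -value
    return value / 100 if percent else value

print(clean_numeric_string("(5,000)"))   # -5000.0
print(clean_numeric_string("20%"))       # 0.2
```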

Next – The State Of Data (Coming soon)

Cleaning Data – Strings

A string, as it is known, is a series of alphanumeric characters and will most commonly include the letters, numbers and punctuation present in sentences of any particular human language. It can, however, also include what are known as special characters.

Cleaning Strings:

  1. Trim "spaces" from the beginning and end of a string. It is useful to note here that sometimes what looks like a space may not actually BE a space but a different character; it's just that what is shown on screen looks like a space rather than the underlying character code. Confused? For those that don't know, every letter, number or special character has a "code number", and it just so happens that the code number for the "space" you type using the space bar on your keyboard is 32. If you use something like the "TRIM" function in Excel to remove spaces from a string value, what you are actually asking is for characters with code 32 to be removed from the beginning and end of your string value. If the character SHOWN on your screen is displayed as a space but actually has a different code number, then it will not be removed (see the sketch after this list).
  2. Remove Special Characters. It is sometimes easier to decide which characters are allowed and remove those you don't want. This is easy enough if you are working with data expressed in English but gets more complicated when you need to take account of other languages that use, for example, umlauts (those characters with two dots above the letter, like "ü") as in German. It is possible to create replacement lists for characters so as to provide an "equivalent" value (e.g replace "â" or "ã" or "ä" or "å" with a straight "a") but it really depends on what you need to achieve.
  3. Homogenise Values. Some string values within the same column of a file will actually represent a particular value that is a member of a larger set of possible values. For example, a value such as "United Kingdom" represents a geographical area. Sometimes values such as these will have been typed into a computer system rather than selected from a list, so they will possibly contain many spelling mistakes. While it is accepted that it can be difficult to resolve every possible permutation of spelling mistake, the objective remains the same: to make data fit for purpose and usable, the mistakes need to be corrected. Sometimes this correction can be achieved through programmatic means, but on many occasions the only way to deal with the problem is to get a complete list of the values entered and then provide a replacement value for each, corrected either manually or via a macro etc
  4. Homogenise Meaning. This subject becomes a little more complicated but essentially the purpose of homogeneity of meaning is to “clean” or provide a mutually compatible value for two different string values. For example where two strings use different combinations of values to mean the same object such as in city names like “Zurich” which can also be spelt “Zuerich” or “Zürich”. Another example is in ball bearing product names as described in this article about TAMR.
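Here is a sketch of points 1 and 2 in Python. The non-breaking space (character code 160) is the classic "looks like a space but isn't" culprit, and Unicode decomposition gives a simple way to replace accented characters with their plain equivalents:

```python
import unicodedata

NBSP = "\u00a0"   # non-breaking space: displayed like code 32 but is code 160

def clean_string(s):
    # Trim genuine spaces, tabs and the lookalike non-breaking space.
    s = s.strip(" \t" + NBSP)
    # Decompose accented characters ("ü" becomes "u" + combining dots),
    # then drop the combining marks to leave the plain letter.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(clean_string("\u00a0Zürich "))   # -> "Zurich"
```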

Next – Cleaning Numbers


Cleaning Data – Why Bother?

The purpose of cleaning data is straightforward – to make it fit for purpose.

The level of what fitness for purpose means obviously leaves room for interpretation, but I always look at it from a completely mechanical viewpoint because, when it comes down to it, we're talking about data that's being used by computer systems. Even if data values look the same, sometimes they just aren't. Take the following four string values:

” Liverpool”, “Liverpool “, “Liverpool”, ” Liverpool ”

Yes, it's true that each one says the same thing to a human reader, but to a computer they are different: three of the four values contain extra spaces, and a computer system will not treat any of them as equal. Now take a look at the following three values:

“01/06/2016”, “06-01-2016”, “1-JUN-2016”

Given the previous example you might say that each is different, but in fact each value could be treated the same: strangely, a computer system may interpret each as a valid date value and equate all three values as being the same.
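To make that concrete, here is a small sketch. It assumes the middle value is US-style month-first; once each string is parsed with the right format, the three values compare as equal:

```python
from datetime import datetime

values = ["01/06/2016", "06-01-2016", "1-JUN-2016"]
formats = ["%d/%m/%Y", "%m-%d-%Y", "%d-%b-%Y"]   # UK, US and day-month-name styles

parsed = [datetime.strptime(v, f) for v, f in zip(values, formats)]
print(parsed)                  # all three become 2016-06-01 00:00:00
print(len(set(parsed)) == 1)   # True - the computer treats them as the same date
```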

So there are two problems that you need to be aware of:

  1. What is the actual underlying value?
  2. How is that value being represented and/or displayed to the human eye, and is that different from how a computer is interpreting it?

Therefore it can be necessary to be very specific when enquiring upon data, to ensure that what you see is the real value, or an interpretation of it based upon an expected value. Take an age-old problem that can catch people out – decimal number values in Microsoft Excel:

Enter a value of, say, "15.999" into a cell then change the number format to remove any decimal places shown. Of course, as those of you with some experience of Excel will know, the number will show as "16" on your screen. Perform any calculation using that cell, however, and the value used in your calculation will be "15.999". If you then choose to export your file, though, be aware that doing so in Excel can change your underlying values for formatted cells: any calculations will be based upon the real values, but formatted values can be exported exactly as shown on the screen, which may not be what you wanted to happen!
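The same display-versus-value trap can be reproduced outside Excel. A two-line Python illustration of the behaviour described above:

```python
x = 15.999
print(f"{x:.0f}")   # displays "16" - the formatting, like Excel's, is cosmetic
print(x * 2)        # 31.998 - calculations still use the underlying 15.999
```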

So the whole point of cleaning data is to ensure that it’s fit for purpose. The next few pages suggest some starting points for different data types.

Next – Cleaning Strings

C# vs Python Construct Comparison

This post shows some major C# and Python programming constructs side by side for comparison. Disregarding the fact that C# is a compiled language and Python is interpreted, you can see there are many similarities. As a friend of mine says, "the only reason we program in one language or another is because of the libraries we want to use".

This is a working list and more side by side comparisons will be added as time goes on!

[Image: C# and Python constructs side by side]

Chains And Ladders – Part 1

…or, as they are sometimes known, development triangles: a modelling technique, and hence a prediction technique, for process data. It lends itself well to business planning, where the focus is placed on deterministic targets and subsequent monitoring of actual versus expected results. That's exactly what I built these models to achieve previously – with great success, or should I say anti-success? More on that later in this series of posts.
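To set the scene before the next part, here is a minimal sketch of the classic chain-ladder calculation with made-up numbers: compute development (link) factors between successive columns of a cumulative triangle, then roll the incomplete rows forward to "complete the square":

```python
import numpy as np

# Cumulative triangle: rows = origin periods, columns = development periods.
# np.nan marks cells that have not developed yet. Numbers are illustrative only.
triangle = np.array([
    [100.0, 150.0, 180.0],
    [110.0, 170.0, np.nan],
    [120.0, np.nan, np.nan],
])

# Volume-weighted development factors between successive columns,
# using only the rows where both columns are known.
n_dev = triangle.shape[1]
factors = []
for j in range(n_dev - 1):
    known = ~np.isnan(triangle[:, j + 1])
    factors.append(triangle[known, j + 1].sum() / triangle[known, j].sum())

# Roll each incomplete row forward with the factors.
completed = triangle.copy()
for i in range(completed.shape[0]):
    for j in range(n_dev - 1):
        if np.isnan(completed[i, j + 1]):
            completed[i, j + 1] = completed[i, j] * factors[j]

print(factors)     # e.g. [1.5238..., 1.2]
print(completed)   # the filled-in triangle, i.e the predictions
```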