Kaggle Learnings – Exploratory Data Analysis & Data Cleaning

I am a strong believer in worked examples and case studies. Theory is all nice and well, but without applying it to a real use case, it can be quite a pointless exercise. Today on Kaggle, I came across a worked example of exploratory data analysis and data cleaning for the “House Prices: Advanced Regression Techniques” data set by Pedro Marcelino (here is the PDF download, with the kind permission of the author).

His guide through the solution was so fantastically well written, and I learnt so many things, that I want to spend the remainder of this blog post summing up my key learnings.

Please note that the analysis was done with Python and the associated libraries Pandas and Seaborn; however, it is not necessary to know these to understand the points of this article.

1. Exploratory Analysis and Data Cleaning is the most important part of any analysis

The very first take-home message for me was how important the exploratory analysis and the subsequent data cleaning are. I know this is an obvious one for everyone, but Pedro Marcelino went from 80 factors down to 7 here, and, not least of all, he did it using sound reasoning and a close look at the information provided.

Unsurprisingly, all the following points revolve around how the author managed to whittle the variables down to only 7 from the aforementioned 80.

2. Visualize the relationship between the desired variable and categorical variables

The difficulty with categorical variables is that although you can encode them as numbers, it is not always clear how much of an effect a change from 0 to 1, or from 3 to 4, will have on the output variable that we are interested in.

In this sense, visualizing them against the desired output variable using box plots, as our author has done, is an excellent start.

A box plot of OverallQual vs. SalePrice

Taking the example of SalePrice vs. OverallQual, it is quite obvious that there is a clear correlation between the two.
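For reference, here is a minimal sketch of how such a box plot can be produced with Pandas and Seaborn; the file name train.csv is my assumption for the Kaggle training data.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Kaggle training data (file name assumed).
df = pd.read_csv('train.csv')

# Box plot of the ordinal OverallQual against SalePrice.
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=df, ax=ax)
plt.show()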

My first very clear “DO” for my future endeavours. Thank you Pedro Marcelino!

3. Multivariate Analysis – Identify the variables with the closest relationship to the desired variable

In the House Prices data set, we are interested in how the sale price is affected by various other factors. As such, it made absolute sense to start looking at those relationships first. A neat visualization tool here is the heatmap that the Seaborn library provides. I have been known to use this tool as well, but never to such heights of efficiency!

First things first: create a correlation matrix between all quantitatively measured factors and check which ones show a strong correlation to SalePrice.
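A sketch of that step, reusing the DataFrame df from the snippet above:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations of the numeric columns only.
corrmat = df.select_dtypes(include='number').corr()

fig, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.8, square=True, ax=ax)
plt.show()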


A heatmap of the correlations between the different factors of the House Prices data set.

Let’s make this a little more interactive: have a look at the heatmap yourself first, before reading on, and then compare your observations with the conclusions that Mr. Marcelino drew from it.

TotalBsmtSF and 1stFlrSF, as well as GarageYrBlt, GarageCars and GarageArea, are so highly correlated with one another that it suggests a strong case of multicollinearity.

In addition, at first sight, SalePrice shows a strong correlation to OverallQual (which makes sense, I guess), GrLivArea and TotalBsmtSF.


Looking closer, a top 10 of the variables with the highest correlation to SalePrice gives you even more insight into the relationships. By this alone, you can already discard most of the quantitative factors involved. This will be a first stop in my future exploratory data analyses.
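Here is, roughly, how that zoomed-in, annotated heatmap can be built, with corrmat and df carrying over from the sketches above:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# The ten variables most strongly correlated with SalePrice.
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index

# Annotated heatmap of just those ten variables.
cm = np.corrcoef(df[cols].values.T)
sns.heatmap(cm, annot=True, fmt='.2f', square=True,
            xticklabels=cols.values, yticklabels=cols.values)
plt.show()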

An annotated heatmap showing the ten variables with the highest correlation to SalePrice.

A pair plot (again with the help of Seaborn) then gives you one last visualization tool to check how the relationships manifest themselves. I leave that one for you to do.
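If you want to try it, a sketch might look like this; the column list is illustrative and should come out of the top-10 step above:

import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative shortlist of strong predictors plus SalePrice itself.
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df[cols], height=2.0)
plt.show()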

My next “DO” in my exploratory data analyses.

4. Missing Data – Handle with Care

This one is fairly straightforward and obvious if you think about it. So far, whenever there were null values in a data set, I considered trying to fill them in some form or another, unless it was very obvious that there were more null values than actual values.

The rule of thumb mentioned is: if 15% or more of a variable's values are null, delete the variable entirely. You are more likely to introduce a bias by trying to fill it artificially (with, e.g., the mean of the series).

For anything below that, consider deleting the affected rows instead, unless there is a very clear path on how to fill the gaps.
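A sketch of how that audit can look in Pandas, continuing with the df from above; the 15% threshold is the rule of thumb mentioned here:

import pandas as pd

# Count and share of missing values per column, worst first.
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum() / len(df)).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing.head(20))

# Drop every column that is more than 15% empty ...
df = df.drop(columns=missing[missing['Percent'] > 0.15].index)
# ... and handle the remaining gaps row by row (drop or impute with care).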

“DO” number 3!

5. Outliers

When dealing with outliers, it is crucial to look at the circumstances. In many cases, outliers can seriously distort the data set, e.g. by skewing the mean. In other cases, such as in the credit card fraud department of a bank, it would be outright hazardous to ignore outliers…

Therefore, our author looked at them in detail. He standardized the data first (here: scaled SalePrice so that it has a mean of zero and a standard deviation of 1) and then checked the 10 lowest and 10 highest values.
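A sketch of that standardization step, here done with scikit-learn's StandardScaler (the output shown below is taken from the author's notebook):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Scale SalePrice to mean 0 / standard deviation 1 and inspect the extremes.
saleprice_scaled = StandardScaler().fit_transform(df[['SalePrice']].values)
sorted_vals = np.sort(saleprice_scaled, axis=0)
print('outer range (low) of the distribution:\n', sorted_vals[:10])
print('outer range (high) of the distribution:\n', sorted_vals[-10:])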

outer range (low) of the distribution:
 [-1.6166617 ]

outer range (high) of the distribution:
[[ 3.82758058]
 [ 4.0395221 ]
 [ 4.49473628]
 [ 4.70872962]
 [ 4.728631  ]
 [ 5.06034585]
 [ 5.42191907]
 [ 5.58987866]
 [ 7.10041987]
 [ 7.22629831]]

In this case, the two highest values needed a closer look, but it turns out that they do, in fact, follow the trend, so they remain in the data set.

6. Normality

Last but not least, let’s have a look at normality. Not in the sense that we are trying to lead a normal life, but in the sense of data. Many statistical tests rely on normality, i.e. a bell-shaped curve in the distribution of the data.

Well, in this case, several variables showed skewness in their distribution, including our output variable SalePrice.
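A quick way to check this, assuming the same df as above, is a histogram plus a probability (Q-Q) plot, roughly like this:

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Quantify the skew of SalePrice ...
print('Skewness: %.2f' % df['SalePrice'].skew())

# ... and visualize it: histogram with a density estimate, then a Q-Q plot.
sns.histplot(df['SalePrice'], kde=True)
plt.show()

stats.probplot(df['SalePrice'], plot=plt)
plt.show()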

A histogram of SalePrice showing positive skewness.


A probability plot of SalePrice showing positive skewness.

I have to admit, this one blew my mind a little bit (in as far as a mind can be blown “a little bit”)… and it still does now! This is one of those moments when life is cool! Here you are, looking at suboptimal data, thinking: “Well, p**p!”

But then, not all is lost. There are some very easy operations that can alleviate this little pickle that we are in. Here is what you can do to correct for skewness:

Positive Skewness: Take log10 of each value.

Negative Skewness: Subtract each value from the maximum value of the data set plus 1, then take the log10 of the result.

However, as we all know, this means that there cannot be a value of 0 or smaller in the data set. If there is, shift every value up so that the smallest one becomes 1: e.g. add 1 if the smallest value is 0, or add 3 if the smallest value is -2.
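A sketch of these transformations with NumPy; I use log10 as described above (the natural log works just as well for this purpose), and SomeVar is a hypothetical column for illustration:

import numpy as np

# Positive skew: log-transform directly (SalePrice contains no zeros).
df['SalePrice'] = np.log10(df['SalePrice'])

# Negative skew (hypothetical column 'SomeVar'): reflect around the maximum,
# then log-transform.
# df['SomeVar'] = np.log10(df['SomeVar'].max() + 1 - df['SomeVar'])

# Zeros or negatives present? Shift so the smallest value becomes 1 first.
# df['SomeVar'] = np.log10(df['SomeVar'] - df['SomeVar'].min() + 1)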

Check out the result for the above case.

A histogram of SalePrice after the log10 transformation.

A probability plot of SalePrice after the log10 transformation.


I know that for many of you this is bog standard and not new at all, but for me this is one of those moments that remind me why I studied science, then moved on to do a PhD and in general want to learn more about the world.

So check for skew and fix it if you can. DO!


Summing this article up, Pedro Marcelino gave us an excellent how-to guide for exploratory data analysis and data cleaning in a real-life case study on Kaggle, and I have given you my 6 key takeaways from his guide.

Check it out yourself and let me know in the comments section, if I missed something out. I hope you enjoyed the read. Take care.
