Twitter feed Download and Analysis Part II: Tweet Management

In my last post[1], I described my approach to downloading tweets from various Twitter channels using Python[2] (version 3.6) and the tweepy library[3]. In this post, I would like to expand on that by looking at how to manage the downloaded tweets. The final objective of this project is to download tweets and then analyse them with machine learning techniques, so managing the tweets is a big part of it. If you want to see the entire code of this mini-project, feel free to have a look at the GitHub page for my project[4], and if you have improvements, please drop me a comment below.

Posted in Data Science, Data Wrangling, Pandas, Tutorial

Twitter feed Download and Analysis Part I: Download of Tweets

During the last weeks and months, I have been reading up a bit more on natural language processing. I wanted to apply my newly learnt skills, but I needed a source of texts to analyze. I chose Twitter thanks to its short texts, well-written API and the tweepy library for Python[1]. This post is Part I of a little journey in which I learn how to obtain tweets, save them and finally analyze them. In this part, we will figure out how to download a Twitter feed.

Posted in Python, Tutorial

Naive Bayes explained naively

To me, reading about the concept of naive Bayes is like following a very logical train of thought. Without really knowing (or caring) whether this is an accurate description, I call something like this a logic chain… and it goes something like this…


Probabilities are given as a number between 0 and 1, where 1 is absolutely certain and 0 is an impossibility.

What is the chance (or probability) that I will watch Netflix on a given weekday? About 0.75.

p(Netflix) = 0.75
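
As a small sketch of how such a probability can be handled in Python: the value 0.75 comes from the text above, while the variable names and the five-weekday calculation are my own illustrative assumptions, not part of the original explanation.

```python
# Probability that I watch Netflix on a given weekday (value from the text).
p_netflix = 0.75

# Complement rule: "it happens" and "it does not happen" must add up to 1,
# so the probability of NOT watching Netflix is whatever is left over.
p_no_netflix = 1 - p_netflix

# Illustrative assumption: with five weekdays, this is the expected
# number of Netflix evenings per working week.
expected_evenings = 5 * p_netflix

print("p(no Netflix) =", p_no_netflix)
print("expected Netflix evenings per week =", expected_evenings)
```

Both results stay between the bounds described above: a probability never drops below 0 or rises above 1.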

Posted in Data Science

Create an automated ML model evaluation report with scikit-learn, matplotlib and python-docx

As my knowledge of machine learning grew, I ended up writing code for several different models several times over. To better judge which model performs best, I wanted an automated ML model evaluation report that gives me a good overview.

I achieved this goal by doing the three following things:

  1. Writing the models for the evaluations with the help of scikit-learn
  2. Creating an overview graph using matplotlib.pyplot
  3. Compiling the report with the help of the python-docx library

Please find the code below. If you have any questions on the code, please read on below the code, where I explain some of the details. In case your question is still not answered, please leave a comment and I will try to answer as soon as I can.

Posted in Data Science, Machine Learning, Python

Kaggle Learnings – Exploratory Data Analysis & Data Cleaning

I am a strong believer in worked examples and case studies. Theory is all well and good, but without applying it to a real use case, it can be quite a pointless exercise. Today on Kaggle, I came across a worked example on exploratory data analysis and data cleaning for the “House Prices: Advanced Regression Techniques” data set by Pedro Marcelino (a PDF download is available here with the kind permission of the author).

His walkthrough of the solution is so well written, and I learnt so many things from it, that I want to spend the remainder of this blog post summing up my key learnings.

Please note that the tools used for the analysis were Python and the associated libraries Pandas and Seaborn. However, it is not necessary to know these to understand the points of the article.

1. Exploratory Analysis and Data Cleaning is the most important part of any analysis

The very first take-home message for me was how important the exploratory analysis and the subsequent data cleaning are. I know this is an obvious one for everyone, but Pedro Marcelino went from 80 factors down to 7 here. Not least of all, he did it using sound reasoning and a look at the information provided.

Unsurprisingly, all the following points revolve around how the author managed to whittle the variables down to only 7 from the aforementioned 80.

Posted in Data Cleaning, Data Science, Data Wrangling, Machine Learning

Linear Regression in a Nutshell

Explaining linear regression using the ordinary least squares method appears to be a bit of a rite of passage in data science, judging by the number of entries one can find on the web. True enough, it has the same feel to it as the first “Hello World!” script in the latest, newest programming language (I put a list of entries that I found on the topic below the sources for this article).

Whenever you want to describe the relationship between one variable and another (or several others) and subsequently predict other outcomes, linear regression is one of the most important tools to be used.
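
To make the idea concrete, here is a minimal sketch of ordinary least squares for a single explanatory variable, using only the Python standard library. The function name `fit_line` and the sample numbers are made up for illustration; the slope is the textbook closed-form solution (covariance of x and y divided by the variance of x).

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations of x and y from their means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared deviations of x from its mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up sample data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)

# Use the fitted relationship to predict an outcome for an unseen x.
prediction = slope * 6 + intercept
print(slope, intercept, prediction)
```

The last two lines cover both halves of the sentence above: the slope and intercept describe the relationship, and plugging in a new x value predicts another outcome.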

Posted in Data Science, Machine Learning, Theory

How to create a stacked barchart with python and matplotlib

Data is very often extracted and analysed by people who will not necessarily take decisions based on it. This typically means that you, the data analyst, need to present the data in a clear and concise manner to the people who will act on it. Unsurprisingly, the visualization of data in graphs is an essential skill here, no matter what you are trying to show. In this matplotlib data visualization tutorial, I want to explore the creation of a bar chart with the help of the Pandas library, Matplotlib and the Jupyter notebook.

I previously looked at extracting World Bank data from a CSV file using Python and the Pandas library, and I will use this extracted information to create a stacked bar chart.

Stacked bar charts are useful when comparing a total with its parts, as Smashing Magazine put it so aptly. We will use one here to look at some real-world data and visualise how the demographics of Germany have changed over time.

Posted in Data Science, Pandas, Tutorial

Review: Pandas .loc vs. iloc

When you use Python (3.6.2) for data analysis, the Pandas library (0.20.3) is typically used to navigate efficiently through your datasets. You select single values, slice the datasets by row or column or transfer a subset of data to a different variable.

When you do that, the chances are very high that you will be using Pandas’ .loc[] and .iloc[] methods*. With this blog post, rather than writing a straightforward tutorial, I reviewed what is already out there and summed it up. I listed the links at the end of this blog post and tried to reference them clearly.
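
As a quick sketch of the difference, the example below builds a tiny DataFrame and selects from it both ways. The column names, index labels and values are made up for illustration: `.loc[]` selects by label, while `.iloc[]` selects by integer position.

```python
import pandas as pd

# Made-up data with string labels as the row index.
df = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.5, 2.5, 3.5]},
    index=["a", "b", "c"],
)

# Single value: row label "b", column label "x".
by_label = df.loc["b", "x"]

# The same cell by position: second row, first column.
by_position = df.iloc[1, 0]

# Label slices include the end label, so this returns rows "a" and "b".
label_slice = df.loc["a":"b"]

# Position slices exclude the end position, so this returns row "a" only.
position_slice = df.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

The slicing behaviour is the part that tends to surprise people: the same-looking slice returns a different number of rows depending on which of the two you use.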

Posted in Data Science, Python, Tutorial

Tutorial: Extracting World Bank Data from CSV using Python and Pandas

One of the most important activities in data science is the extraction and collection of data, followed by its transformation into formats that can be analysed and interpreted.

In this post, I will use the Python programming language (Version 3.6.2) and the Pandas library to extract and convert data that the World Bank website provides in the CSV format. For the sake of simplicity, I use Jupyter, so if some of the commands do not work, maybe try to search for differences between Jupyter and other IDEs in the search engine of your choice.
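
To give a rough idea of what "extract and convert" can look like, here is a minimal sketch. It assumes a CSV with one column per year; the inline sample, its column names and its values are invented placeholders, not real World Bank data, and a real download may need extra options (for example skipping metadata rows).

```python
import io

import pandas as pd

# Invented placeholder CSV standing in for a downloaded file.
raw = io.StringIO(
    "Country Name,Country Code,2015,2016\n"
    "Exampleland,EXL,1.0,2.0\n"
    "Samplestan,SMP,3.0,4.0\n"
)

# Extract: read the CSV into a DataFrame (one row per country).
wide = pd.read_csv(raw)

# Convert: reshape from one column per year to one row per country and year.
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Value",
)

# The year labels come from the column headers, so they start out as strings.
long["Year"] = long["Year"].astype(int)

print(long)
```

The long format, with one observation per row, is usually easier to filter, group and plot than the wide layout.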

Posted in Data Science, Python, Tutorial

Where to find useful sample data sets to practice with?

In order to become good at anything, there is one thing that you need to do and that is to practice, practice… and then practice some more. When that something is data analysis, however, you actually need data sets to try stuff out.

From my two previous posts, you may have seen that I randomly generated data just to do Pivot tables and charts. The purpose there, however, was just to go through the process of creating the visualizations. Now when you want to go a little deeper into identifying trends and using statistics, what you actually need is sample data.

While I was upskilling myself, I ran into this roadblock repeatedly, so I decided to dedicate some time to researching where one can find suitable data.

The cool thing here: the governments of this world and other organisations have dedicated departments for this purpose, and their data are very often openly available! They may not always be in exactly the right format, but hey, converting data is also part of the job.

Below, I list a range of websites that I found, where you can find very interesting datasets to play with. If you are not from the UK or the US, I am sure similar institutions exist in your part of the world.

Posted in Data Science