A Foray into R, ggplot2, MariaDB

March 25, 2017

This post will not be useful to anyone but me, but I wanted to do a little project to get more familiar with R (especially ggplot2), MariaDB, and git, and to see how they can all connect.

I’ve decided to use my new Fedora Scientific box to load some World Bank population data into a local database, and then make some pretty graphs out of it. I’ve written this post to record my steps.

Before I start, I want to list the following links which I found especially useful:

Now, I’m reasonably fluent in Matlab, so my first impression of R has been something like that of a Spanish speaker trying to speak Italian. So, be patient with me.

My first step was to write an R script to load the csv files into my local SQL database. The code for that is here, in this GitHub repository.
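For anyone following along without the repository open, the loading step could look roughly like the sketch below. It assumes the World Bank CSV uses the usual wide layout (one column per year, with a few metadata lines above the header) and goes through the DBI interface. The helper names `wide_to_long` and `load_population`, the `population` table, and the column names are placeholders for illustration, not necessarily what the actual script uses.

```r
# Sketch only: helper, table, and column names are illustrative assumptions.

# Turn a wide table (one "X<year>" column per year, which is how read.csv
# names columns such as "1960") into long form: one row per country and year.
wide_to_long <- function(wide) {
  year_cols <- grep("^X[0-9]{4}$", names(wide), value = TRUE)
  pieces <- lapply(year_cols, function(col) {
    data.frame(
      country_code = wide[["Country.Code"]],
      year         = as.integer(sub("^X", "", col)),
      population   = as.numeric(wide[[col]]),
      stringsAsFactors = FALSE
    )
  })
  long <- do.call(rbind, pieces)
  long[!is.na(long$population), ]  # drop country/year pairs with no value
}

# Write the long table through any DBI connection (MariaDB, SQLite, ...).
load_population <- function(con, long, table = "population") {
  DBI::dbWriteTable(con, table, long, overwrite = TRUE)
}

# Example wiring, commented out because it needs a running MariaDB server
# and real credentials (every value below is a placeholder):
# con  <- DBI::dbConnect(RMariaDB::MariaDB(), dbname = "worldbank",
#                        username = "me", password = "secret")
# wide <- read.csv("population.csv", skip = 4, stringsAsFactors = FALSE)
# load_population(con, wide_to_long(wide))
# DBI::dbDisconnect(con)
```

Keeping the reshaping separate from the database write means the reshaping can be checked on a small data frame before anything touches the server.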

Git was quite an experience for me, since I have only ever used Perforce for version control. The basic stuff was simple enough, but I heard somewhere that branches are important, and once I started messing with branches I quickly found myself in Git Hell. So for now I’ll just stick to add, commit, push and pull.

You will notice my tables are a bit complicated in SQL — I am using four of them to store this data. I tried to structure it so it’s easy for me to add new features later. It also makes it easy to write loading functions that can arrange the data by any dimension. For example:

So calling get_worldBank_region_population("GEO",2015) would produce a list of the world’s geographic regions and their total populations as of 2015.
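A loading function of this kind might be sketched as follows. Everything about the schema here is a guess: the tables `population`, `region`, and `country_region`, their columns, and the extra `con` argument are all assumptions made for illustration, and the sketch uses its own name so it is not mistaken for the real `get_worldBank_region_population`.

```r
# Sketch only: the schema below is hypothetical, not the one in the repository.
#   population(country_code, year, population)
#   country_region(country_code, region_id)
#   region(region_id, region_type, region_name)
get_region_population_sketch <- function(con, region_type, year) {
  sql <- "
    SELECT r.region_name, SUM(p.population) AS total_population
    FROM population p
    JOIN country_region cr ON cr.country_code = p.country_code
    JOIN region r ON r.region_id = cr.region_id
    WHERE r.region_type = ? AND p.year = ?
    GROUP BY r.region_name
    ORDER BY total_population DESC"
  # Values are passed as query parameters rather than pasted into the SQL.
  DBI::dbGetQuery(con, sql, params = list(region_type, year))
}
```

With a layout like this, switching from geographic to income regions is only a change of the `region_type` argument, which is the kind of flexibility the extra tables buy.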

Now that I had a way to bring down my data easily, I decided to begin exploring ggplot2:

Instantly generates this graph, for any region type, and for any year:

Pie graph by income regions, 2015
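For reference, a pie chart like this one can be built in ggplot2 by drawing a single stacked bar and switching to polar coordinates. The sketch below is a minimal version under my own assumptions: the function name `plot_pie_byRegion_sketch` and the column names `region_name` and `total_population` are hypothetical, and the data are toy values rather than World Bank figures.

```r
library(ggplot2)

# Sketch only: expects a data frame with the (assumed) columns
# region_name and total_population, e.g. the result of a region query.
plot_pie_byRegion_sketch <- function(regions, title = NULL) {
  ggplot(regions, aes(x = "", y = total_population, fill = region_name)) +
    geom_col(width = 1) +        # one stacked bar...
    coord_polar(theta = "y") +   # ...wrapped around a circle gives a pie
    labs(fill = "Region", title = title) +
    theme_void()                 # hide axes, which mean little on a pie
}

# Toy numbers for illustration, not World Bank figures.
toy_regions <- data.frame(region_name = c("Low", "Middle", "High"),
                          total_population = c(1, 5, 2),
                          stringsAsFactors = FALSE)
pie_plot <- plot_pie_byRegion_sketch(toy_regions, "Toy pie chart")
# print(pie_plot) draws it; ggsave("pie.png", pie_plot) writes it to disk.
```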

And a quick call to plot_stackArea_histByRegion('GEO',1960:2015):

Instantly produces this neat graphic:

Stacked Area Graph by GEO regions
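A stacked area graph of this kind needs long-format data, one row per region and year, and `geom_area()` stacks the regions by default. The sketch below is only an approximation of the idea: the function name `plot_stackArea_sketch` and the columns `year`, `region_name`, and `total_population` are assumed, and the numbers are toy values.

```r
library(ggplot2)

# Sketch only: expects long-format data with the (assumed) columns
# year, region_name and total_population, one row per region and year.
plot_stackArea_sketch <- function(history) {
  ggplot(history,
         aes(x = year, y = total_population / 1e9, fill = region_name)) +
    geom_area() +   # areas are stacked on top of each other by default
    labs(x = "Year", y = "Population (billions)", fill = "Region")
}

# Toy numbers for illustration, not World Bank figures.
toy_history <- expand.grid(year = 1960:1962,
                           region_name = c("North", "South"),
                           stringsAsFactors = FALSE)
toy_history$total_population <- c(1, 2, 3, 2, 2, 4) * 1e9
area_plot <- plot_stackArea_sketch(toy_history)
# print(area_plot) draws it.
```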

In conclusion, I’ve learned a lot about the tools. I found R to be pretty easy to use (similar enough to Matlab, at least), and I thought the R/SQL connectivity worked fantastically right out of the box. RStudio has so far proved to be a much worse IDE than Matlab, but I guess you get what you pay for. ggplot2 seems like a very powerful plotting tool, but I haven’t thoroughly explored it yet.

As next steps, I want to use this data to explore the mapping capabilities of ggplot2, and maybe explore creating my own R package.