You Don't Need a Ph.D. to Learn Machine Learning


Surprise! Data science doesn’t have to be impossible.

What I learned: data scientists regularly help each other by creating – and re-using – publicly available algorithms. This clearly makes their lives easier, as they can lean on tried-and-true solutions without having to reinvent the wheel each time they manipulate datasets.

Wow. And here I thought every data scientist came up with his or her own algorithm every day. Apparently, this is not the case.

I learned this after attending two recent machine learning Meetups. At the Machine Learning Hackathon I attended back in February (yes, I should have blogged about it back then, because it was so awesome), facilitator Luc Castera actually gave us a list of algorithms to choose from. I was stunned, thinking, ‘Why would a data scientist, after working so hard, make these publicly available?’ I actually thought that Luc was just trying to be nice to us, since so many of us in the group were newbies to machine learning.

The assignment that night: predict the survivors of the Titanic. ‘WTH?’ I thought. Then Luc explained that it was part of a public competition on Kaggle, the competitive data science portal, which was brilliant, even though my team ended up losing.

My team selected a Node.js implementation of a decision tree, which uses the ID3 algorithm.
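For the curious, here is a rough idea of what ID3 does under the hood: at each step it splits the data on whichever attribute gives the highest information gain (the biggest drop in entropy), then repeats on each branch. The sketch below is in Python rather than Node.js and is not the library my team used. The rows, the column names (sex, pclass, survived) and the function names are all invented for illustration.

```python
import math
from collections import Counter

# Invented toy rows, loosely inspired by the Titanic task (not real data).
rows = [
    {"sex": "female", "pclass": "1", "survived": "yes"},
    {"sex": "female", "pclass": "2", "survived": "yes"},
    {"sex": "female", "pclass": "3", "survived": "yes"},
    {"sex": "male", "pclass": "1", "survived": "no"},
    {"sex": "male", "pclass": "2", "survived": "no"},
    {"sex": "male", "pclass": "3", "survived": "no"},
]


def entropy(rows, target):
    """Shannon entropy (in bits) of the target labels in rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    total = len(rows)
    remainder = sum(len(g) / total * entropy(g, target) for g in groups.values())
    return entropy(rows, target) - remainder


def best_attribute(rows, attributes, target):
    """The attribute ID3 would split on next: the one with the highest gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


def build_tree(rows, attributes, target):
    """Recursively build a tree as nested dicts; leaves are plain labels."""
    labels = Counter(row[target] for row in rows)
    # Stop when every row agrees or there is nothing left to split on.
    if len(labels) == 1 or not attributes:
        return labels.most_common(1)[0][0]
    best = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}


tree = build_tree(rows, ["sex", "pclass"], "survived")
```

On these made-up rows, sex separates the two outcomes perfectly while pclass tells you nothing, so the tree splits on sex and stops. Real data is never that tidy, which is where the competition gets interesting.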

The Kaggle competition.

A similar situation occurred this week when I attended the Fort Lauderdale Machine Learning Meetup, ‘Zero to Hero in One Session: A Hands-On Approach to R Programming,’ led by data scientist and R evangelist Pierre Lafortune. (Yes, he is in the top 0.41% on Stack Overflow this year.)

The hands-on session had the group download RStudio and get an introduction to it, which was a relief, since the Intro to R course I went through on DataCamp never actually explained where or how an R programmer programs in R.

Pierre provided all of the datasets and instructions beforehand, then walked us through, step by step, how to clean, organize, merge, and visualize the data and, most importantly, how to use algorithms to predict the values that were missing.
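To make that workflow a little more concrete, here is a minimal sketch of the clean, merge, and fill-in-the-blanks steps. It is in Python rather than R, it skips the visualization step, and it is not Pierre's material: the records, field names, and functions are all invented, and the "prediction" is just a per-group median standing in for a real algorithm.

```python
from statistics import median

# Invented raw records with typical messiness: stray spaces, a missing age.
passengers = [
    {"id": "1", "name": " Ann ", "age": "29"},
    {"id": "2", "name": "Bob", "age": ""},
    {"id": "3", "name": "Cara ", "age": "41"},
    {"id": "4", "name": "Dev", "age": "35"},
]
tickets = [
    {"id": "1", "pclass": "1"},
    {"id": "2", "pclass": "1"},
    {"id": "3", "pclass": "1"},
    {"id": "4", "pclass": "3"},
]


def clean(record):
    """Trim whitespace, convert types, and turn blank ages into None."""
    return {
        "id": int(record["id"]),
        "name": record["name"].strip(),
        "age": float(record["age"]) if record["age"].strip() else None,
    }


def merge(left, right, key):
    """Inner-join cleaned passengers with raw ticket rows on the given key."""
    lookup = {int(r[key]): r for r in right}
    merged = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            merged.append({**row, "pclass": int(match["pclass"])})
    return merged


def fill_missing_age(rows):
    """Baseline guess: the median age of passengers in the same class.

    Assumes at least one row has a known age.
    """
    known = {}
    for row in rows:
        if row["age"] is not None:
            known.setdefault(row["pclass"], []).append(row["age"])
    overall = median(a for ages in known.values() for a in ages)
    filled = []
    for row in rows:
        age = row["age"]
        if age is None:
            group = known.get(row["pclass"])
            age = median(group) if group else overall
        filled.append({**row, "age": age})
    return filled


cleaned = [clean(p) for p in passengers]
merged = merge(cleaned, tickets, "id")
completed = fill_missing_age(merged)
```

A proper model would look at more than one column to make its guess, but the shape of the work is the same: tidy the inputs, join them, then estimate what is missing.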

And you guessed it: he provided the algorithms for us.

One was Random Forest, which handles both classification and regression. It was easily added via the Install function in RStudio.

I suppose I should not be surprised that data scientists use and re-use already-published algorithms: the open source community boasts Ruby gems and JavaScript libraries numbering in the thousands.

As a result of the relative friendliness and approachability of the facilitators of these two machine learning events, I have taken a much stronger interest in learning data science. I’m not yet ready to attend a bootcamp, such as those offered by Galvanize or Metis, but I definitely plan to strengthen my R and Python skills.

Sadly, the Intro to Python course I took on Treehouse has been deprecated and I will now need to re-learn Python. Does anyone have any suggestions for a good Python course, with emphasis on its usage in data science?

I think another reason I am gravitating towards data science is that the discipline is actually doing something useful: answering questions, solving problems, visualizing numbers, making decisions, and simplifying what is seemingly complex. And isn’t that what software development is supposed to do for the world?