An Introduction to Machine Learning in Apache Spark
Over a year ago, a few friends, fellow engineers, and I decided to form a Kaggle competition team, using the challenges to learn how to develop systems with Apache Spark. Most of us knew that it would be a valuable, modern framework to learn, and I was one of the few at the time with previous experience in Spark. So I thought it would be a good idea to create a tutorial for the group on how to use the latest and greatest version (1.6 at the time) to solve challenging ML problems.
A year later, the tutorial remains unfinished, but given how widely Spark has since spread in the realm of data processing, I think it is still worth sharing with the online community. I did not have the time to complete the “Feature Engineering” section of this machine learning tutorial, but it does cover the important steps of importing data, data exploration, and data cleaning. Maybe someday there will be time to update it for Spark 2.0 and add a fully completed ML model. That day is not today.