April 12, 2018
The Machine Learning Primer
Machine learning has become a buzzword in the business space as advances in technology and understanding have made wide-scale implementation more feasible. The data science field is expected to see a higher compound annual growth rate (CAGR) than other areas within business intelligence, and Deloitte Global has predicted that machine learning based "pilots and implementations will double from 2017 to 2018, and again by 2020." As machine learning continues to permeate the modern business and IT spaces, it is increasingly necessary to understand what it is, where it fits within the data science lifecycle, and how implementations differ.
What is Machine Learning?
Machine learning evolved from mid-20th-century attempts to create artificial intelligence, but high data storage and computational requirements impeded widespread adoption. In recent years, however, the low cost of data storage and analysis resources has fostered an explosion of machine learning and other data science implementations across the business space.
At its core, machine learning (often abbreviated as 'ML') is an algorithm-based data analysis technique whose accuracy and predictive power generally improve as the amount of relevant data grows. By comparison, non-ML techniques cannot adapt to changes in the data and can quickly become out of date. This adaptability can give machine learning implementations longer lifecycles and higher accuracy than other analysis techniques.
Within the data science process, machine learning is one technique that can be applied in the analysis step. By this point, the business unit has identified a problem; the data engineering team has gathered, organized, and cleansed a relevant dataset; and the data science team has familiarized itself with the dataset's structure and contents through rudimentary data exploration. The data scientist is then able to perform deeper exploratory analysis to tease out new insights or underlying trends. From there, the team can begin experimenting with the models and algorithms that will be part of the final implementation.
Supervised vs. Unsupervised Learning
Machine learning has applications in both the exploratory analysis and the model selection/implementation steps of the data science lifecycle. These applications split along the two main branches of ML: supervised and unsupervised learning.
Supervised learning requires both an input and an output structure, where some set of inputs directly results in the output, commonly expressed as y = f(x). In this type of machine learning, the training set functions as a 'teacher,' showing the algorithm a set of inputs together with the 'true' output value for each, so that it has something to check against. The algorithm is meant to extract underlying trends from the training set and then apply these lessons to a test set, where its predictions are checked against the known true values. The data scientist can adjust the inputs and weights of the model to achieve the highest rate of success. Once the model has reached a previously determined accuracy level, it is ready for implementation.
Comment: This plot shows a classification algorithm that takes an animal's mean daily temperature and steps per day (in thousands) to determine whether the animal is a sheep or a goat.
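To make the train-then-test loop concrete, here is a minimal sketch in plain Python. It is not the model behind the plot: the readings are randomly generated, the sheep and goat averages are invented for illustration, and the nearest-centroid rule is just one simple classifier chosen as an assumption.

```python
import random


def make_animals(n, seed):
    """Generate made-up ((temperature, steps), label) readings."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        if rng.random() < 0.5:
            # Hypothetical sheep: cooler mean temperature, fewer steps.
            rows.append(((rng.gauss(10.0, 2.0), rng.gauss(4.0, 1.0)), "sheep"))
        else:
            # Hypothetical goat: warmer mean temperature, more steps.
            rows.append(((rng.gauss(20.0, 2.0), rng.gauss(9.0, 1.0)), "goat"))
    return rows


def fit_centroids(rows):
    """Learn one average point (centroid) per label from the training set."""
    sums, counts = {}, {}
    for (temp, steps), label in rows:
        t, s = sums.get(label, (0.0, 0.0))
        sums[label] = (t + temp, s + steps)
        counts[label] = counts.get(label, 0) + 1
    return {
        label: (t / counts[label], s / counts[label])
        for label, (t, s) in sums.items()
    }


def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def predict(centroids, point):
    """Assign the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: squared_distance(point, centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in rows)
    return hits / len(rows)


data = make_animals(200, seed=0)
train, test = data[:150], data[150:]  # the 'teacher' set and the held-out check

model = fit_centroids(train)
test_accuracy = accuracy(model, test)
print(f"test accuracy: {test_accuracy:.2f}")
```

The model only ever sees the true labels of the training rows; the test rows are held back so the accuracy figure reflects how well the learned trend carries over to data it has not seen.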
Supervised learning lends itself best to regression and classification problems, as the y = f(x) structure of the data suggests. These methods take in a series of data points and produce a final measurable value. For example, in a regression problem, the model would predict a car's miles per gallon from the characteristics of the car. In a classification problem, the model would determine whether an animal is a goat or a sheep. The advantage of machine learning is that, as more data is introduced and analyzed, the model generally becomes more accurate in predicting a car's miles per gallon or whether an animal is a goat or a sheep.
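As a rough illustration of the regression case, the sketch below fits a straight line by ordinary least squares. The weights and MPG figures are made up for the example, and using a single input (car weight) is a simplifying assumption; a real model would likely use several characteristics.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: car weight (thousands of pounds) and miles per gallon.
weights = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
mpg = [36.0, 32.5, 28.0, 25.5, 21.0, 18.0]

slope, intercept = fit_line(weights, mpg)

# Predict MPG for a hypothetical car that was not in the training data.
predicted_mpg = slope * 3.2 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted_mpg:.1f}")
```

Here the inputs x are the weights, the output y is the MPG, and the fitted line plays the role of f in y = f(x).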
Unsupervised learning doesn't require an output structure, only an input structure. This type of analysis is geared towards exploring underlying trends in the data or the data structure, instead of providing a single relevant metric or answer. In unsupervised learning, the algorithms use metrics that do not correspond to a specific real-world quantity, unlike the MPG prediction or the sheep-versus-goat classification from the supervised examples. Instead, the data points are plotted together, and their positions relative to one another indicate underlying trends and associations. This encompasses techniques like k-means clustering and principal component analysis, which compare points across the dataset to show where they sit in relation to each other.
Comment: k-means clustering can distinguish underlying groups across composite variables x1 and x2, showing previously unknown relationships between subsets of data.
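For readers who want to see the mechanics, here is a minimal k-means sketch in plain Python. The two groups of (x1, x2) points are randomly generated for illustration, and the fixed iteration count and random starting centroids are simplifying assumptions rather than a tuned implementation.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Made-up data: two unlabeled groups of (x1, x2) points.
rng = random.Random(1)
group_a = [(rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)) for _ in range(30)]
group_b = [(rng.gauss(6.0, 0.5), rng.gauss(6.0, 0.5)) for _ in range(30)]
points = group_a + group_b

centroids, labels = kmeans(points, k=2)
print(centroids)
```

Note that the algorithm is never told which group a point came from. The cluster numbers it returns are arbitrary identifiers, so it is up to the analyst to interpret what each cluster means.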
Many problems fall somewhere between supervised and unsupervised machine learning (often called semi-supervised learning), because unlabeled data is easier to collect, store, and maintain, but supervised learning can produce more valuable insights. One of the easiest examples to understand, used by Machine Learning Mastery, is a mix of labeled and unlabeled pictures, where the process of labeling pictures is time-consuming. In this case, it is possible to use the labeled data as a 'training set' to categorize the unlabeled 'test set.'
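One simple way to picture this is to give each unlabeled item the label of its closest labeled neighbor. The sketch below does exactly that; the two-number feature vectors standing in for pictures and the sheep/goat labels are hypothetical, and nearest-neighbor labeling is only one possible approach, not necessarily the one the Machine Learning Mastery example uses.

```python
def nearest_label(labeled, point):
    """Return the label of the labeled example closest to `point`."""
    closest = min(
        labeled,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )
    return closest[1]


# Hypothetical feature vectors standing in for pictures; only a few carry labels.
labeled = [
    ((0.9, 0.1), "sheep"),
    ((0.8, 0.2), "sheep"),
    ((0.1, 0.9), "goat"),
    ((0.2, 0.7), "goat"),
]
unlabeled = [(0.85, 0.15), (0.15, 0.8), (0.7, 0.3)]

# Use the small labeled set to categorize the larger unlabeled set.
pseudo_labels = [nearest_label(labeled, p) for p in unlabeled]
print(list(zip(unlabeled, pseudo_labels)))
```

Labels assigned this way are guesses rather than ground truth, so in practice they would usually be spot-checked before being trusted.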
Looking Towards Our Innovation Challenge
With machine learning being a hot item in the business space, it is tempting to apply it to every facet of your business model. However, as Ben Harden discussed in his CapTech blog about data science, successfully implementing data science solutions that add real value requires deep technical knowledge and an intimate understanding of the business space involved. Stepping into the machine learning space without a proper understanding of the problem and the technical know-how to implement your vision correctly can steer your business in the wrong direction.
In the CapTech internal Innovation Challenge, numerous teams implemented machine learning solutions to problems they identified. Our post detailing the Innovation Challenge covers the different implementations and the lessons learned by our own internal teams.