After claiming the easy use of Microsoft data mining tools, it is time to walk you through the process using Microsoft's SQL Server Analysis Services (SSAS) and SQL Server Data Tools.
You do not need an OLAP cube, dimensional data model or data warehouse to start mining data. SSAS' data source view (DSV) is a semantic layer between your physical data source and data mining processes. The DSV allows you to join tables to denormalize your source data and derive new values to transform your data. Yes, the DSV can correct some data sins, but those corrections come with a cost. The DSV extracts data from your physical data source every time you process a data mining model so DO NOT CONNECT YOUR DATA MINING DSV DIRECTLY TO YOUR TRANSACTIONAL DATABASE. Check to see if a separate reporting database or operational data store is available for your data mining efforts.
One way I frequently leverage the DSV is creating a single-column primary key for my data mining cases. Your physical data source probably supports a composite primary key, but SSAS does not. Carefully concatenate your composite primary key columns into a single primary key column in your DSV. Beware how SSAS pretends to support composite primary keys only to later throw errors when mining your data.
Once your DSV is ready, the next step is creating a data model structure. Think of the structure as the superset of data inputs and outputs to be mined. The structure requires a primary key for each case you are mining. Think of a case as a distinct set of all of the available inputs plus the outcomes you plan to predict.
For my examples here, I am using Lahman's Baseball Database to predict which Major League Baseball players are All Stars for a given season. A distinct case equates to the combination of player and season with the Yes/No All Star Flag being my predicted variable.
To create a new model structure, right-click the Mining Structures folder in Solutions Explorer and select New Mining Structure …. As is the Microsoft way, a Data Mining Wizard gets your new mining structure started. Select your Data Source View and specify the case to be analyzed. From the list of available columns, specify the primary key for the case, the input variables and the predicted variables to be mined. Remember, only one column should be chosen as the key. Select many columns as input variables, predicted variables or both, but exclude obviously unnecessary columns to make your data mining process more efficient.
Once you selected your case key and predicted variable, the wizard includes a Suggest button that will quickly test correlations between the predicted variable and available input variables. As with most wizards, beware how you simply accept suggested input variables based on the speedy test sample. You may want additional columns later for filtering and outputting your final results.
Confirm your data types. Continuous and Discrete columns are easily detected, but what are these Cyclical and Discretized options? Cyclical data represents a repeating ordered set. For example, weekly Day 1 follows Day 7 and annual Quarter 1 follows Quarter 4. Think of cyclical data as a continuous cycle of discrete values. Discretized values are easier to understand because SSAS simply creates discrete groups of continuous values. For example, SSAS might automatically divvy continuous values into virtual buckets for more simple data mining operations.
The wizard even sets aside a testing set for later validating predictions. The cases set aside are used later for determining the accuracy of mining model predictions.
On the last screen of the wizard, check "Allow drill through" now so you will be able to see detailed case data later.
With a Mining Structure and a single Mining Model, you could process and view your results now, but what is the probability
of the wizard producing the very best mining model? Even if that model is perfect, you need another model to prove the perfection of the first model. Data mining works best by using different algorithms with various data options and model properties to determine optimal predictions.
Assuming your first model uses Decision Trees, right click in the Mining Models white space, select New Mining Model … and choose the Clustering algorithm. Add a third model this time selecting Naïve Bayes. You probably received a message box stating Naïve Bayes does not support some of the data types found in your mining structure. In Part I, I mentioned some algorithms have the limitations. Naïve Bayes not supporting continuous numeric columns is one of those limitations. To enable the data mining process to execute successfully, those offending columns must be set to Ignore versus Input.
You can mix and match models and algorithms within a single mining structure. You can add multiple versions of the same algorithm then exclude certain Input columns by setting the column to Ignore or add a filter by right-clicking the model and selecting Set Model Filter … to exclude cases from a particular model. Just make sure you give each model a meaningful name for when you start comparing model results.
Now that your mining structure and models are in place, simply click the "Process the mining structure and all its related models" icon in the upper-left corner of your Mining Models tab then click Yes to deploy your project to SSAS instance and click Run to execute the process. Assuming your connections, settings and data are proper, the data mining process will crank the cases through your mining models.
The length of time it takes to process the mining models can vary from seconds to hours depending on the number of cases, columns and mining models involved.
Do not panic if your completed process reports "Process succeeded with warnings." Many of those warnings are simply suggestions to exclude certain columns with large number of distinct values because those columns are contributing nothing to the final prediction. Exclude those columns from future processing to expedite those future processes without degrading the accuracy of your predictions.
In Part III of this series, I will explain how to review Microsoft's data mining results.