In this post I am going to look at the ‘Detect Categories’ table analysis tool using SQL Server 2012 Data Mining Add-In for Excel 2010. The ‘Detect Categories’ tool uses Microsoft’s clustering algorithm to automatically categorize rows in a table that have similar characteristics. This is the simplest way to categorize data in a table and understand its characteristics.
I am going use a data set from the Windows Azure Marketplace called ‘Places of Interests in Singapore’.
The file downloaded as a ‘.csv’ and contains about 26,000 rows. For this example, I am going to run the analysis on a subset of this data by creating a sample set.
Here are the steps to create a sample. Click on the Data Mining table and choose ‘Sample Data’
Take a random sampling:
I’ll sample ~30%:
Click on ‘Detect Categories’ from under ‘Table Tools’>’Analyze’ menu:
In the column list, I’ll choose: company_type, category, status, food_cuisine. Set the number of categories to 4 and Run the analysis. A rule of thumb I use to select columns is choose those columns that have low distinct values (low cardinality). You want to skip columns such as names, addresses, dates etc that have large distinct values.
This creates a categories report:
The first category created is for Shopping based on the record count, so I am going to rename ‘Category 1’ to ‘Shopping’. This updates the table below and the Chart.
The 2nd category is “Food and Beverage”:
The 3rd category is a hodgepodge of different categories so I’ll call it Other and the 4th category is called ‘Beauty and Spa’.
To refresh the chart, please Ctl-Alt-F5
The new Category column is also added to the table so you can now sort or filter by category:
In this post, I am going to look at how Analyze Key Influencers works within the Data Mining 2012 Addin for Excel 2010. Analyze Key Influencers uses Microsoft’s Clustering Data Mining algorithm to analyze a column that contains the desired outcome of results. The algorithm looks at the patterns in the data to determine which of the factors had the strongest influence on that outcome.
For example you could answer:
What are the common characteristics of the customers who are our top purchasers?
What are those key influencing/decision-making factors?
What’s really influencing purchasing? Is it the price? Is it the quality of products? Is it service?
I will use the sample data that came with the Data Mining 2012 Addin. This file contains data sets suitable for experimenting with the Data Mining Add-Ins. The data is extracted from the AdventureWorks sample database.
Here is a snapshot of the raw data:
Note the last column “Purchased Bike” is what we’re going to analyze. This column only contains “Yes” and “No” values.
Under Table Tools, select “Analyze Key Influencers”
This will launch the wizard and choose “Purchased Bike” column.
Click on Choose Columns to be used for analysis otherwise the analysis will be run on all columns.
I will choose: Marital Status, Gender, Cars and Region and Run the analysis. If you streamline your analysis, we can improve the performance and accuracy of this report.
A new report called “Key Influencers Reports for ‘Purchased Bike’” is created. The report graphically displays the probability of the relationship. The results are listed only for the most influential columns and columns that did not influence the outcome is excluded from the report. In our example Gender is excluded even though I had chosen it for analysis.
The result above tells us:
The most influential factor that favored buying a bike is:
- customers did not have Cars
- followed by those customers who were single
- followed by those who had 1 car
- followed by those who lived in the Pacific
The most influential factor that did NOT favor buying a bike is:
- customers had 2 cars
- followed by those customers who were married
- followed by those who had 1 car
- followed by those who lived in the North America
In more complex examples, we could potentially analyze the outcome of a column that had many values.