In this post I am going to look at the ‘Detect Categories’ table analysis tool using SQL Server 2012 Data Mining Add-In for Excel 2010. The ‘Detect Categories’ tool uses Microsoft’s clustering algorithm to automatically categorize rows in a table that have similar characteristics. This is the simplest way to categorize data in a table and understand its characteristics.
I am going use a data set from the Windows Azure Marketplace called ‘Places of Interests in Singapore’.
The file downloaded as a ‘.csv’ and contains about 26,000 rows. For this example, I am going to run the analysis on a subset of this data by creating a sample set.
Here are the steps to create a sample. Click on the Data Mining table and choose ‘Sample Data’
Take a random sampling:
I’ll sample ~30%:
Click on ‘Detect Categories’ from under ‘Table Tools’>’Analyze’ menu:
In the column list, I’ll choose: company_type, category, status, food_cuisine. Set the number of categories to 4 and Run the analysis. A rule of thumb I use to select columns is choose those columns that have low distinct values (low cardinality). You want to skip columns such as names, addresses, dates etc that have large distinct values.
This creates a categories report:
The first category created is for Shopping based on the record count, so I am going to rename ‘Category 1’ to ‘Shopping’. This updates the table below and the Chart.
The 2nd category is “Food and Beverage”:
The 3rd category is a hodgepodge of different categories so I’ll call it Other and the 4th category is called ‘Beauty and Spa’.
To refresh the chart, please Ctl-Alt-F5
The new Category column is also added to the table so you can now sort or filter by category: