Just another Blog

My Technical, Professional and at times Personal Blog

Showing posts with label Detect Categories. Show all posts

Microsoft BI – Detect Categories using SQL Server 2012 Data Mining Add-In for Excel 2010

In this post I am going to look at the ‘Detect Categories’  table analysis tool using SQL Server 2012 Data Mining Add-In for Excel 2010. The ‘Detect Categories’ tool uses Microsoft’s clustering algorithm to automatically categorize rows in a table that have similar characteristics. This is the simplest way to categorize data in a table and understand its characteristics.

I am going use a data set from the Windows Azure Marketplace called ‘Places of Interests in Singapore’.

image

The file downloaded as a ‘.csv’ and contains about 26,000 rows. For this example, I am going to run the analysis on a subset of this data by creating a sample set.

image

Here are the steps to create a sample. Click on the Data Mining table and choose ‘Sample Data’

image

Take a random sampling:

image

I’ll sample ~30%:

image

Click on ‘Detect Categories’ from under ‘Table Tools’>’Analyze’ menu:

image

In the column list, I’ll choose: company_type, category, status, food_cuisine. Set the number of categories to 4 and Run the analysis. A rule of thumb I use to select columns is choose those columns that have low distinct values (low cardinality). You want to skip columns such as names, addresses, dates etc that have large distinct values.

This creates a categories report:

image

The first category created is for Shopping based on the record count, so I am going to rename ‘Category 1’ to ‘Shopping’. This updates the table below and the Chart.

image

The 2nd category is “Food and Beverage”:

image

The 3rd category is a hodgepodge of different categories so I’ll call it Other and the 4th category is called ‘Beauty and Spa’.

To refresh the chart, please Ctl-Alt-F5

image

The new Category column is also added to the table so you can now sort or filter by category:

image

Google+ Followers